MarkovRNA

MarkovRNA generates a null distribution of nucleotide sequences using a k-th order Markov chain to simulate sequence scrambling, while accounting for non-independence relationships between positions.

The goal is to generate length-matched “null” RNA sequences that preserve local sequence statistics (e.g. base composition, dinucleotide dependencies, etc) while destroying higher-order or long-range structure. This is useful for testing for functional enrichment, i.e., RNA structure and/or motifs.

Background on Markov chains

A $k$ th order Markov chain is a simple generative model for sequences in which the probability of the next symbol depends only on the previous $k$ symbols. In the case of RNA, each symbol is one of the four bases ($A$, $U$, $C$, and $G$).

Let a sequence of length $L$ be denoted by $(X_1, X_2, \dots, X_L)$, where each $X_i \in {{A, C, G, U}}$.

We define the length $k$ context at position $i$ as:

$$ C_i = X_{i-k}, X_{i-k+1}, \dots, X_{i-1}. $$

The k-th order Markov assumption is that the current base depends only on the previous $k$ bases:

$$ \Pr(X_i = b \mid X_1, \dots, X_{i-1})=\Pr(X_i = b \mid C_i), $$

for any base $b \in {A, C, G, U}$ and positions $i > k$.

MarkovRNA estimates these conditional probabilities directly from data using maximum likelihood estimation, with optional pseudocount smoothing. New sequences are generated by initializing the first $k$ bases, following by using a sliding window to generate each next base.

In practice,

order = 0 preserves overall base composition (i.i.d. model)
order = 1 preserves dinucleotide-conditioned structure
higher orders preserve increasingly local sequence dependencies

Setup and implementation

Clone the repo and enter the directory:

git clone https://github.com/jlfine/MarkovRNA  
cd MarkovRNA

Create and activate a conda environment:

conda env create -f environment.yml  
conda activate markovrna

One simulation run will take an input sequence file and generate matched 'null' sequences.

Try running the example below, using test data obtained from Rfam: https://rfam.org/family/RF00001 (already present in repo):

python scripts/generate.py \
  --in_fasta data/RF00001.fa \
  --order 2 \
  --train_frac 0.50 \
  --max_length 200 \
  --seed 1

This will read the RNA sequences from the input fasta, then generate null sequences under the specified model, and will write output sequences and summary statistics to results.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
data		data
results		results
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MarkovRNA

Background on Markov chains

Setup and implementation

About

Uh oh!

Releases

Packages

Languages

License

jlfine/MarkovRNA

Folders and files

Latest commit

History

Repository files navigation

MarkovRNA

Background on Markov chains

Setup and implementation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages