Skip to content

Simulate nucleic acid sequences with stochastic processes

License

Notifications You must be signed in to change notification settings

jlfine/MarkovRNA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MarkovRNA

MarkovRNA generates a null distribution of nucleotide sequences using a k-th order Markov chain to simulate sequence scrambling, while accounting for non-independence relationships between positions.

The goal is to generate length-matched “null” RNA sequences that preserve local sequence statistics (e.g. base composition, dinucleotide dependencies, etc) while destroying higher-order or long-range structure. This is useful for testing for functional enrichment, i.e., RNA structure and/or motifs.

Background on Markov chains

A $k$ th order Markov chain is a simple generative model for sequences in which the probability of the next symbol depends only on the previous $k$ symbols. In the case of RNA, each symbol is one of the four bases ($A$, $U$, $C$, and $G$).

Let a sequence of length $L$ be denoted by $(X_1, X_2, \dots, X_L)$, where each $X_i \in {{A, C, G, U}}$.

We define the length $k$ context at position $i$ as:

$$ C_i = X_{i-k}, X_{i-k+1}, \dots, X_{i-1}. $$

The k-th order Markov assumption is that the current base depends only on the previous $k$ bases:

$$ \Pr(X_i = b \mid X_1, \dots, X_{i-1})=\Pr(X_i = b \mid C_i), $$

for any base $b \in {A, C, G, U}$ and positions $i > k$.

MarkovRNA estimates these conditional probabilities directly from data using maximum likelihood estimation, with optional pseudocount smoothing. New sequences are generated by initializing the first $k$ bases, following by using a sliding window to generate each next base.

In practice,

  • order = 0 preserves overall base composition (i.i.d. model)
  • order = 1 preserves dinucleotide-conditioned structure
  • higher orders preserve increasingly local sequence dependencies

Setup and implementation

Clone the repo and enter the directory:

git clone https://github.com/jlfine/MarkovRNA  
cd MarkovRNA  

Create and activate a conda environment:

conda env create -f environment.yml  
conda activate markovrna  

One simulation run will take an input sequence file and generate matched 'null' sequences.

Try running the example below, using test data obtained from Rfam: https://rfam.org/family/RF00001 (already present in repo):

python scripts/generate.py \
  --in_fasta data/RF00001.fa \
  --order 2 \
  --train_frac 0.50 \
  --max_length 200 \
  --seed 1

This will read the RNA sequences from the input fasta, then generate null sequences under the specified model, and will write output sequences and summary statistics to results.

About

Simulate nucleic acid sequences with stochastic processes

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages