MarkovRNA generates a null distribution of nucleotide sequences by sampling from a k-th order Markov chain. This acts as a form of sequence scrambling that accounts for dependencies between neighboring positions.
The goal is to generate length-matched “null” RNA sequences that preserve local sequence statistics (e.g., base composition and dinucleotide dependencies) while destroying higher-order or long-range structure. Such sequences serve as a background when testing whether real sequences are enriched for functional features such as RNA structure and/or motifs.
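A k-th order Markov chain is defined as follows.
Let a sequence of length $L$ be $x = (x_1, x_2, \dots, x_L)$, where each $x_i \in \{A, C, G, U\}$.
We define the length-$k$ context preceding position $i$ as $c_i = (x_{i-k}, \dots, x_{i-1})$.
The k-th order Markov assumption is that the current base depends only on the previous $k$ bases:

$$P(x_i = b \mid x_1, \dots, x_{i-1}) = P(x_i = b \mid c_i)$$

for any base $b \in \{A, C, G, U\}$.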
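MarkovRNA estimates these conditional probabilities directly from data using maximum likelihood estimation, with optional pseudocount smoothing. New sequences are generated by initializing the first $k$ bases and then sampling each subsequent base from the estimated conditional distribution given the $k$ bases before it.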
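The estimation and sampling steps can be sketched in a few lines of Python. This is a minimal illustration, not MarkovRNA's actual code: the function names `fit_markov` and `sample_sequence`, the `ACGU` alphabet, the uniform fallback for unseen contexts, and seeding from the first `order` bases of a training sequence are all assumptions made for the example. Smoothing here uses one common additive form, $\hat{P}(b \mid c) = \frac{N(c, b) + \alpha}{N(c) + 4\alpha}$, where $N(c, b)$ counts how often base $b$ follows context $c$ and $\alpha$ is the pseudocount.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ACGU"  # assumed RNA alphabet for this sketch


def fit_markov(seqs, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    probs = {}
    for context, ctr in counts.items():
        # Normalize over the alphabet only, so each distribution sums to 1.
        total = sum(ctr[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        probs[context] = {b: (ctr[b] + pseudocount) / total for b in ALPHABET}
    return probs


def sample_sequence(probs, order, length, start, rng):
    """Extend the seed `start` (its first `order` bases) to `length` bases."""
    seq = list(start[:order])
    uniform = {b: 1.0 / len(ALPHABET) for b in ALPHABET}
    while len(seq) < length:
        context = "".join(seq[len(seq) - order:])  # empty string when order == 0
        dist = probs.get(context, uniform)  # unseen context: fall back to uniform
        seq.append(rng.choices(ALPHABET, weights=[dist[b] for b in ALPHABET])[0])
    return "".join(seq[:length])


# Toy usage with made-up sequences (not from the repository's data).
train = ["ACGUACGUAC", "GGCAUCGAUU"]
model = fit_markov(train, order=2)
rng = random.Random(1)
print(sample_sequence(model, 2, len(train[0]), train[0][:2], rng))
```

Setting `pseudocount=0.0` gives the plain maximum likelihood estimate. A positive value keeps every base reachable from every observed context, which matters most at higher orders where many contexts are rare.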
In practice:

- `order = 0` preserves overall base composition (i.i.d. model)
- `order = 1` preserves dinucleotide-conditioned structure
- higher orders preserve progressively longer local sequence dependencies
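One way to check what a given order preserves is to compare k-mer frequencies between real and null sequences: a model of order k should approximately reproduce frequencies of (k+1)-mers, but not necessarily longer ones. The sketch below is a hypothetical helper for that comparison and is not part of MarkovRNA; `kmer_freqs` and `max_abs_diff` are names invented for this example, and the two sequences are toy data.

```python
from collections import Counter


def kmer_freqs(seqs, k):
    """Return relative frequencies of all overlapping k-mers in `seqs`."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def max_abs_diff(freqs_a, freqs_b):
    """Largest absolute frequency difference over all k-mers seen in either set."""
    keys = set(freqs_a) | set(freqs_b)
    return max((abs(freqs_a.get(x, 0.0) - freqs_b.get(x, 0.0)) for x in keys),
               default=0.0)


# Toy data: `other` is `real` reversed, so base composition matches
# while dinucleotide content differs.
real = ["ACGUACGU"]
other = ["UGCAUGCA"]
print(max_abs_diff(kmer_freqs(real, 1), kmer_freqs(other, 1)))
print(max_abs_diff(kmer_freqs(real, 2), kmer_freqs(other, 2)))
```

Applied to real input and generated null sequences, a small difference at k+1 and a larger one at longer k-mers would be consistent with the behavior described in the list above.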
Clone the repo and enter the directory:

    git clone https://github.com/jlfine/MarkovRNA
    cd MarkovRNA

Create and activate a conda environment:

    conda env create -f environment.yml
    conda activate markovrna

One simulation run takes an input sequence file and generates matched 'null' sequences.
Try the example below, which uses test data from Rfam (https://rfam.org/family/RF00001) that is already included in the repo:

    python scripts/generate.py \
    --in_fasta data/RF00001.fa \
    --order 2 \
    --train_frac 0.50 \
    --max_length 200 \
    --seed 1

This reads the RNA sequences from the input FASTA file, generates null sequences under the specified model, and writes the output sequences and summary statistics to `results`.