- CMDTarget (Comparative Modeling based Denoising for sRNA Target prediction) is a computational tool for sRNA target prediction and evolutionary analysis.
- While conceptually similar to CopraRNA, CMDTarget has several distinct features:
- CMDTarget explicitly leverages information of evolutionary conservation, while unlike CopraRNA, it does not force homolog sRNA-mRNA pairs across different species to have the same interaction score by explicitly reconstructing interaction scores of both extant and ancestral sRNA-mRNA pairs.
- CMDTarget includes a module for Hfq binding prediction, and could optionally incorporate predicted Hfq binding scores for sRNA target prediction.
- If you have multiple sRNAs, or a large genome set for comparative analysis, comparative genomics based sRNA target prediction can be computationally intensive. CMDTarget is built with Snakemake and runs efficiently on HPC clusters (Slurm, SGE, etc.).
- We recommend using conda to automatically resolve dependencies
conda env create -f environment.yml
#alternatively, the following command should work
#conda install python snakemake ete3 intarna mafft cd-hit FastTree mmseqs2 hmmer infernal pyfaidx biopython tqdm ypytorch-cuda pytorch -c pytorch -c nvidia -c conda-forge -c bioconda
- IntaRNA: for interaction energy calculation
- snakemake: for pipeline management
- ete3: for tree operation
- scipy: implement L-BFGS-B for optimizing scores in tips and internal nodes, also required by ete3
- cd-hit: for sequence clustering
- mafft: for MSA construction
- FastTree: for 16S rRNA tree building
- mmseqs: for homolog search of sRNA and protein
- pyfaidx: for genome sequence processing
- hmmer: for detection of rpoB marker gene
- infernal: cmsearch for 16S rRNA sequence detection, required is using 16S rRNAs as phylogenetic marker
- pytorch: for Hfq binding score prediction
- The required inputs of CMDTarget include:
- Genome sequences
- Genome annotations in gff format
- Homologous sRNA sequences
- These inputs should be organized as follows:
indir: input directory, should contain the following subdirectories{indir}/fasta: Genome sequence in fasta format, named as '{genome_id}.fa'{indir}/gff: genome annotation in gff format, named as '{genome_id}.gff'{indir}/{hits}/hits.groupped- organized like '{indir}/{hits}/hits.groupped/{genome_id}/{sRNA_id}.fa', each file contains 1 sRNA sequence
- homologous sRNA sequences share the same id
- An easy-to-use script for preparing input data was provided, see below.
- You have multiple ways to curate your genome set for comparative analysis. Here we provide a simple strategy.
- For testing, you can simply use 10 Escherichia genome sequences and annotations in
genomes.tar.gz, including E.coli (GCF_000005845.2) - Suppose you have a genome of interest (called the query genome), you need to select several phylogenetically related genomes for comparative analysis.
- You may start from GTDB genomes of the clade that contains these genomes. You may only keep RefSeq genomes, or genomes with completeness greater than a certain cutoff. You can then get the assembly IDs (starting with "GCF") and save them in a text file.
python scripts/select-GTDB-genomes.py --clade g__Escherichia --output genomes/Escherichia.txt
-
If your query genome(s) are not present in this list, please manually add them
-
Then fetch sequences and annotations of these genomes.
#First download ncbi assembly summary.
wget -P genomes https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
# fetch genomes of interests
mkdir -p genomes/gff genomes/fasta
scripts/fetch-refseq-genomes.py --genome-ids genomes/Escherichia.txt --fasta-directory genomes/fasta --gff-directory genomes/gff --summary genomes/assembly_summary_refseq.txt
- Reformat CDS coordinate from gff to bed
bash run/gff2bed.sh
- Determine which genomes contain sRNA(s) for target prediction.
To do so, first put your sRNAs in a fasta file.
#cat sRNAs.fa
>GCF_000005845.2:RyhB
GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTTTTTTT
>GCF_000005845.2:Spot42
GTAGGGTACAGAGGTAAGATGTTCTATCTTTCAGACCTTTTACTTCACGTAATCGGATTTGGCTGAATATTTTAGCCGCCCCAGTCAGTAATGACTGGGGCGTTTTTTATT
>GCF_000005845.2:CyaR
GCTGAAAAACATAACCCATAAAATGCTAGCTGTACCAGGAACCACCTCCTTAGCCTGTGTAATCTCCCTTACACGGGCTTATTTTTT
Perform the homolog search, and extract hit sequences:
# perform homolog search, homolog sequence saved in genomes/sRNA-hits/hits.fa
scripts/sRNA-homolog-search.py -q sRNAs.fa -gd genomes/fasta -od genomes/sRNA-hits --threads 12 -gi genomes/Escherichia.txt
# group sRNAs by homolog
scripts/group-sRNAs-by-homolog.py -i genomes/sRNA-hits/hits.fa -od genomes/sRNA-hits/hits
# further group sRNA by genomes, make each file contain a single sequence
scripts/split-sRNA-homolog.py -id genomes/sRNA-hits/hits -od genomes/sRNA-hits/hits.groupped
- Extract genome ids with hits
cut -f 2 genomes/sRNA-hits/hits.tsv | cut -f 1 -d ':' | sort -u > genomes/Escherichia.with.hits.txt
- Here is an example config file in json format:
#example.json
{ "indir": "genomes",
"outdir": "output/Enterobacteriaceae",
"hits": "sRNA-hits",
"genome-set-name": "Escherichia",
"genome-ids": "genomes/Escherichia.with.hits.txt",
"query-ids": ["GCF_000005845.2"],
"sRNA-ids": "sRNA-ids.txt",
"marker": "rpoB",
"weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
"denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
"steps": ["marker detection","single species scoring","comparative denoising"]}
genome-ids:- A text file specifying the genomes used in comparative analysis.
- One genome id per line
- For all genome_id with '{indir}/fasta/{genome_id}.fa', '{indir}/gff/{genome_id}.gff' present, only those specified in this file are used
query-ids:- A list specifying the query genome(s). The genome should be present in
genome-ids
- A list specifying the query genome(s). The genome should be present in
sRNA-ids:- A text file specifying sRNAs to consider
- For all sRNAs with
{indir}/{hits}/hits.groupped/{genome_id}/{sRNA_id}.fapresent, only those present in this file are considered
marker:- Phylogenetic marker for comparative analysis
weights: weights of energy scores and hfq scoresdenoise: parameter for denoisingsteps: steps to run. one or more steps can be specified.- "marker detection": detect phylogenetic marker
- "single species scoring": scoring binding using single species mode
- "comparative denoising": run comparative denoising. As this step depends on "marker detection" and "single species scoring", if you only specify this step, the other two steps will be called
- Once the data and config file are ready, the comparative analysis is fully automatic
- We recommend first running the "marker detection" step to make sure the marker gene is present in genomes that will be used for downstream analysis
- example.marker.detection.json
{ "indir": "genomes",
"outdir": "output/Enterobacteriaceae",
"hits": "sRNA-hits",
"genome-set-name": "Escherichia",
"genome-ids": "genomes/Escherichia.with.hits.txt",
"query-ids": ["GCF_000005845.2"],
"sRNA-ids": "sRNA-ids.txt",
"marker": "rpoB",
"weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
"denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
"steps": ["marker detection"]}
snakemake --configfile example.marker.detection.json --jobs 16
- Marker genes were extracted to
{indir}/{marker}/{genome_id}.fa
- Define the genome set to use for downstream analysis
- You may keep all genomes with marker genes
ls genomes/rpoB/ | grep '.fa$' | sed 's/.fa$//' > genome-ids.txt
- When you have many genomes, up to 8 genomes with <0.95 sequence identity is generally sufficient
# combine marker sequences
scripts/combine-fasta.py -i genomes/rpoB -o genomes/Escherichia.rpoB.fa -gi genomes/Escherichia.with.hits.txt
# get sequence identity to query marker, the results are saved in Escherichia.sim2query/scores.txt
# here we use 0.99 sequence identity cutoff as an example to reserve enough genomes
# if you have diverse genomes, --cutoff 0.95 is a better default
scripts/select-genome-by-divergence.py -i genomes/Escherichia.rpoB.fa --query-ids GCF_000005845.2 --cutoff 0.99 -od Escherichia.sim2query/
# prepare refined genome ids
cut -f 1 -d ':' Escherichia.sim2query/scores.txt > genomes/genome-ids.refined.txt
- Refine your genome set by modifying "genome-ids" in config file
example.comp.scoring.jsonfor interaction prediction, then run single genome scoring and comparative analysis
{ "indir": "genomes",
"outdir": "output/Enterobacteriaceae",
"hits": "sRNA-hits",
"genome-set-name": "ent9", # an alias genome set used for comparative analysis
"genome-ids": "genomes/genome-ids.refined.txt",
"query-ids": ["GCF_000005845.2"], # genome for sRNA target prediction, multiple genomes can be provided
"sRNA-ids": "sRNA-ids.txt",
"marker": "rpoB",
"weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
"denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
"steps": ["single species scoring","comparative denoising"]}
#--resources gpu=1 restrict jobs for runing Hfq score prediction
snakemake --configfile example.comp.scoring.json --jobs 16 --resources gpu=1
- The final result after comparative denoising:
{outdir}/comparative-analysis-denoised-ml/wt.hfq--{genome_set}/{params}/{sRNA_id}.txt. The default "params" is "pair.zscore-0.7_hfq.zscore-0.3_marker-rpoB_srm-10_nvm-10_rescale-0_reroot-1"
- This is the main results of comparative denoising pipeline
- This table contains 5 fields.
- query id: protein id in the query genome, which were used for grouping the records
- target id: indicates the record is the interaction score between which gene and the sRNA considered
- raw: the score before comparative denoising
- denoised: the score after comparative denoising
- tree: the phylogenetic tree with reconstructed denoised interaction scores. Only records with 'query id' = 'target id' contain this field.
- (target id, denoised) is the denoised interaction scores between sRNA and targets in all genomes
- If you are only interested in sRNA-target interactions in the query genome, you can directly use (target id, denoised) of rows with 'query id' = 'target id'
- Raw hfq scores:
{indir}/hfq
- This directory contains hfq binding scores of the leader sequence (the -200 to +100 nt region relative to the start codon) of each gene
{indir}/hfq/{asm_id}.txtcontains three columns:- gene id: ID of each gene, formatted like 'NP_414542.1-b0001-thrL', that is '{refseq_id}-{locus_id}-{gene_name}'
- score: raw logits of the prediction, higher indicates stronger binding
- zscore: normalized binding scores, higher indicates stronger binding
- Raw binding energy:
{outdir}/{asm_id}/energies
- This directory contains binding energies of an sRNA and leaders of different genes within a genome
{outdir}/{asm_id}/energies/{sRNA_id}.txtcontains 8 columns:- sRNA id: ID of the sRNA
- sstart: start of the predicted binding site at sRNA side
- send: end of the predicted binding site at sRNA side
- target id: ID of target gene (named as '{refseq_id}-{locus_id}-{gene_name}')
- tstart: start of the predicted binding site at mRNA side, the coordinate is relative to the (-200,+100) leader sequence
- tend: end of the predicted binding site at mRNA side
- energy: IntaRNA predicted raw energy
- Single species combined scores by genome:
{outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt
{outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txtcontains 13 columns:- protein id: ID of target protein, named as '{refseq_id}-{locus_id}-{gene_name}'
- sstart: start of the predicted binding site at sRNA side
- send: end of the predicted binding site at sRNA side
- target id: same as 'protein id', can be omitted
- tstart: start of the predicted binding site at mRNA side, the coordinate is relative to the (-200,+100) leader sequence
- tend: end of the predicted binding site at mRNA side
- energy: raw binding energies
- sRNA pvalue: P-values of the interaction, using all genes as background for normalization
- sRNA zscore: Z-scores of the interaction, using all genes as background for normalization
- pair pvalue: P-values of the interaction, using predicted distribution of dinucleotide shuffled sequences as background
- pair zscore: Z-score of the interaction, using predicted distribution of dinucleotide shuffled sequences as background
- hfq score: raw logits of hfq binding prediction
- hfq zscore: Z-scores of hfq binding prediction
- Single species combined scores by homolog:
{outdir}/scores-by-homolog-pair/wt.hfq--{genome-set-name}/{sRNA_id}.txt
- This file collates interactions between homologous sRNAs and various targets in different genomes, and groups the entries by homolog relationship to protein in the query genome
- It contains 14 columns. The 'query id' field is specific to this homolog-centric table, indicating the homolog relationship to a protein in the query genome. The meaning of the remaining 13 columns is identical to those in
{outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt
- The phylogenetic tree:
{outdir}/phylogeny/{genome-set-name}/{marker}.nwk