CMDTarget

What is CMDTarget?

CMDTarget (Comparative Modeling based Denoising for sRNA Target prediction) is a computational tool for sRNA target prediction and evolutionary analysis.
While conceptually similar to CopraRNA, CMDTarget has several distinct features:
- CMDTarget explicitly leverages information of evolutionary conservation, while unlike CopraRNA, it does not force homolog sRNA-mRNA pairs across different species to have the same interaction score by explicitly reconstructing interaction scores of both extant and ancestral sRNA-mRNA pairs.
- CMDTarget includes a module for Hfq binding prediction, and could optionally incorporate predicted Hfq binding scores for sRNA target prediction.
- If you have multiple sRNAs, or a large genome set for comparative analysis, comparative genomics based sRNA target prediction can be computationally intensive. CMDTarget is built with Snakemake and runs efficiently on HPC clusters (Slurm, SGE, etc.).

Installation and dependencies

Dependencies

We recommend using conda to automatically resolve dependencies

conda env create -f environment.yml 
#alternatively, the following command should work
#conda install python snakemake ete3 intarna mafft cd-hit FastTree mmseqs2 hmmer infernal pyfaidx biopython tqdm ypytorch-cuda pytorch -c pytorch -c nvidia -c conda-forge -c bioconda

IntaRNA: for interaction energy calculation
snakemake: for pipeline management
ete3: for tree operation
scipy: implement L-BFGS-B for optimizing scores in tips and internal nodes, also required by ete3
cd-hit: for sequence clustering
mafft: for MSA construction
FastTree: for 16S rRNA tree building
mmseqs: for homolog search of sRNA and protein
pyfaidx: for genome sequence processing
hmmer: for detection of rpoB marker gene
infernal: cmsearch for 16S rRNA sequence detection, required is using 16S rRNAs as phylogenetic marker
pytorch: for Hfq binding score prediction

Using CMDTarget

Input of CMDTarget

The required inputs of CMDTarget include:
- Genome sequences
- Genome annotations in gff format
- Homologous sRNA sequences
These inputs should be organized as follows:
- indir: input directory, should contain the following subdirectories
  - {indir}/fasta: Genome sequence in fasta format, named as '{genome_id}.fa'
  - {indir}/gff: genome annotation in gff format, named as '{genome_id}.gff'
  - {indir}/{hits}/hits.groupped
    - organized like '{indir}/{hits}/hits.groupped/{genome_id}/{sRNA_id}.fa', each file contains 1 sRNA sequence
    - homologous sRNA sequences share the same id
An easy-to-use script for preparing input data was provided, see below.

Prepare input for CMDTarget

You have multiple ways to curate your genome set for comparative analysis. Here we provide a simple strategy.
For testing, you can simply use 10 Escherichia genome sequences and annotations in genomes.tar.gz, including E.coli (GCF_000005845.2)
Suppose you have a genome of interest (called the query genome), you need to select several phylogenetically related genomes for comparative analysis.
You may start from GTDB genomes of the clade that contains these genomes. You may only keep RefSeq genomes, or genomes with completeness greater than a certain cutoff. You can then get the assembly IDs (starting with "GCF") and save them in a text file.

python scripts/select-GTDB-genomes.py --clade g__Escherichia --output genomes/Escherichia.txt

If your query genome(s) are not present in this list, please manually add them
Then fetch sequences and annotations of these genomes.

#First download ncbi assembly summary.
wget -P genomes https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
# fetch genomes of interests
mkdir -p genomes/gff genomes/fasta
scripts/fetch-refseq-genomes.py --genome-ids genomes/Escherichia.txt --fasta-directory genomes/fasta --gff-directory genomes/gff --summary genomes/assembly_summary_refseq.txt

Reformat CDS coordinate from gff to bed

bash run/gff2bed.sh

Determine which genomes contain sRNA(s) for target prediction.

To do so, first put your sRNAs in a fasta file.

#cat sRNAs.fa
>GCF_000005845.2:RyhB
GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTTTTTTT
>GCF_000005845.2:Spot42
GTAGGGTACAGAGGTAAGATGTTCTATCTTTCAGACCTTTTACTTCACGTAATCGGATTTGGCTGAATATTTTAGCCGCCCCAGTCAGTAATGACTGGGGCGTTTTTTATT
>GCF_000005845.2:CyaR
GCTGAAAAACATAACCCATAAAATGCTAGCTGTACCAGGAACCACCTCCTTAGCCTGTGTAATCTCCCTTACACGGGCTTATTTTTT

Perform the homolog search, and extract hit sequences:

# perform homolog search, homolog sequence saved in genomes/sRNA-hits/hits.fa
scripts/sRNA-homolog-search.py -q sRNAs.fa -gd genomes/fasta -od genomes/sRNA-hits --threads 12 -gi genomes/Escherichia.txt
# group sRNAs by homolog
scripts/group-sRNAs-by-homolog.py -i genomes/sRNA-hits/hits.fa -od genomes/sRNA-hits/hits
# further group sRNA by genomes, make each file contain a single sequence
scripts/split-sRNA-homolog.py -id genomes/sRNA-hits/hits -od genomes/sRNA-hits/hits.groupped

Extract genome ids with hits

cut -f 2 genomes/sRNA-hits/hits.tsv | cut -f 1 -d ':' | sort -u > genomes/Escherichia.with.hits.txt

Prepare config file

Here is an example config file in json format:

#example.json
{ "indir": "genomes",
  "outdir": "output/Enterobacteriaceae",
  "hits": "sRNA-hits",
  "genome-set-name": "Escherichia",
  "genome-ids": "genomes/Escherichia.with.hits.txt",
  "query-ids": ["GCF_000005845.2"],
  "sRNA-ids": "sRNA-ids.txt",
  "marker": "rpoB",
  "weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
  "denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
  "steps": ["marker detection","single species scoring","comparative denoising"]}

genome-ids:
- A text file specifying the genomes used in comparative analysis.
- One genome id per line
- For all genome_id with '{indir}/fasta/{genome_id}.fa', '{indir}/gff/{genome_id}.gff' present, only those specified in this file are used
query-ids:
- A list specifying the query genome(s). The genome should be present in genome-ids
sRNA-ids:
- A text file specifying sRNAs to consider
- For all sRNAs with {indir}/{hits}/hits.groupped/{genome_id}/{sRNA_id}.fa present, only those present in this file are considered
marker:
- Phylogenetic marker for comparative analysis
weights: weights of energy scores and hfq scores
denoise: parameter for denoising
steps: steps to run. one or more steps can be specified.
- "marker detection": detect phylogenetic marker
- "single species scoring": scoring binding using single species mode
- "comparative denoising": run comparative denoising. As this step depends on "marker detection" and "single species scoring", if you only specify this step, the other two steps will be called

Run the comparative sRNA target prediction pipeline

Once the data and config file are ready, the comparative analysis is fully automatic

We recommend first running the "marker detection" step to make sure the marker gene is present in genomes that will be used for downstream analysis

example.marker.detection.json

{ "indir": "genomes",
  "outdir": "output/Enterobacteriaceae",
  "hits": "sRNA-hits",
  "genome-set-name": "Escherichia",
  "genome-ids": "genomes/Escherichia.with.hits.txt",
  "query-ids": ["GCF_000005845.2"],
  "sRNA-ids": "sRNA-ids.txt",
  "marker": "rpoB",
  "weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
  "denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
  "steps": ["marker detection"]}

snakemake --configfile example.marker.detection.json --jobs 16

Marker genes were extracted to {indir}/{marker}/{genome_id}.fa

Define the genome set to use for downstream analysis

You may keep all genomes with marker genes

ls genomes/rpoB/ | grep '.fa$' | sed 's/.fa$//' > genome-ids.txt

When you have many genomes, up to 8 genomes with <0.95 sequence identity is generally sufficient

# combine marker sequences
scripts/combine-fasta.py -i genomes/rpoB -o genomes/Escherichia.rpoB.fa -gi  genomes/Escherichia.with.hits.txt

# get sequence identity to query marker, the results are saved in Escherichia.sim2query/scores.txt
# here we use 0.99 sequence identity cutoff as an example to reserve enough genomes
# if you have diverse genomes, --cutoff 0.95 is a better default
scripts/select-genome-by-divergence.py -i genomes/Escherichia.rpoB.fa --query-ids GCF_000005845.2 --cutoff 0.99 -od Escherichia.sim2query/

# prepare refined genome ids
cut -f 1 -d ':' Escherichia.sim2query/scores.txt > genomes/genome-ids.refined.txt

Refine your genome set by modifying "genome-ids" in config file example.comp.scoring.json for interaction prediction, then run single genome scoring and comparative analysis

{ "indir": "genomes",
  "outdir": "output/Enterobacteriaceae",
  "hits": "sRNA-hits",
  "genome-set-name": "ent9", # an alias genome set used for comparative analysis
  "genome-ids": "genomes/genome-ids.refined.txt", 
  "query-ids": ["GCF_000005845.2"], # genome for sRNA target prediction, multiple genomes can be provided
  "sRNA-ids": "sRNA-ids.txt",
  "marker": "rpoB",
  "weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
  "denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
  "steps": ["single species scoring","comparative denoising"]}

#--resources gpu=1 restrict jobs for runing Hfq score prediction
snakemake --configfile example.comp.scoring.json --jobs 16 --resources gpu=1

Interpretation of the results

The final result after comparative denoising: {outdir}/comparative-analysis-denoised-ml/wt.hfq--{genome_set}/{params}/{sRNA_id}.txt. The default "params" is "pair.zscore-0.7_hfq.zscore-0.3_marker-rpoB_srm-10_nvm-10_rescale-0_reroot-1"

This is the main results of comparative denoising pipeline
This table contains 5 fields.
- query id: protein id in the query genome, which were used for grouping the records
- target id: indicates the record is the interaction score between which gene and the sRNA considered
- raw: the score before comparative denoising
- denoised: the score after comparative denoising
- tree: the phylogenetic tree with reconstructed denoised interaction scores. Only records with 'query id' = 'target id' contain this field.
(target id, denoised) is the denoised interaction scores between sRNA and targets in all genomes
If you are only interested in sRNA-target interactions in the query genome, you can directly use (target id, denoised) of rows with 'query id' = 'target id'

Raw hfq scores: {indir}/hfq

This directory contains hfq binding scores of the leader sequence (the -200 to +100 nt region relative to the start codon) of each gene
{indir}/hfq/{asm_id}.txt contains three columns:
- gene id: ID of each gene, formatted like 'NP_414542.1-b0001-thrL', that is '{refseq_id}-{locus_id}-{gene_name}'
- score: raw logits of the prediction, higher indicates stronger binding
- zscore: normalized binding scores, higher indicates stronger binding

Raw binding energy: {outdir}/{asm_id}/energies

This directory contains binding energies of an sRNA and leaders of different genes within a genome
{outdir}/{asm_id}/energies/{sRNA_id}.txt contains 8 columns:
- sRNA id: ID of the sRNA
- sstart: start of the predicted binding site at sRNA side
- send: end of the predicted binding site at sRNA side
- target id: ID of target gene (named as '{refseq_id}-{locus_id}-{gene_name}')
- tstart: start of the predicted binding site at mRNA side, the coordinate is relative to the (-200,+100) leader sequence
- tend: end of the predicted binding site at mRNA side
- energy: IntaRNA predicted raw energy

Single species combined scores by genome: {outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt

{outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt contains 13 columns:
- protein id: ID of target protein, named as '{refseq_id}-{locus_id}-{gene_name}'
- sstart: start of the predicted binding site at sRNA side
- send: end of the predicted binding site at sRNA side
- target id: same as 'protein id', can be omitted
- tstart: start of the predicted binding site at mRNA side, the coordinate is relative to the (-200,+100) leader sequence
- tend: end of the predicted binding site at mRNA side
- energy: raw binding energies
- sRNA pvalue: P-values of the interaction, using all genes as background for normalization
- sRNA zscore: Z-scores of the interaction, using all genes as background for normalization
- pair pvalue: P-values of the interaction, using predicted distribution of dinucleotide shuffled sequences as background
- pair zscore: Z-score of the interaction, using predicted distribution of dinucleotide shuffled sequences as background
- hfq score: raw logits of hfq binding prediction
- hfq zscore: Z-scores of hfq binding prediction

Single species combined scores by homolog: {outdir}/scores-by-homolog-pair/wt.hfq--{genome-set-name}/{sRNA_id}.txt

This file collates interactions between homologous sRNAs and various targets in different genomes, and groups the entries by homolog relationship to protein in the query genome
It contains 14 columns. The 'query id' field is specific to this homolog-centric table, indicating the homolog relationship to a protein in the query genome. The meaning of the remaining 13 columns is identical to those in {outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt

The phylogenetic tree: {outdir}/phylogeny/{genome-set-name}/{marker}.nwk

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
models		models
run		run
scripts		scripts
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
environment.yml		environment.yml
example.comp.scoring.json		example.comp.scoring.json
example.marker.detection.json		example.marker.detection.json
genomes.tar.gz		genomes.tar.gz
sRNA-ids-test.txt		sRNA-ids-test.txt
sRNA-ids.txt		sRNA-ids.txt
sRNAs.fa		sRNAs.fa
workflow.png		workflow.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CMDTarget

What is CMDTarget?

Installation and dependencies

Dependencies

Using CMDTarget

Input of CMDTarget

Prepare input for CMDTarget

Prepare config file

Run the comparative sRNA target prediction pipeline

Interpretation of the results

FAQs

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CMDTarget

What is CMDTarget?

Installation and dependencies

Dependencies

Using CMDTarget

Input of CMDTarget

Prepare input for CMDTarget

Prepare config file

Run the comparative sRNA target prediction pipeline

Interpretation of the results

FAQs

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages