Skip to content

uaauaguga/CMDTarget

Repository files navigation

CMDTarget

What is CMDTarget?

  • CMDTarget (Comparative Modeling based Denoising for sRNA Target prediction) is a computational tool for sRNA target prediction and evolutionary analysis.
  • While conceptually similar to CopraRNA, CMDTarget has several distinct features:
    • CMDTarget explicitly leverages information of evolutionary conservation, while unlike CopraRNA, it does not force homolog sRNA-mRNA pairs across different species to have the same interaction score by explicitly reconstructing interaction scores of both extant and ancestral sRNA-mRNA pairs.
    • CMDTarget includes a module for Hfq binding prediction, and could optionally incorporate predicted Hfq binding scores for sRNA target prediction.
    • If you have multiple sRNAs, or a large genome set for comparative analysis, comparative genomics based sRNA target prediction can be computationally intensive. CMDTarget is built with Snakemake and runs efficiently on HPC clusters (Slurm, SGE, etc.).

Installation and dependencies

Dependencies

  • We recommend using conda to automatically resolve dependencies
conda env create -f environment.yml 
#alternatively, the following command should work
#conda install python snakemake ete3 intarna mafft cd-hit FastTree mmseqs2 hmmer infernal pyfaidx biopython tqdm ypytorch-cuda pytorch -c pytorch -c nvidia -c conda-forge -c bioconda
  • IntaRNA: for interaction energy calculation
  • snakemake: for pipeline management
  • ete3: for tree operation
  • scipy: implement L-BFGS-B for optimizing scores in tips and internal nodes, also required by ete3
  • cd-hit: for sequence clustering
  • mafft: for MSA construction
  • FastTree: for 16S rRNA tree building
  • mmseqs: for homolog search of sRNA and protein
  • pyfaidx: for genome sequence processing
  • hmmer: for detection of rpoB marker gene
  • infernal: cmsearch for 16S rRNA sequence detection, required is using 16S rRNAs as phylogenetic marker
  • pytorch: for Hfq binding score prediction

Using CMDTarget

Input of CMDTarget

  • The required inputs of CMDTarget include:
    • Genome sequences
    • Genome annotations in gff format
    • Homologous sRNA sequences
  • These inputs should be organized as follows:
    • indir: input directory, should contain the following subdirectories
      • {indir}/fasta: Genome sequence in fasta format, named as '{genome_id}.fa'
      • {indir}/gff: genome annotation in gff format, named as '{genome_id}.gff'
      • {indir}/{hits}/hits.groupped
        • organized like '{indir}/{hits}/hits.groupped/{genome_id}/{sRNA_id}.fa', each file contains 1 sRNA sequence
        • homologous sRNA sequences share the same id
  • An easy-to-use script for preparing input data was provided, see below.

Prepare input for CMDTarget

  • You have multiple ways to curate your genome set for comparative analysis. Here we provide a simple strategy.
  • For testing, you can simply use 10 Escherichia genome sequences and annotations in genomes.tar.gz, including E.coli (GCF_000005845.2)
  • Suppose you have a genome of interest (called the query genome), you need to select several phylogenetically related genomes for comparative analysis.
  • You may start from GTDB genomes of the clade that contains these genomes. You may only keep RefSeq genomes, or genomes with completeness greater than a certain cutoff. You can then get the assembly IDs (starting with "GCF") and save them in a text file.
python scripts/select-GTDB-genomes.py --clade g__Escherichia --output genomes/Escherichia.txt
  • If your query genome(s) are not present in this list, please manually add them

  • Then fetch sequences and annotations of these genomes.

#First download ncbi assembly summary.
wget -P genomes https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
# fetch genomes of interests
mkdir -p genomes/gff genomes/fasta
scripts/fetch-refseq-genomes.py --genome-ids genomes/Escherichia.txt --fasta-directory genomes/fasta --gff-directory genomes/gff --summary genomes/assembly_summary_refseq.txt
  • Reformat CDS coordinate from gff to bed
bash run/gff2bed.sh
  • Determine which genomes contain sRNA(s) for target prediction.

To do so, first put your sRNAs in a fasta file.

#cat sRNAs.fa
>GCF_000005845.2:RyhB
GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTTTTTTT
>GCF_000005845.2:Spot42
GTAGGGTACAGAGGTAAGATGTTCTATCTTTCAGACCTTTTACTTCACGTAATCGGATTTGGCTGAATATTTTAGCCGCCCCAGTCAGTAATGACTGGGGCGTTTTTTATT
>GCF_000005845.2:CyaR
GCTGAAAAACATAACCCATAAAATGCTAGCTGTACCAGGAACCACCTCCTTAGCCTGTGTAATCTCCCTTACACGGGCTTATTTTTT

Perform the homolog search, and extract hit sequences:

# perform homolog search, homolog sequence saved in genomes/sRNA-hits/hits.fa
scripts/sRNA-homolog-search.py -q sRNAs.fa -gd genomes/fasta -od genomes/sRNA-hits --threads 12 -gi genomes/Escherichia.txt
# group sRNAs by homolog
scripts/group-sRNAs-by-homolog.py -i genomes/sRNA-hits/hits.fa -od genomes/sRNA-hits/hits
# further group sRNA by genomes, make each file contain a single sequence
scripts/split-sRNA-homolog.py -id genomes/sRNA-hits/hits -od genomes/sRNA-hits/hits.groupped
  • Extract genome ids with hits
cut -f 2 genomes/sRNA-hits/hits.tsv | cut -f 1 -d ':' | sort -u > genomes/Escherichia.with.hits.txt

Prepare config file

  • Here is an example config file in json format:
#example.json
{ "indir": "genomes",
  "outdir": "output/Enterobacteriaceae",
  "hits": "sRNA-hits",
  "genome-set-name": "Escherichia",
  "genome-ids": "genomes/Escherichia.with.hits.txt",
  "query-ids": ["GCF_000005845.2"],
  "sRNA-ids": "sRNA-ids.txt",
  "marker": "rpoB",
  "weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
  "denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
  "steps": ["marker detection","single species scoring","comparative denoising"]}
  • genome-ids:
    • A text file specifying the genomes used in comparative analysis.
    • One genome id per line
    • For all genome_id with '{indir}/fasta/{genome_id}.fa', '{indir}/gff/{genome_id}.gff' present, only those specified in this file are used
  • query-ids:
    • A list specifying the query genome(s). The genome should be present in genome-ids
  • sRNA-ids:
    • A text file specifying sRNAs to consider
    • For all sRNAs with {indir}/{hits}/hits.groupped/{genome_id}/{sRNA_id}.fa present, only those present in this file are considered
  • marker:
    • Phylogenetic marker for comparative analysis
  • weights: weights of energy scores and hfq scores
  • denoise: parameter for denoising
  • steps: steps to run. one or more steps can be specified.
    • "marker detection": detect phylogenetic marker
    • "single species scoring": scoring binding using single species mode
    • "comparative denoising": run comparative denoising. As this step depends on "marker detection" and "single species scoring", if you only specify this step, the other two steps will be called

Run the comparative sRNA target prediction pipeline

  • Once the data and config file are ready, the comparative analysis is fully automatic
  1. We recommend first running the "marker detection" step to make sure the marker gene is present in genomes that will be used for downstream analysis
  • example.marker.detection.json
{ "indir": "genomes",
  "outdir": "output/Enterobacteriaceae",
  "hits": "sRNA-hits",
  "genome-set-name": "Escherichia",
  "genome-ids": "genomes/Escherichia.with.hits.txt",
  "query-ids": ["GCF_000005845.2"],
  "sRNA-ids": "sRNA-ids.txt",
  "marker": "rpoB",
  "weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
  "denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
  "steps": ["marker detection"]}
snakemake --configfile example.marker.detection.json --jobs 16 
  • Marker genes were extracted to {indir}/{marker}/{genome_id}.fa
  1. Define the genome set to use for downstream analysis
  • You may keep all genomes with marker genes
ls genomes/rpoB/ | grep '.fa$' | sed 's/.fa$//' > genome-ids.txt
  • When you have many genomes, up to 8 genomes with <0.95 sequence identity is generally sufficient
# combine marker sequences
scripts/combine-fasta.py -i genomes/rpoB -o genomes/Escherichia.rpoB.fa -gi  genomes/Escherichia.with.hits.txt

# get sequence identity to query marker, the results are saved in Escherichia.sim2query/scores.txt
# here we use 0.99 sequence identity cutoff as an example to reserve enough genomes
# if you have diverse genomes, --cutoff 0.95 is a better default
scripts/select-genome-by-divergence.py -i genomes/Escherichia.rpoB.fa --query-ids GCF_000005845.2 --cutoff 0.99 -od Escherichia.sim2query/

# prepare refined genome ids
cut -f 1 -d ':' Escherichia.sim2query/scores.txt > genomes/genome-ids.refined.txt
  1. Refine your genome set by modifying "genome-ids" in config file example.comp.scoring.json for interaction prediction, then run single genome scoring and comparative analysis
{ "indir": "genomes",
  "outdir": "output/Enterobacteriaceae",
  "hits": "sRNA-hits",
  "genome-set-name": "ent9", # an alias genome set used for comparative analysis
  "genome-ids": "genomes/genome-ids.refined.txt", 
  "query-ids": ["GCF_000005845.2"], # genome for sRNA target prediction, multiple genomes can be provided
  "sRNA-ids": "sRNA-ids.txt",
  "marker": "rpoB",
  "weights": {"pair zscore": 0.7, "hfq zscore": 0.3},
  "denoise": {"srm": 10, "nvm": 10, "rescale": 0, "reroot": 1},
  "steps": ["single species scoring","comparative denoising"]}
#--resources gpu=1 restrict jobs for runing Hfq score prediction
snakemake --configfile example.comp.scoring.json --jobs 16 --resources gpu=1

Interpretation of the results

  1. The final result after comparative denoising: {outdir}/comparative-analysis-denoised-ml/wt.hfq--{genome_set}/{params}/{sRNA_id}.txt. The default "params" is "pair.zscore-0.7_hfq.zscore-0.3_marker-rpoB_srm-10_nvm-10_rescale-0_reroot-1"
  • This is the main results of comparative denoising pipeline
  • This table contains 5 fields.
    • query id: protein id in the query genome, which were used for grouping the records
    • target id: indicates the record is the interaction score between which gene and the sRNA considered
    • raw: the score before comparative denoising
    • denoised: the score after comparative denoising
    • tree: the phylogenetic tree with reconstructed denoised interaction scores. Only records with 'query id' = 'target id' contain this field.
  • (target id, denoised) is the denoised interaction scores between sRNA and targets in all genomes
  • If you are only interested in sRNA-target interactions in the query genome, you can directly use (target id, denoised) of rows with 'query id' = 'target id'
  1. Raw hfq scores: {indir}/hfq
  • This directory contains hfq binding scores of the leader sequence (the -200 to +100 nt region relative to the start codon) of each gene
  • {indir}/hfq/{asm_id}.txt contains three columns:
    • gene id: ID of each gene, formatted like 'NP_414542.1-b0001-thrL', that is '{refseq_id}-{locus_id}-{gene_name}'
    • score: raw logits of the prediction, higher indicates stronger binding
    • zscore: normalized binding scores, higher indicates stronger binding
  1. Raw binding energy: {outdir}/{asm_id}/energies
  • This directory contains binding energies of an sRNA and leaders of different genes within a genome
  • {outdir}/{asm_id}/energies/{sRNA_id}.txt contains 8 columns:
    • sRNA id: ID of the sRNA
    • sstart: start of the predicted binding site at sRNA side
    • send: end of the predicted binding site at sRNA side
    • target id: ID of target gene (named as '{refseq_id}-{locus_id}-{gene_name}')
    • tstart: start of the predicted binding site at mRNA side, the coordinate is relative to the (-200,+100) leader sequence
    • tend: end of the predicted binding site at mRNA side
    • energy: IntaRNA predicted raw energy
  1. Single species combined scores by genome: {outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt
  • {outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt contains 13 columns:
    • protein id: ID of target protein, named as '{refseq_id}-{locus_id}-{gene_name}'
    • sstart: start of the predicted binding site at sRNA side
    • send: end of the predicted binding site at sRNA side
    • target id: same as 'protein id', can be omitted
    • tstart: start of the predicted binding site at mRNA side, the coordinate is relative to the (-200,+100) leader sequence
    • tend: end of the predicted binding site at mRNA side
    • energy: raw binding energies
    • sRNA pvalue: P-values of the interaction, using all genes as background for normalization
    • sRNA zscore: Z-scores of the interaction, using all genes as background for normalization
    • pair pvalue: P-values of the interaction, using predicted distribution of dinucleotide shuffled sequences as background
    • pair zscore: Z-score of the interaction, using predicted distribution of dinucleotide shuffled sequences as background
    • hfq score: raw logits of hfq binding prediction
    • hfq zscore: Z-scores of hfq binding prediction
  1. Single species combined scores by homolog: {outdir}/scores-by-homolog-pair/wt.hfq--{genome-set-name}/{sRNA_id}.txt
  • This file collates interactions between homologous sRNAs and various targets in different genomes, and groups the entries by homolog relationship to protein in the query genome
  • It contains 14 columns. The 'query id' field is specific to this homolog-centric table, indicating the homolog relationship to a protein in the query genome. The meaning of the remaining 13 columns is identical to those in {outdir}/{asm_id}/combined.wt.hfq/{sRNA_id}.txt
  1. The phylogenetic tree: {outdir}/phylogeny/{genome-set-name}/{marker}.nwk

FAQs

Citation

About

Comparative modeling based denoising for sRNA target prediction

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors