Skip to content

Optimizing esl-translate for genomes where alternative start sites and/or unusual codon usage is common? #77

@000generic

Description

@000generic

Hi!

I am using Hmmer3 with bivalve genomes looking for 70-80 aa genes. BUSCO scores for the genomes are in the upper 90s (96-98% complete for BUSCO Metazoa) and most assemblies are chromosome-scale with less than 100 to less than 2000 scaffolds in several cases. So overall, high quality I think.

Working with genome gene model proteomes and using RNAseq target gene family protein sequences from Uniprot for hmmbuild - I scanned the genomes with hmmsearch - and also with jackhmmer for three different initial sequences from three different species in the set - but in all cases and collectively, I failed to find in the genome protein gene models of a given species, sequences known for the species in its RNAseq transcriptome. This happened at scales of, for example, finding 4 when 11 are known in a species.

Given this, I then used esl-translate to do 6-frame translations and fed this into hmmsearch based on an issue thread in Hmmer's github. For example:

esl-translate /home/eric_edsinger/fhl/mytilus-2024/species24/genomes/USEME-species24-assembly-fastas/Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic.fna > output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-esl_6_frames.aa &&

hmmsearch  -o output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-ALL -A output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-ALIGNMENTS --tblout output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-HITS -E 1e-5 --incE 1e-5 --max --cpu 5 input/MFP3-build_01.hmm output/3-Mytilus_californianus_GCF_021869535.1-xbMytCali1.0.p_genomic-esl_6_frames.aa 

and while I do find what maybe be additional hits in some cases, including in a species that previously had zero hits by the above protein-based methods - I am noticing that in cases where a protein gene model exists, I get typically two hits one at the start and a second further down per detected gene in the genome. However in some cases - and entirely for the species that had no protein-based hits, I lack the start-site hit and get only the second downstream hit - all of this being based on the alignments produced by Hmmer3.

This has me wondering if esl-translate is missing start sites or mistranslating things in the bivalve genomes.

Along these lines, bivalves are known to have unusual synonymous codon usage bias - and to commonly use alternative start codons - so for example, one study noted in their work:

"Alternative start codons were considered functional because they are common in Bivalvia.
ORFs were annotated starting from the first available start codon (ATG, ATA, or ATC)
downstream of the preceding gene, and ending with the first stop codon in frame (TAA or
TAG)"

I was wondering if esl-translate might be using only one or two of these three start codons - or if its weighting of all three might be biased towards genetic models or vertebrates / is it possible to have all three start codons considered with equal weight (if weighting is even involved).

Its possible something similar happened in the genome projects for their structural annotation such that proteins known in the transcriptome go missing in the final genome model calls - and this is why things are not being found by Hmmer3 in the genome gene models. But assuming the genome assemblies are fairly good - this still leaves esl-translate-hmmsearch unresolved.

In the quoted paper above they used ORF Finder to find reading frames - and I could opt for this upstream of Hmmer3 - but I like the idea of using esl, since it seems like it is being developed in parallel to Hmmer.

Its entirely possible I am misunderstanding or misstating something fundamental in all of this - I'm self-taught on it / its not my expertise yet.

Any thoughts or guidance would be greatly appreciated!

Thank you very much :)

Eric

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions