Skip to content

Integration of UTRs from reference-based transcriptome assemblies into existing genome annotations

Notifications You must be signed in to change notification settings

SimonHegele/LUTR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

When comparing transcripts of the Bible in its original language and the Bible in Latin, Martin Luther realized that some of its meaning had been lost in translation. Similarly, UnTranslated Regions (UTRs) are lost in translation. However, to us biologist and bioinformaticians they are of great importance as they have a variety of regulatory functions and a genome annotation without them wouldn't really be complete, right?

Protein ortholog-based gene prediction enables the transfer of detailed gene structure and functional annotations across species by leveraging evolutionary conservation. However, genome annotations from such methods typically lack the less conserved UTRs. LUTR supplements these by using matching transcripts from reference-based transcriptome assemblies.

LUTR succeeds UTRpy

1 Installation

git clone https://github.com/SimonHegele/LUTR
cd LUTR
conda create -n lutr python=3.13
conda activate lutr
pip install .

Additional dependencies required to use the assembly pipeline:

2 Usage

usage: LUTR [-h] [-mf] [-mb] [-nm] [-ma] [-s] [-mupl] [-r] [-t] [-v] [-l] prediction assembly outdir

                    ----------------------------------------------------------------------------------------------------
                      ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   █░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░                                          ██░   ██░   ██░   ██░   ██░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   █░      █░     █░███████░  ██░       █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                    █░   ██░   ██░   ██░   ██░   █░  |█░    █|█░   █░   |█░   |█░ █░     █░   ██░   ██░   ██░   ██░   █░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   |█░     |█░   █░   |█░   |█░  █░     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░    |█░     |█░   █░   |█░   |█░ █░       ██░   ██░   ██░   ██░   ██░
                      ██░   ██░   ██░   ██░   ██░    |█░     |█░   █░   |█░   |█░█░        ██░   ██░   ██░   ██░   ██░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   |█░     |█░   █░   |█░   |█░ █░      █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                    █░   ██░   ██░   ██░   ██░   █░  |█░   █░|█░   █░   |█░   |█░  █░█░  █░   ██░   ██░   ██░   ██░   █░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   █ ████░   ████░    |█░  |█░    █░    █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░                                          ██░   ██░   ██░   ██░   ██░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   █░
                    ----------------------------------------------------------------------------------------------------

                    UTR-extensions for transcripts in functional annotations from orthology-based gene prediction tools
                    using matching transcripts in structural annotations from reference-based transcriptome assembly.

                    Additional parameter explanation:
                    *1 By default, LUTR only allows transcript matches in which all other exons between the first and
                       last matched exon must also be matched exactly. This is based on the assumption that UTRs can
                       affect splicing within the coding regions of a transcript, e.g. by changing its secondary
                       structure.
                    *2 By default, LUTR assignes assembled transcripts only to the matching predicted transcripts they
                       share the most bases with. Assigning them to all matching predicted transcripts can increase the
                       number of annotated UTRs but bears the risk of an increased rate of false positives.
                    ----------------------------------------------------------------------------------------------------


positional arguments:
  prediction                    Annotation from gene prediction (GFF)
  assembly                      Annotation from transcriptome assembly (GFF)
  outdir                        Output directory (Must not exist already)

options:
  -h, --help                    show this help message and exit

Transcript matching:
  -mf, --mftm                   Minimum fraction of the predicted transcript to be matched [default: 0.9]
  -mb, --mtbm                   Minimum bases of the predicted transcript to be matched [default: 10,000]
  -nm, --no_match_middle        Allows missing / additional middle exons in the transcrpt matching *1
  -ma, --match_all              Assign assembled transcripts to all matching predicted transcripts *2
  -mupl, --max_utr_piece_len    Limits the length of UTR-pieces allowed in UTR-variants # TODO

Transcript selection:
  -s , --select                 How to select from multiple UTR-variants [choices: shortest, longest, all] [default: all]
  -r, --remove_unsupported      Remove unmatched transcripts

Performance:
  -t , --threads                Number of parallel processes to use [Default:8]

Others:
  -v , --verbosity              Report progress every 10^v genes [default: 0]
  -l , --log_level              [default: info]

2.1 Input

Input annotations are required to be in GFF3 format. agat_sp_convert_sp_gxf2gxf.pl from AGAT can be used to convert from GTF to GFF.

To generate an assembly annotation matching your predicted one you can use the assembly pipeline described in section 3.

2.2 Output

Contents of the output directory:

Item Explanation
lutr.gff Final annotation
lutr.param File with run parameters
lutr.log File with log messages
tmpdir "Temporary" directory
step_1 indicates completion of step 1

The "temporary" directory stores the annotations for each gene, however it is not deleted automatically. This allows skipping step 1 when re-running LUTR with different parameters.

Running agat_sp_convert_sp_gxf2gxf.pl from [AGAT] (https://github.com/NBISweden/AGAT) on lutr.gff is recommended for the removal of duplicated isoforms

3 Additional scripts

LUTR comes with two additional scripts:

assembly -h
noUTR -h

assembly
Assembly pipeline

  • Short read mapping with STAR (or input of mapped reads)
  • Long read mapping with Minimap2 (or input of mapped reads)
  • Assembly with StringTie

noUTR
Generating UTR-clipped annotations of coding genes
Caution: Duplicates from isoforms using identical CDS but different UTR are not removed. This can be done with agat_convert_gxf2gxf.pl!

4 LUTR workflow

Step 1: Gene pair slice extraction

LUTR creates GFF-slices for all predicted genes and their overlapping assembled genes.

  1. GFF-File loading as pandas.DataFrame
  2. GFF-File chromosome slice generation
  3. Identification of pairs of overlapping predicted and assembled genes
  4. GFF-File gene slice generation for gene pairs*

Step 2: Gene pair slice processing

Transcript matching, removal of unsupported transcripts and UTR-variant generation

  1. Matching pairs of predicted and assembled transcripts exon by exon according to user defined parameters
  2. Removal of unmatched predicted transcripts (only if parameter --remove was set.
  3. Generation of UTR-variants if matching assembled transcripts extend predicted ones
  4. Length based selection of UTR-variants (according to --select parameter)

5 Testing LUTR

5.1 Test data

5.1.1 Prediction

An idealized "prediction" of coding sequences was created from the mouse GRCm39 reference annotation.

  1. noUTR.py was used to remove UTRs from the annotation.
  2. agat_sp_convert_sp_gxf2gxf.pl from AGAT was used to remove duplicated transcript isoforms.

5.1.2. Assemblies

Assembly 1: perfect “assembly” / goal annotation (full reference gene records)

Assembly 2: ~600 million Illumina read pairs (SRA-ID: PRJNA375882), mapped with STAR ~ 60 million r2c2 nanopore reads (SRA-ID: PRJNA971991), mapped with Minimap2 assembled with StringTie3 in conservative mode

5.2 Computational ressources

Test 1:
required RAM: ~4.8GB
required Time: 05m26s (64 threads, server) / 30m20s (8 threads, laptop)

Test 2:
required RAM: ~3.9GB
required Time: 04m35s (64 threads, server) / 21m54 (8 threads, laptop)

The required time required to process a gene is highly dependent on the number of isoforms annotated.
While most genes are processed within a fraction of a second, genes with many isoforms may take more than a minute to process.

5.3 Results

5.2.1 Basic Annotation statistics

Annotation statistics reported by agat_sp_statistics.pl from AGAT:

Test 1:

Annotation # genes # mRNA # mRNA, UTR # mRNA, UTR (both sides) Mean 5'-UTR length Mean 5'-UTR length
Reference 22192 96192 94380 93379 614 1625
Reference (UTR clipped) 22192 68114 0 0 0 0
LUTR 22192 98317 96428 95425 617 1637

Test 2:

Annotation # genes # mRNA # mRNA, UTR # mRNA, UTR (both sides) Mean 5'-UTR length Mean 5'-UTR length
Assembly 33902 ? ? ? ? ?
Reference (UTR clipped) 22192 68114 0 0 0 0
LUTR 22192 94728 79557 79121 9735 1828

5.2.2 Structural analysis of isoforms

Annotation statistics reported by Sqanti3

Figure 1 Figure 2

Figure 1: Sqanti3 results for test 1 (idealized assembly)      Figure 2: Figure 2: Sqanti3 results for test 2 (real assembly)

FSM = Full splice match
ISM = Incomplete splice match
NIC = Novel isoform using combinations of known splice-sites
NNC = Novel isoform using unknown splice-sites

5.3.3 Interpretation

The reconstruction of the reference annotation from the idealized data is very close to the original.
At the same time LUTR inherits the problems of the methods used to create the input annotations.

6. Comparison to predecessor UTRpy

LUTR uses hasmaps for parent-child relationships
After loading GFF-files a map that reflects parent to child relationships is constructed.
This allows fast extraction of records.

LUTR improves parallelization:
UTRpy: Splitting of annotations into chromosome slices
-> Number of usable CPU-cores limited by the number of chromosomes
-> Workload is not distributed evenly among processes
LUTR: Splitting of annotations into chromosome slices and into gene slices second
-> Number of usable CPU-cores practically unlimited
-> Even distribution of workload among processes

LUTR improves transcript matching:
UTRpy:
-> Attempts matching for each pair of assembled and predicted transcript on a chromosome.
-> Only allows only perfect transcript matches
LUTR:
-> Attempts matching only for pairs of assembled and predicted transcript when the assembled and predicted genes overlap
-> Acknowledes potential incompleteness of assembled transcriptomes or transcripts and provides parameters allowing imperfect matches

LUTR is significantly more efficient, especially for annotations from gene predictions, where genes may have multiple annotated isoforms.

Citing LUTR

Currently there is no publication on LUTR, please refer to this repository.

Acknowledgements

I would like to thank the Leibniz Institute on Aging – Fritz Lipmann Institute (FLI) and the chair of Prof. Dr. Steve Hoffmann in particular for providing me with the computational ressources required for the development and testing of this tool.

About

Integration of UTRs from reference-based transcriptome assemblies into existing genome annotations

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages