When comparing transcripts of the Bible in its original language and the Bible in Latin, Martin Luther realized that some of its meaning had been lost in translation. Similarly, UnTranslated Regions (UTRs) are lost in translation. However, to us biologist and bioinformaticians they are of great importance as they have a variety of regulatory functions and a genome annotation without them wouldn't really be complete, right?
Protein ortholog-based gene prediction enables the transfer of detailed gene structure and functional annotations across species by leveraging evolutionary conservation. However, genome annotations from such methods typically lack the less conserved UTRs. LUTR supplements these by using matching transcripts from reference-based transcriptome assemblies.
LUTR succeeds UTRpy
git clone https://github.com/SimonHegele/LUTR
cd LUTR
conda create -n lutr python=3.13
conda activate lutr
pip install .
Additional dependencies required to use the assembly pipeline:
usage: LUTR [-h] [-mf] [-mb] [-nm] [-ma] [-s] [-mupl] [-r] [-t] [-v] [-l] prediction assembly outdir
----------------------------------------------------------------------------------------------------
██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ █░
█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░
█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░███████░ ██░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
█░ ██░ ██░ ██░ ██░ █░ |█░ █|█░ █░ |█░ |█░ █░ █░ ██░ ██░ ██░ ██░ █░
█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ |█░ |█░ █░ |█░ |█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
██░ ██░ ██░ ██░ ██░ |█░ |█░ █░ |█░ |█░ █░ ██░ ██░ ██░ ██░ ██░
██░ ██░ ██░ ██░ ██░ |█░ |█░ █░ |█░ |█░█░ ██░ ██░ ██░ ██░ ██░
█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ |█░ |█░ █░ |█░ |█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
█░ ██░ ██░ ██░ ██░ █░ |█░ █░|█░ █░ |█░ |█░ █░█░ █░ ██░ ██░ ██░ ██░ █░
█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █ ████░ ████░ |█░ |█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░
█░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ ██░ █░
----------------------------------------------------------------------------------------------------
UTR-extensions for transcripts in functional annotations from orthology-based gene prediction tools
using matching transcripts in structural annotations from reference-based transcriptome assembly.
Additional parameter explanation:
*1 By default, LUTR only allows transcript matches in which all other exons between the first and
last matched exon must also be matched exactly. This is based on the assumption that UTRs can
affect splicing within the coding regions of a transcript, e.g. by changing its secondary
structure.
*2 By default, LUTR assignes assembled transcripts only to the matching predicted transcripts they
share the most bases with. Assigning them to all matching predicted transcripts can increase the
number of annotated UTRs but bears the risk of an increased rate of false positives.
----------------------------------------------------------------------------------------------------
positional arguments:
prediction Annotation from gene prediction (GFF)
assembly Annotation from transcriptome assembly (GFF)
outdir Output directory (Must not exist already)
options:
-h, --help show this help message and exit
Transcript matching:
-mf, --mftm Minimum fraction of the predicted transcript to be matched [default: 0.9]
-mb, --mtbm Minimum bases of the predicted transcript to be matched [default: 10,000]
-nm, --no_match_middle Allows missing / additional middle exons in the transcrpt matching *1
-ma, --match_all Assign assembled transcripts to all matching predicted transcripts *2
-mupl, --max_utr_piece_len Limits the length of UTR-pieces allowed in UTR-variants # TODO
Transcript selection:
-s , --select How to select from multiple UTR-variants [choices: shortest, longest, all] [default: all]
-r, --remove_unsupported Remove unmatched transcripts
Performance:
-t , --threads Number of parallel processes to use [Default:8]
Others:
-v , --verbosity Report progress every 10^v genes [default: 0]
-l , --log_level [default: info]
Input annotations are required to be in GFF3 format. agat_sp_convert_sp_gxf2gxf.pl from AGAT can be used to convert from GTF to GFF.
To generate an assembly annotation matching your predicted one you can use the assembly pipeline described in section 3.
Contents of the output directory:
| Item | Explanation |
|---|---|
| lutr.gff | Final annotation |
| lutr.param | File with run parameters |
| lutr.log | File with log messages |
| tmpdir | "Temporary" directory |
| step_1 | indicates completion of step 1 |
The "temporary" directory stores the annotations for each gene, however it is not deleted automatically. This allows skipping step 1 when re-running LUTR with different parameters.
Running agat_sp_convert_sp_gxf2gxf.pl from [AGAT] (https://github.com/NBISweden/AGAT) on lutr.gff is recommended for the removal of duplicated isoforms
LUTR comes with two additional scripts:
assembly -h
noUTR -h
assembly
Assembly pipeline
- Short read mapping with STAR (or input of mapped reads)
- Long read mapping with Minimap2 (or input of mapped reads)
- Assembly with StringTie
noUTR
Generating UTR-clipped annotations of coding genes
Caution: Duplicates from isoforms using identical CDS but different UTR are not removed. This can be done with agat_convert_gxf2gxf.pl!
LUTR creates GFF-slices for all predicted genes and their overlapping assembled genes.
- GFF-File loading as pandas.DataFrame
- GFF-File chromosome slice generation
- Identification of pairs of overlapping predicted and assembled genes
- GFF-File gene slice generation for gene pairs*
Transcript matching, removal of unsupported transcripts and UTR-variant generation
- Matching pairs of predicted and assembled transcripts exon by exon according to user defined parameters
- Removal of unmatched predicted transcripts (only if parameter --remove was set.
- Generation of UTR-variants if matching assembled transcripts extend predicted ones
- Length based selection of UTR-variants (according to --select parameter)
An idealized "prediction" of coding sequences was created from the mouse GRCm39 reference annotation.
- noUTR.py was used to remove UTRs from the annotation.
- agat_sp_convert_sp_gxf2gxf.pl from AGAT was used to remove duplicated transcript isoforms.
Assembly 1: perfect “assembly” / goal annotation (full reference gene records)
Assembly 2: ~600 million Illumina read pairs (SRA-ID: PRJNA375882), mapped with STAR ~ 60 million r2c2 nanopore reads (SRA-ID: PRJNA971991), mapped with Minimap2 assembled with StringTie3 in conservative mode
Test 1:
required RAM: ~4.8GB
required Time: 05m26s (64 threads, server) / 30m20s (8 threads, laptop)
Test 2:
required RAM: ~3.9GB
required Time: 04m35s (64 threads, server) / 21m54 (8 threads, laptop)
The required time required to process a gene is highly dependent on the number of isoforms annotated.
While most genes are processed within a fraction of a second, genes with many isoforms may take more than a minute to process.
Annotation statistics reported by agat_sp_statistics.pl from AGAT:
Test 1:
| Annotation | # genes | # mRNA | # mRNA, UTR | # mRNA, UTR (both sides) | Mean 5'-UTR length | Mean 5'-UTR length |
|---|---|---|---|---|---|---|
| Reference | 22192 | 96192 | 94380 | 93379 | 614 | 1625 |
| Reference (UTR clipped) | 22192 | 68114 | 0 | 0 | 0 | 0 |
| LUTR | 22192 | 98317 | 96428 | 95425 | 617 | 1637 |
Test 2:
| Annotation | # genes | # mRNA | # mRNA, UTR | # mRNA, UTR (both sides) | Mean 5'-UTR length | Mean 5'-UTR length |
|---|---|---|---|---|---|---|
| Assembly | 33902 | ? | ? | ? | ? | ? |
| Reference (UTR clipped) | 22192 | 68114 | 0 | 0 | 0 | 0 |
| LUTR | 22192 | 94728 | 79557 | 79121 | 9735 | 1828 |
Annotation statistics reported by Sqanti3
Figure 1: Sqanti3 results for test 1 (idealized assembly) Figure 2: Figure 2: Sqanti3 results for test 2 (real assembly)
FSM = Full splice match
ISM = Incomplete splice match
NIC = Novel isoform using combinations of known splice-sites
NNC = Novel isoform using unknown splice-sites
The reconstruction of the reference annotation from the idealized data is very close to the original.
At the same time LUTR inherits the problems of the methods used to create the input annotations.
LUTR uses hasmaps for parent-child relationships
After loading GFF-files a map that reflects parent to child relationships is constructed.
This allows fast extraction of records.
LUTR improves parallelization:
UTRpy: Splitting of annotations into chromosome slices
-> Number of usable CPU-cores limited by the number of chromosomes
-> Workload is not distributed evenly among processes
LUTR: Splitting of annotations into chromosome slices and into gene slices second
-> Number of usable CPU-cores practically unlimited
-> Even distribution of workload among processes
LUTR improves transcript matching:
UTRpy:
-> Attempts matching for each pair of assembled and predicted transcript on a chromosome.
-> Only allows only perfect transcript matches
LUTR:
-> Attempts matching only for pairs of assembled and predicted transcript when the assembled and predicted genes overlap
-> Acknowledes potential incompleteness of assembled transcriptomes or transcripts and provides parameters allowing imperfect matches
LUTR is significantly more efficient, especially for annotations from gene predictions, where genes may have multiple annotated isoforms.
Currently there is no publication on LUTR, please refer to this repository.
I would like to thank the Leibniz Institute on Aging – Fritz Lipmann Institute (FLI) and the chair of Prof. Dr. Steve Hoffmann in particular for providing me with the computational ressources required for the development and testing of this tool.




