3 Additional scripts

When comparing transcripts of the Bible in its original language and the Bible in Latin, Martin Luther realized that some of its meaning had been lost in translation. Similarly, UnTranslated Regions (UTRs) are lost in translation. However, to us biologist and bioinformaticians they are of great importance as they have a variety of regulatory functions and a genome annotation without them wouldn't really be complete, right?

Protein ortholog-based gene prediction enables the transfer of detailed gene structure and functional annotations across species by leveraging evolutionary conservation. However, genome annotations from such methods typically lack the less conserved UTRs. LUTR supplements these by using matching transcripts from reference-based transcriptome assemblies.

LUTR succeeds UTRpy

1 Installation

git clone https://github.com/SimonHegele/LUTR
cd LUTR
conda create -n lutr python=3.13
conda activate lutr
pip install .

Additional dependencies required to use the assembly pipeline:

2 Usage

usage: LUTR [-h] [-mf] [-mb] [-nm] [-ma] [-s] [-mupl] [-r] [-t] [-v] [-l] prediction assembly outdir

                    ----------------------------------------------------------------------------------------------------
                      ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   █░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░                                          ██░   ██░   ██░   ██░   ██░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   █░      █░     █░███████░  ██░       █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                    █░   ██░   ██░   ██░   ██░   █░  |█░    █|█░   █░   |█░   |█░ █░     █░   ██░   ██░   ██░   ██░   █░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   |█░     |█░   █░   |█░   |█░  █░     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░    |█░     |█░   █░   |█░   |█░ █░       ██░   ██░   ██░   ██░   ██░
                      ██░   ██░   ██░   ██░   ██░    |█░     |█░   █░   |█░   |█░█░        ██░   ██░   ██░   ██░   ██░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   |█░     |█░   █░   |█░   |█░ █░      █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                    █░   ██░   ██░   ██░   ██░   █░  |█░   █░|█░   █░   |█░   |█░  █░█░  █░   ██░   ██░   ██░   ██░   █░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░   █ ████░   ████░    |█░  |█░    █░    █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░                                          ██░   ██░   ██░   ██░   ██░
                     █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░ █░
                      ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   ██░   █░
                    ----------------------------------------------------------------------------------------------------

                    UTR-extensions for transcripts in functional annotations from orthology-based gene prediction tools
                    using matching transcripts in structural annotations from reference-based transcriptome assembly.

                    Additional parameter explanation:
                    *1 By default, LUTR only allows transcript matches in which all other exons between the first and
                       last matched exon must also be matched exactly. This is based on the assumption that UTRs can
                       affect splicing within the coding regions of a transcript, e.g. by changing its secondary
                       structure.
                    *2 By default, LUTR assignes assembled transcripts only to the matching predicted transcripts they
                       share the most bases with. Assigning them to all matching predicted transcripts can increase the
                       number of annotated UTRs but bears the risk of an increased rate of false positives.
                    ----------------------------------------------------------------------------------------------------


positional arguments:
  prediction                    Annotation from gene prediction (GFF)
  assembly                      Annotation from transcriptome assembly (GFF)
  outdir                        Output directory (Must not exist already)

options:
  -h, --help                    show this help message and exit

Transcript matching:
  -mf, --mftm                   Minimum fraction of the predicted transcript to be matched [default: 0.9]
  -mb, --mtbm                   Minimum bases of the predicted transcript to be matched [default: 10,000]
  -nm, --no_match_middle        Allows missing / additional middle exons in the transcrpt matching *1
  -ma, --match_all              Assign assembled transcripts to all matching predicted transcripts *2
  -mupl, --max_utr_piece_len    Limits the length of UTR-pieces allowed in UTR-variants # TODO

Transcript selection:
  -s , --select                 How to select from multiple UTR-variants [choices: shortest, longest, all] [default: all]
  -r, --remove_unsupported      Remove unmatched transcripts

Performance:
  -t , --threads                Number of parallel processes to use [Default:8]

Others:
  -v , --verbosity              Report progress every 10^v genes [default: 0]
  -l , --log_level              [default: info]

2.1 Input

Input annotations are required to be in GFF3 format. agat_sp_convert_sp_gxf2gxf.pl from AGAT can be used to convert from GTF to GFF.

To generate an assembly annotation matching your predicted one you can use the assembly pipeline described in section 3.

2.2 Output

Contents of the output directory:

Item	Explanation
lutr.gff	Final annotation
lutr.param	File with run parameters
lutr.log	File with log messages
tmpdir	"Temporary" directory
step_1	indicates completion of step 1

The "temporary" directory stores the annotations for each gene, however it is not deleted automatically. This allows skipping step 1 when re-running LUTR with different parameters.

Running agat_sp_convert_sp_gxf2gxf.pl from [AGAT] (https://github.com/NBISweden/AGAT) on lutr.gff is recommended for the removal of duplicated isoforms

3 Additional scripts

LUTR comes with two additional scripts:

assembly -h
noUTR -h

assembly
Assembly pipeline

Short read mapping with STAR (or input of mapped reads)
Long read mapping with Minimap2 (or input of mapped reads)
Assembly with StringTie

noUTR
Generating UTR-clipped annotations of coding genes
Caution: Duplicates from isoforms using identical CDS but different UTR are not removed. This can be done with agat_convert_gxf2gxf.pl!

4 LUTR workflow

Step 1: Gene pair slice extraction

LUTR creates GFF-slices for all predicted genes and their overlapping assembled genes.

GFF-File loading as pandas.DataFrame
GFF-File chromosome slice generation
Identification of pairs of overlapping predicted and assembled genes
GFF-File gene slice generation for gene pairs*

Step 2: Gene pair slice processing

Transcript matching, removal of unsupported transcripts and UTR-variant generation

Matching pairs of predicted and assembled transcripts exon by exon according to user defined parameters
Removal of unmatched predicted transcripts (only if parameter --remove was set.
Generation of UTR-variants if matching assembled transcripts extend predicted ones
Length based selection of UTR-variants (according to --select parameter)

5 Testing LUTR

5.1 Test data

5.1.1 Prediction

An idealized "prediction" of coding sequences was created from the mouse GRCm39 reference annotation.

noUTR.py was used to remove UTRs from the annotation.
agat_sp_convert_sp_gxf2gxf.pl from AGAT was used to remove duplicated transcript isoforms.

5.1.2. Assemblies

Assembly 1: perfect “assembly” / goal annotation (full reference gene records)

Assembly 2: ~600 million Illumina read pairs (SRA-ID: PRJNA375882), mapped with STAR ~ 60 million r2c2 nanopore reads (SRA-ID: PRJNA971991), mapped with Minimap2 assembled with StringTie3 in conservative mode

5.2 Computational ressources

Test 1:
required RAM: ~4.8GB
required Time: 05m26s (64 threads, server) / 30m20s (8 threads, laptop)

Test 2:
required RAM: ~3.9GB
required Time: 04m35s (64 threads, server) / 21m54 (8 threads, laptop)

The required time required to process a gene is highly dependent on the number of isoforms annotated.
While most genes are processed within a fraction of a second, genes with many isoforms may take more than a minute to process.

5.3 Results

5.2.1 Basic Annotation statistics

Annotation statistics reported by agat_sp_statistics.pl from AGAT:

Test 1:

Annotation	# genes	# mRNA	# mRNA, UTR	# mRNA, UTR (both sides)	Mean 5'-UTR length	Mean 5'-UTR length
Reference	22192	96192	94380	93379	614	1625
Reference (UTR clipped)	22192	68114	0	0	0	0
LUTR	22192	98317	96428	95425	617	1637

Test 2:

Annotation	# genes	# mRNA	# mRNA, UTR	# mRNA, UTR (both sides)	Mean 5'-UTR length	Mean 5'-UTR length
Assembly	33902	?	?	?	?	?
Reference (UTR clipped)	22192	68114	0	0	0	0
LUTR	22192	94728	79557	79121	9735	1828

5.2.2 Structural analysis of isoforms

Annotation statistics reported by Sqanti3

_{Figure 1: Sqanti3 results for test 1 (idealized assembly)} _{Figure 2: Figure 2: Sqanti3 results for test 2 (real assembly)}

FSM = Full splice match
ISM = Incomplete splice match
NIC = Novel isoform using combinations of known splice-sites
NNC = Novel isoform using unknown splice-sites

5.3.3 Interpretation

The reconstruction of the reference annotation from the idealized data is very close to the original.
At the same time LUTR inherits the problems of the methods used to create the input annotations.

6. Comparison to predecessor UTRpy

LUTR uses hasmaps for parent-child relationships
After loading GFF-files a map that reflects parent to child relationships is constructed.
This allows fast extraction of records.

LUTR improves parallelization:
UTRpy: Splitting of annotations into chromosome slices
-> Number of usable CPU-cores limited by the number of chromosomes
-> Workload is not distributed evenly among processes
LUTR: Splitting of annotations into chromosome slices and into gene slices second
-> Number of usable CPU-cores practically unlimited
-> Even distribution of workload among processes

LUTR improves transcript matching:
UTRpy:
-> Attempts matching for each pair of assembled and predicted transcript on a chromosome.
-> Only allows only perfect transcript matches
LUTR:
-> Attempts matching only for pairs of assembled and predicted transcript when the assembled and predicted genes overlap
-> Acknowledes potential incompleteness of assembled transcriptomes or transcripts and provides parameters allowing imperfect matches

LUTR is significantly more efficient, especially for annotations from gene predictions, where genes may have multiple annotated isoforms.

Citing LUTR

Currently there is no publication on LUTR, please refer to this repository.

Acknowledgements

I would like to thank the Leibniz Institute on Aging – Fritz Lipmann Institute (FLI) and the chair of Prof. Dr. Steve Hoffmann in particular for providing me with the computational ressources required for the development and testing of this tool.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
figures		figures
src		src
tests		tests
README.md		README.md
pytest.ini		pytest.ini
setup.py		setup.py
test_assembly.gff		test_assembly.gff
test_prediction.gff		test_prediction.gff

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

1 Installation

2 Usage

2.1 Input

2.2 Output

3 Additional scripts

4 LUTR workflow

Step 1: Gene pair slice extraction

Step 2: Gene pair slice processing

5 Testing LUTR

5.1 Test data

5.1.1 Prediction

5.1.2. Assemblies

5.2 Computational ressources

5.3 Results

5.2.1 Basic Annotation statistics

5.2.2 Structural analysis of isoforms

5.3.3 Interpretation

6. Comparison to predecessor UTRpy

Citing LUTR

Acknowledgements

About

Uh oh!

Releases 1

Packages

Contributors 2

Uh oh!

Languages

SimonHegele/LUTR

Folders and files

Latest commit

History

Repository files navigation

1 Installation

2 Usage

2.1 Input

2.2 Output

3 Additional scripts

4 LUTR workflow

Step 1: Gene pair slice extraction

Step 2: Gene pair slice processing

5 Testing LUTR

5.1 Test data

5.1.1 Prediction

5.1.2. Assemblies

5.2 Computational ressources

5.3 Results

5.2.1 Basic Annotation statistics

5.2.2 Structural analysis of isoforms

5.3.3 Interpretation

6. Comparison to predecessor UTRpy

Citing LUTR

Acknowledgements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Uh oh!

Languages

Packages