For those working on genome transposon annotation, EDTA (Extensive de novo TE Annotator) is a familiar name. It is currently recognized as one of the most accurate annotation pipelines, but the original workflow often presents several frustrating issues during execution:
1️⃣ Efficiency bottleneck: Serial execution leads to low CPU utilization, for example in the initial rawTE search phase.
2️⃣ Catastrophic crashes: After running for days, it may suddenly abort with an error that has no documented fix, even though some of these errors could be ignored without affecting the results.
3️⃣ Black-box operations: A tangled mix of lengthy Shell and Perl commands makes parameter adjustments cumbersome and the code difficult to read.
To address these problems, we recently overhauled EDTA completely, introducing the Nextflow workflow engine alongside Shell scripts to breathe new life into this well-established software.
⭐️If you encounter any issues, feel free to ask in the issue section. Please also support the original authors. If you use EDTA, kindly cite it:
Ou S., Su W., Liao Y., Chougule K., Agda J. R. A., Hellinga A. J., Lugo C. S. B., Elliott T. A., Ware D., Peterson T., Jiang N.✉, Hirsch C. N.✉ and Hufford M. B.✉ (2019). Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline. Genome Biol. 20(1): 275.
🚀For other modified versions of the software, please see: https://github.com/linyuiz/zgtools?tab=readme-ov-file#redesigned-software
To install, first download the latest distribution tarball, zgtools-EDTA_*.tar.gz (not one of the Source code files!), from the GitHub release page: https://github.com/linyuiz/EDTA_update/releases.
##EDTA install
mamba create -n EDTA_2.3 && conda activate EDTA_2.3
# fetch the raw YAML (the /blob/ URL returns an HTML page, not the file itself)
wget https://raw.githubusercontent.com/oushujun/EDTA/master/EDTA_2.3.yml && sed -i '1d' EDTA_2.3.yml
mamba env update -f EDTA_2.3.yml
mamba install 'pandas<3' tir-learner=3.0.7 repeatmodeler=2.0.5 #quoted so the shell does not treat < as a redirection; issue: https://github.com/oushujun/EDTA/issues/616#issuecomment-3855060533
##nextflow install
mamba create -n nextflow && conda activate nextflow
mamba install -c conda-forge -c bioconda nextflow==22.10.6
##zgtools install
tar -zxvf zgtools-EDTA_v2.3.0-4.tar.gz
cd zgtools-EDTA_v2.3.0-4 && chmod +x zg*
./zgtools EDTA_update
#If zg-EDTA_update cannot be found, please edit $ZG_BIN in zgtools.
#Otherwise, you just need to soft link zgtools to your usual bin folder such as【~/bin】, or use an absolute path such as【/project/software/zgtools EDTA_update】.
Usage:
zgtools EDTA_update genome.fa 1.3e-8 60 5 RepeatModeler2-families.fa curated.TElib.fa slurm EDTA_2.3 /opt/conda
genome.fa          --Genome File
1.3e-8             --Neutral mutation rate (Example: 1.3e-8 from rice, 7e-9 from Arabidopsis thaliana)
60                 --Each Task Threads
5                  --Parallel Task Num
*-families.fa      --RepeatModeler2 Library (default: none)
curated.TElib.fa   --Input Curated TE Library (default: none)
slurm              --Local/Slurm Mode (local/slurm)
EDTA_2.3           --Conda Env Name
/opt/conda         --Conda Path (Must Have: your_path/bin/activate)
Example1:
zgtools EDTA_update genome.fa 1.3e-8 60 2 none none local EDTA_2.3 /opt/conda
Example2:
zgtools EDTA_update genome.fa 1.3e-8 60 5 none curated.TElib.fa slurm EDTA_2.3 /opt/conda
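Because a run can take days, it may be worth checking the positional arguments before launching. The following is a minimal sketch, not part of zgtools: the function names `total_threads` and `preflight` are hypothetical, and the checks only cover what the parameter list above states (the genome file exists, the mode is local or slurm, and the conda path contains bin/activate).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; these are NOT shipped with zgtools.

# Total threads requested = Each Task Threads x Parallel Task Num.
total_threads() {
    echo $(( $1 * $2 ))
}

# Check the arguments that zgtools EDTA_update expects.
# Usage: preflight GENOME MODE CONDA_PATH
preflight() {
    local genome=$1 mode=$2 conda_path=$3
    # The genome file must exist and be non-empty.
    [ -s "$genome" ] || { echo "genome file missing or empty: $genome" >&2; return 1; }
    # Only the two documented modes are accepted.
    case "$mode" in
        local|slurm) ;;
        *) echo "mode must be local or slurm, got: $mode" >&2; return 1 ;;
    esac
    # The conda path must provide bin/activate.
    [ -f "$conda_path/bin/activate" ] || { echo "no bin/activate under: $conda_path" >&2; return 1; }
    echo "ok"
}
```

For instance, `preflight genome.fa slurm /opt/conda && total_threads 60 5` would print `ok` and the requested thread total only if all three checks pass.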
⭐️Note regarding genome ID format: Chromosome identifiers should follow formats like Chr1, chr1, Chr1A, or ChrA1. Unanchored sequences (contigs/scaffolds) should be named using the format scaffold1, scaffold2, etc.
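To spot sequence IDs that do not follow this naming scheme, something like the sketch below could be run on the genome first. The function name `check_genome_ids` is hypothetical, and the regular expression is an assumption derived only from the examples in the note (Chr1, chr1, Chr1A, ChrA1, scaffold1); it is deliberately loose and is not the check EDTA_update itself performs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every FASTA sequence ID that does not look like
# Chr1 / chr1 / Chr1A / ChrA1 or scaffold1, scaffold2, ...
# Usage: check_genome_ids GENOME.fa
check_genome_ids() {
    grep '^>' "$1" \
        | sed 's/^>//' \
        | awk '{print $1}' \
        | grep -Ev '^([Cc]hr[0-9A-Za-z]+|scaffold[0-9]+)$' || true
}
```

An empty output means every ID matched the assumed pattern; any printed ID is a candidate for renaming before the run.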
🚩Note that the total number of threads equals Each Task Threads multiplied by Parallel Task Num, for example: 60 x 3 = 180 threads.
🚩For a multi-node Slurm cluster, the EDTA conda environment must be installed in the same path on each node to ensure functionality. Alternatively, you can package all the EDTA_update scripts into a single image and distribute the Slurm tasks using that image.
🚩If you need a reliable TE library, you can check out: https://github.com/simonorozcoarias/PanTEon/
🚩Or you can download these two files, unzip them, and then concatenate their contents using cat: https://www.girinst.org/server/RepBase/protected/repeatmaskerlibraries/RepBaseRepeatMaskerEdition-20181026.tar.gz and https://www.dfam.org/releases/current/families/Dfam-RepeatMasker.lib.gz.
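The concatenation step can be sketched as a small helper. This is only an illustration: the function name `merge_te_libs` and the file names in the usage line are hypothetical, and it assumes the libraries have already been downloaded and are available as FASTA (plain or gzip-compressed); the layout inside the archives is not checked here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concatenate FASTA libraries into one file,
# decompressing .gz inputs on the fly.
# Usage: merge_te_libs OUT.fa IN1 [IN2 ...]
merge_te_libs() {
    local out=$1; shift
    : > "$out"                      # start from an empty output file
    local f
    for f in "$@"; do
        case "$f" in
            # "awk 1" re-emits each line with a trailing newline, so a file
            # lacking a final newline cannot fuse with the next header.
            *.gz) gzip -dc "$f" | awk 1 >> "$out" ;;
            *)    awk 1 "$f" >> "$out" ;;
        esac
    done
    # Report how many sequences ended up in the merged library.
    grep -c '^>' "$out"
}
```

A call such as `merge_te_libs curated.TElib.fa repbase.fa Dfam-RepeatMasker.lib.gz` (with `repbase.fa` standing in for whatever FASTA was extracted from the first archive) would print the number of sequences written, and the result could then be passed as the curated TE library argument.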
Runtime log for the command【zgtools EDTA_update genome.fa 7e-9 60 5 RM2-families.fa Plant.TElib.fa slurm EDTA_2.3 /opt/conda】:
#######Data#######
Genome: /test/13.EDTA_update/plant.genome.fa
Neutral mutation rate: 7e-9
Each Task Threads: 60
Parallel Task Num: 5
Exist RepeatModeler Lib: /test/13.EDTA_update/genome.fa.mod-families.fa
Curated TE lib: /test/13.EDTA_update/Plant.TElib.fa
Local/Slurm Mode: local
Conda Env Name: EDTA_2.3
Conda Path: /opt/conda
#######Run#######
1. transcode genome ...
Genome Size: 755,620,956 bp
2. denovo discover raw TEs ...
2.1. parallel discover TEs, threads: 60
2.2. deal with rawTE output ...
2.3. check rawTE results ...
2.4. modify LTR insert time ...
LTR insert time file: /test/13.EDTA_update/output_of_EDTA_update/LTR_insert_time.txt
3. filter raw TE candidates and the make stage 1 library ...
3.1. purify raw LTR/Helitron/TIR ...
3.2. clean other TEs ...
3.3. clean LINEs and LTRs in SINEs ...
3.4. clean LTRs and nonLTRs in TIRs and Helitrons ...
3.6. check stg1 raw library ...
4. merge other TE library ...
4.1. identify remaining TEs in the filtered RM2 library ...
4.2. remove known TEs in the EDTA library ...
5. Post-library annotate ...
5.1. split genome ...
5.2. annotate TEs using RepeatMasker ...
5.3. merge RepeatMasker output ...
5.4. make summary table for the non-overlapping annotation ...
5.5. generate masked genome ...
#######Results#######
Output: /test/13.EDTA_update/output_of_EDTA_update/
Order         Count     bpMasked      %masked
DNA           162,902   53,855,048    7.13
LINE          22,060    6,791,674     0.90
SINE          14,720    3,272,878     0.43
LTR           339,733   221,404,889   29.30
LTR/Copia     89,659    62,250,968    8.24
LTR/Gypsy     173,449   129,528,813   17.14
LTR/ERV       1,555     164,971       0.02
LTR/unknown   67,643    32,182,124    4.26
Unknown       127,859   43,488,266    5.76
Total         684,962   328,023,621   43.41
The Nextflow execution trace in the diagram has been hidden. For the specific time consumed by each process, please refer to the actual run .log file.
⭐️The above tests were conducted on four nodes, each with 1TB of memory and 256 threads.
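The %masked column in the results table is consistent with bpMasked divided by the reported genome size (755,620,956 bp). As a quick sanity check on your own run, the arithmetic can be reproduced with a one-line helper; the name `pct_masked` is hypothetical and the two-decimal rounding is an assumption that simply matches the table's formatting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of the genome masked, two decimals.
# Usage: pct_masked BP_MASKED GENOME_SIZE   (plain integers, no commas)
pct_masked() {
    awk -v bp="$1" -v size="$2" 'BEGIN { printf "%.2f\n", 100 * bp / size }'
}

pct_masked 328023621 755620956   # Total row of the table above
```

With the Total row this gives 43.41, and with the DNA row (53,855,048 bp) it gives 7.13, both matching the log.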
We recommend refining the output EDTA.TElib.fa with TEtrimmer for better TE annotation results.
The output files largely match those of the original EDTA; the ⭐️-marked files are the ones most people commonly use.
├── 01.EDTA.raw
│ ├── genome.fa.mod.EDTA.intact.raw.fa ⭐️
│ ├── genome.fa.mod.EDTA.intact.raw.gff3 ⭐️
│ ├── genome.fa.mod.Helitron.intact.raw.bed
│ ├── genome.fa.mod.Helitron.intact.raw.fa
│ ├── genome.fa.mod.Helitron.intact.raw.fa.anno.list
│ ├── genome.fa.mod.Helitron.intact.raw.gff3
│ ├── genome.fa.mod.LINE.raw.fa
│ ├── genome.fa.mod.LTR.intact.raw.fa
│ ├── genome.fa.mod.LTR.intact.raw.fa.anno.list
│ ├── genome.fa.mod.LTR.intact.raw.gff3
│ ├── genome.fa.mod.LTR.raw.fa
│ ├── genome.fa.mod.RM2.fa
│ ├── genome.fa.mod.SINE.raw.fa
│ ├── genome.fa.mod.TIR.intact.raw.bed
│ ├── genome.fa.mod.TIR.intact.raw.fa
│ ├── genome.fa.mod.TIR.intact.raw.fa.anno.list
│ └── genome.fa.mod.TIR.intact.raw.gff3
├── 02.EDTA.combine
│ ├── genome.fa.mod.EDTA.fa.stg1
│ ├── genome.fa.mod.EDTA.intact.fa.cln
│ ├── genome.fa.mod.Helitron.intact.raw.fa.cln
│ ├── genome.fa.mod.Helitron.intact.raw.fa.int.cln
│ ├── genome.fa.mod.LINE.raw.fa
│ ├── genome.fa.mod.LTR.intact.raw.fa.cln
│ ├── genome.fa.mod.LTR.raw.fa.cln
│ ├── genome.fa.mod.SINE.raw.fa.cln
│ ├── genome.fa.mod.TIR.intact.raw.fa.cln
│ └── genome.fa.mod.TIR.intact.raw.fa.int.cln
├── 03.EDTA.final
│ ├── genome.fa.mod.EDTA.TElib.fa
│ ├── genome.fa.mod.EDTA.TElib.merge.fa ⭐️ (HQlib + EDTA_denovo_lib)
│ └── genome.fa.mod.EDTA.TElib.novel.fa ⭐️
├── 04.EDTA.anno
│ ├── genome.fa.mod.EDTA.TEanno.gff3 ⭐️
│ ├── genome.fa.mod.EDTA.TEanno.out ⭐️
│ ├── genome.repeat_hard_masked.fa ⭐️
│ ├── genome.repeat_soft_masked.fa ⭐️
│ └── genome.fa.mod.EDTA.TEanno.sum ⭐️
└── LTR_insert_time.txt ⭐️
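For a first look at the annotation in 04.EDTA.anno, the GFF3 can be tallied by feature type. The sketch below is a generic GFF3 summary, not a zgtools command: the function name `gff3_type_counts` is hypothetical, and it relies only on the GFF3 convention that column 3 holds the feature type.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count annotated features per type (GFF3 column 3),
# most frequent first.
# Usage: gff3_type_counts FILE.gff3
gff3_type_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) print n[t] "\t" t }' "$1" \
        | sort -k1,1nr -k2,2
}
```

Running `gff3_type_counts genome.fa.mod.EDTA.TEanno.gff3` prints one line per feature type with its count, which can be compared against the summary in the .sum file.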
