
Create cleanup_nested_mmseq.pl for speedup by mmseqs #590

Open
starskyzheng wants to merge 1 commit into oushujun:master from starskyzheng:patch-1



@starskyzheng starskyzheng commented Jul 11, 2025

#Update: 07/11/2025 by Zeyu Zheng, MMSeqs optimization for faster sequence search

PR checklist

  • PR to nextflow_reboot branch
  • conda and container directives.
  • Docker container + singularity container (optional)
  • Flow meta.id with each data channel
  • Use nf-core resource labels such as process_high
  • Used nf-core module
  • Use versions.yml or versions topic
  • No process in the main.nf. We can have a process in a sub-workflow file

@oushujun
Owner

Hello Zeyu,

Thank you for your contribution! It looks like a nice integration of a much faster aligner. Can you please provide some comparisons between the old and new versions? Some benchmarks on the rice genome would also be great.

Thanks!
Shujun

@oushujun
Owner

Hello Zeyu,

Sorry for the delay. We just got a chance to test this script. In rice, it produces results highly similar to those of the original cleanup_nested script. However, the mmseqs script is much slower on our end despite similar CPU usage:

Original cleanup_nested:

Wed Jan 14 09:17:44 EST 2026 Clean up nested insertions and redundancy. Working on iteration 0
Wed Jan 14 09:23:52 EST 2026 Clean up nested insertions and redundancy. Working on iteration 1
Wed Jan 14 09:29:36 EST 2026 Clean up nested insertions and redundancy. Working on iteration 2
Wed Jan 14 09:35:19 EST 2026 Clean up nested insertions and redundancy. Working on iteration 3
Wed Jan 14 09:41:03 EST 2026 Clean up nested insertions and redundancy. Working on iteration 4
Saturated at iter4, automatically stop.

    Command being timed: "perl /anvil/projects/x-bio250178/Jung/EDTA/EDTA/bin/cleanup_nested.pl -in /anvil/projects/x-bio250178/Jung/Benchmark/rerun_1_13/all.con.fasta.mod.EDTA.final/all.con.fasta.mod.EDTA.raw.fa -threads 20 -minlen 80 -cov 0.95 -blastplus /home/x-yna1/miniconda3/envs/EDTA/bin/"
    User time (seconds): 32480.63
    System time (seconds): 1477.29
    Percent of CPU this job got: 1949%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 29:01.61
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 96964
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 50151
    Minor (reclaiming a frame) page faults: 259370273
    Voluntary context switches: 5533808
    Involuntary context switches: 310936
    Swaps: 0
    File system inputs: 282624
    File system outputs: 80
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

New cleanup_nested_mmseqs:

Wed Jan 14 09:51:45 EST 2026 Clean up nested insertions and redundancy. Working on iteration 0
Wed Jan 14 10:01:22 EST 2026 Clean up nested insertions and redundancy. Working on iteration 1
Wed Jan 14 10:11:55 EST 2026 Clean up nested insertions and redundancy. Working on iteration 2
Wed Jan 14 10:22:35 EST 2026 Clean up nested insertions and redundancy. Working on iteration 3
Wed Jan 14 10:33:44 EST 2026 Clean up nested insertions and redundancy. Working on iteration 4
Wed Jan 14 10:44:42 EST 2026 Clean up nested insertions and redundancy. Working on iteration 5
Wed Jan 14 10:55:45 EST 2026 Clean up nested insertions and redundancy. Working on iteration 6
Wed Jan 14 11:06:54 EST 2026 Clean up nested insertions and redundancy. Working on iteration 7
Wed Jan 14 11:18:05 EST 2026 Clean up nested insertions and redundancy. Working on iteration 8
Wed Jan 14 11:29:13 EST 2026 Clean up nested insertions and redundancy. Working on iteration 9
Wed Jan 14 11:38:22 EST 2026 Clean up nested insertions and redundancy. Working on iteration 10
Wed Jan 14 11:46:26 EST 2026 Clean up nested insertions and redundancy. Working on iteration 11
Wed Jan 14 11:56:55 EST 2026 Clean up nested insertions and redundancy. Working on iteration 12
Wed Jan 14 12:07:32 EST 2026 Clean up nested insertions and redundancy. Working on iteration 13
Wed Jan 14 12:15:35 EST 2026 Clean up nested insertions and redundancy. Working on iteration 14
Wed Jan 14 12:24:32 EST 2026 Clean up nested insertions and redundancy. Working on iteration 15
Wed Jan 14 12:35:02 EST 2026 Clean up nested insertions and redundancy. Working on iteration 16
Wed Jan 14 12:45:43 EST 2026 Clean up nested insertions and redundancy. Working on iteration 17
Saturated at iter17, automatically stop.

    Command being timed: "perl /anvil/projects/x-bio250178/Jung/EDTA/EDTA/bin/cleanup_nested_mmseq.pl -in /anvil/projects/x-bio250178/Jung/Benchmark/rerun_1_13/all.con.fasta.mod.EDTA.final/all.con.fasta.mod.EDTA.raw.fa -threads 20 -minlen 80 -cov 0.95"
    User time (seconds): 33301.79
    System time (seconds): 174294.05
    Percent of CPU this job got: 1875%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 3:04:27
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 8601424
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 31686442
    Minor (reclaiming a frame) page faults: 31440184506
    Voluntary context switches: 108774932
    Involuntary context switches: 5241674
    Swaps: 0
    File system inputs: 5952024
    File system outputs: 375919232
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
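As a rough way to read the two `time` reports side by side, the sketch below computes new-to-original ratios for a few fields. The numbers are copied from the reports above; the helper name `elapsed_to_seconds` and the dictionary keys are made up for this illustration and are not part of either script.

```python
def elapsed_to_seconds(text):
    """Convert a GNU time elapsed field ('h:mm:ss' or 'm:ss') to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# Values copied from the two reports above (keys are illustrative names).
original = {"elapsed": "29:01.61", "user_s": 32480.63, "system_s": 1477.29,
            "max_rss_kb": 96964, "fs_outputs": 80}
mmseqs = {"elapsed": "3:04:27", "user_s": 33301.79, "system_s": 174294.05,
          "max_rss_kb": 8601424, "fs_outputs": 375919232}

# Ratio of the mmseqs run to the original run for each field.
ratios = {
    "wall_clock": elapsed_to_seconds(mmseqs["elapsed"])
                  / elapsed_to_seconds(original["elapsed"]),
    "user_time": mmseqs["user_s"] / original["user_s"],
    "system_time": mmseqs["system_s"] / original["system_s"],
    "max_rss": mmseqs["max_rss_kb"] / original["max_rss_kb"],
    "fs_outputs": mmseqs["fs_outputs"] / original["fs_outputs"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

User time is nearly the same in both runs, while system time, peak memory, and file system outputs are far higher for the mmseqs run. That may point to I/O or temporary-file overhead rather than alignment compute, but this is only a guess from the counters; the reports do not confirm the cause.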

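The new script runs more iterations (18 vs. 5), and it is slower even over the same initial iterations (iter0-4): by the timestamps above, it reached the start of iteration 4 after about 42 min and the start of iteration 5 after about 53 min, while the original finished its entire run in 29 min. Can you please double-check the speed-up aspect of the script?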
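The per-iteration timing can be checked directly from the log timestamps. The sketch below is a minimal example, assuming each log line begins with a `date`-style stamp (weekday, month, day, time, timezone, year) as in the logs above; the function names are hypothetical, and only the first lines of each log are included.

```python
from datetime import datetime

def start_time(log_line):
    """Parse the leading timestamp of a log line; the timezone token is ignored."""
    _weekday, month, day, clock, _tz, year = log_line.split()[:6]
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")

def seconds_between_starts(log_lines):
    """Seconds from the start of each iteration to the start of the next one."""
    starts = [start_time(line) for line in log_lines]
    return [int((b - a).total_seconds()) for a, b in zip(starts, starts[1:])]

# Log lines copied from the benchmark above (original: iter 0-4, mmseqs: iter 0-5).
original_log = [
    "Wed Jan 14 09:17:44 EST 2026 Clean up nested insertions and redundancy. Working on iteration 0",
    "Wed Jan 14 09:23:52 EST 2026 Clean up nested insertions and redundancy. Working on iteration 1",
    "Wed Jan 14 09:29:36 EST 2026 Clean up nested insertions and redundancy. Working on iteration 2",
    "Wed Jan 14 09:35:19 EST 2026 Clean up nested insertions and redundancy. Working on iteration 3",
    "Wed Jan 14 09:41:03 EST 2026 Clean up nested insertions and redundancy. Working on iteration 4",
]
mmseqs_log = [
    "Wed Jan 14 09:51:45 EST 2026 Clean up nested insertions and redundancy. Working on iteration 0",
    "Wed Jan 14 10:01:22 EST 2026 Clean up nested insertions and redundancy. Working on iteration 1",
    "Wed Jan 14 10:11:55 EST 2026 Clean up nested insertions and redundancy. Working on iteration 2",
    "Wed Jan 14 10:22:35 EST 2026 Clean up nested insertions and redundancy. Working on iteration 3",
    "Wed Jan 14 10:33:44 EST 2026 Clean up nested insertions and redundancy. Working on iteration 4",
    "Wed Jan 14 10:44:42 EST 2026 Clean up nested insertions and redundancy. Working on iteration 5",
]

original_gaps = seconds_between_starts(original_log)  # iterations 0-3
mmseqs_gaps = seconds_between_starts(mmseqs_log)      # iterations 0-4

print("original, seconds per iteration:", original_gaps)
print("mmseqs, seconds per iteration:", mmseqs_gaps)
print("mmseqs, minutes for iter0-4:", round(sum(mmseqs_gaps) / 60, 1))
```

The original log has no timestamp for the end of iteration 4, so its last iteration can only be bounded by the total wall-clock time in the `time` report.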

Thank you for your contribution!
Shujun

oushujun added a commit that referenced this pull request Mar 2, 2026
… The script has not been used in the EDTA pipeline