BREAKING CHANGE: v0.0.5 -> chaintools antirepeat x117 speedup impl #4
Merged
Full refactor of the antirepeat tool. On a ~200 MB soft-masked chain file with .2bit reference/query it previously took ~12 h; it now completes in minutes, with byte-identical output.
Root cause
antirepeat used a lazy .2bit access path: for every chain below --no-check-score it re-read the sequence from disk and, per fetch, the
twobit reader scanned the chromosome's soft-mask/N block list from the start (a linear
skip_while, no binary search). On repeat-masked genomes, with hundreds of thousands of mask blocks per chromosome,
each of the ~millions of per-chain fetches re-scanned a large prefix, and a cost that
grows with chromosome length dominated the run. Parallelism was also off by default.
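
To make the cost difference concrete, here is a minimal sketch of the two lookup strategies over a sorted, non-overlapping block list. The Block type and both function names are hypothetical and are not taken from the .2bit reader. Note that the fix in this PR is the one-time preload described under Performance, not a binary search; the second function only illustrates the lookup the old path lacked.

```rust
/// A soft-mask or N block as a half-open interval [start, end).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Block {
    pub start: usize,
    pub end: usize,
}

/// Linear scan: walk from the first block and skip every block that ends
/// at or before the fetch start. Cost grows with the number of blocks
/// that precede the fetch, so late fetches on a big chromosome are slow.
pub fn overlapping_linear(blocks: &[Block], from: usize, to: usize) -> Vec<Block> {
    blocks
        .iter()
        .skip_while(|b| b.end <= from)
        .take_while(|b| b.start < to)
        .copied()
        .collect()
}

/// Binary search: partition_point locates the first block whose end is
/// past the fetch start in O(log n). This relies on the blocks being
/// sorted and non-overlapping, so that `end` is monotonically increasing.
pub fn overlapping_binary(blocks: &[Block], from: usize, to: usize) -> Vec<Block> {
    let first = blocks.partition_point(|b| b.end <= from);
    blocks[first..]
        .iter()
        .take_while(|b| b.start < to)
        .copied()
        .collect()
}
```

Both functions return the same blocks for any fetch window; only the work needed to find the first overlapping block differs.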
Performance
.2bit reference/query are now fully decoded into memory at startup (the soft-mask/N scan is paid once per chromosome instead of once
per chain), so every per-chain access is an in-memory lookup. This is the single biggest
win and turns the ~12 h run into minutes.
loads just the sequences it references, bounding peak memory on fragmented assemblies
(stdin input falls back to loading everything).
parallel feature is now part of the cli build, so chains are filtered across all cores out of the box; the previous
--threads startup error (when built without the feature) is gone.
of copying each chain's span; minus-strand queries are reverse-complemented on the fly
during the walk rather than copied and reversed. On large-span, repeat-driven chains this
is a further ~9.6× faster and ~2× lower peak memory in benchmarking.
chain's aligned blocks.
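
The on-the-fly reverse complement mentioned above can be sketched as a lazy iterator over a borrowed plus-strand slice. This is an illustration under assumptions, not the PR's code: the function names are hypothetical, and the coordinate mapping assumes the usual chain-format convention that minus-strand positions are measured on the reverse-complemented sequence.

```rust
/// Complement of one nucleotide byte, preserving case so that
/// soft-masking (lowercase) survives the transformation.
fn complement(base: u8) -> u8 {
    match base {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        b'a' => b't',
        b'c' => b'g',
        b'g' => b'c',
        b't' => b'a',
        other => other, // N and any other symbol pass through unchanged
    }
}

/// Lazily yield the reverse complement of plus[start..end].
/// Nothing is copied or allocated; bases are complemented as they are read.
pub fn revcomp_iter(plus: &[u8], start: usize, end: usize) -> impl Iterator<Item = u8> + '_ {
    plus[start..end].iter().rev().map(|&b| complement(b))
}

/// Count soft-masked bases in a minus-strand window [minus_start, minus_end).
/// The window maps to plus-strand [len - minus_end, len - minus_start).
pub fn masked_on_minus(plus: &[u8], minus_start: usize, minus_end: usize) -> usize {
    let len = plus.len();
    revcomp_iter(plus, len - minus_end, len - minus_start)
        .filter(|b| b.is_ascii_lowercase())
        .count()
}
```

Because the iterator borrows the preloaded sequence, a long chain span costs no extra memory beyond the sequence itself.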
Changed
AntiRepeatEngine::chain_passes no longer takes a SequenceCache; sequence access is now through the new
SequenceResolver::chromosome() borrowing accessor.
score tool also benefits from the in-memory .2bit preload, as it shares the sequence resolver.
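
For readers unfamiliar with the pattern, a borrowing accessor over preloaded sequences might look roughly like the sketch below. Only the names SequenceResolver and chromosome() come from this PR; the field layout, the from_decoded constructor, the return type, and the masked_fraction consumer are all assumptions made for illustration and may differ from the real API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the resolver: every sequence is decoded
/// once up front and kept in memory for the whole run.
pub struct SequenceResolver {
    chromosomes: HashMap<String, Vec<u8>>,
}

impl SequenceResolver {
    /// Hypothetical constructor taking already-decoded sequences.
    pub fn from_decoded(seqs: impl IntoIterator<Item = (String, Vec<u8>)>) -> Self {
        Self {
            chromosomes: seqs.into_iter().collect(),
        }
    }

    /// Borrowing accessor: hands out a slice into the preloaded data,
    /// so callers never copy or re-read a sequence.
    pub fn chromosome(&self, name: &str) -> Option<&[u8]> {
        self.chromosomes.get(name).map(|v| v.as_slice())
    }
}

/// Example consumer: fraction of soft-masked (lowercase) bases in
/// [start, end). Returns None for unknown names, out-of-range windows,
/// or empty windows.
pub fn masked_fraction(
    resolver: &SequenceResolver,
    name: &str,
    start: usize,
    end: usize,
) -> Option<f64> {
    let seq = resolver.chromosome(name)?;
    let window = seq.get(start..end)?;
    if window.is_empty() {
        return None;
    }
    let masked = window.iter().filter(|b| b.is_ascii_lowercase()).count();
    Some(masked as f64 / window.len() as f64)
}
```

Since the accessor takes a shared reference, many worker threads can read the same resolver concurrently without locking.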
Notes
content, gzip input/output, and any thread count, via a randomized differential test and
full old-vs-new output diffing.
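
The test code itself is not shown in this description. As a rough idea of what a randomized differential test involves, here is a self-contained sketch: feed the same random inputs to an old and a new implementation and report the first input on which they disagree. Every name here (XorShift, random_sequence, first_mismatch) is hypothetical, and the small PRNG is used only to avoid an external crate.

```rust
/// Tiny deterministic PRNG (xorshift64) so the sketch needs no dependencies.
pub struct XorShift(u64);

impl XorShift {
    /// The state must be non-zero, so a zero seed is bumped to 1.
    pub fn new(seed: u64) -> Self {
        Self(seed.max(1))
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in 0..n (slightly biased, which is fine for test input).
    pub fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Random soft-masked sequence of length 0..=max_len.
pub fn random_sequence(rng: &mut XorShift, max_len: u64) -> Vec<u8> {
    const ALPHABET: &[u8] = b"ACGTacgtN";
    let len = rng.below(max_len + 1);
    (0..len)
        .map(|_| ALPHABET[rng.below(ALPHABET.len() as u64) as usize])
        .collect()
}

/// Run both implementations on `rounds` random inputs and return the
/// first input where their results differ, or None if they always agree.
pub fn first_mismatch<T: PartialEq>(
    old: impl Fn(&[u8]) -> T,
    new: impl Fn(&[u8]) -> T,
    rounds: usize,
    seed: u64,
) -> Option<Vec<u8>> {
    let mut rng = XorShift::new(seed);
    (0..rounds)
        .map(|_| random_sequence(&mut rng, 64))
        .find(|seq| old(seq.as_slice()) != new(seq.as_slice()))
}
```

A fixed seed keeps failures reproducible, and returning the failing input makes it easy to turn a mismatch into a regression test.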