fix: four bugs in integrate and exon_fasta_to_twobit modules by anijudy · Pull Request #33 · hillerlab/TOGA2

anijudy · 2026-05-13T16:25:16Z

Found and fixed four bugs while running TOGA2 on RHEL8 (glibc 2.28)
with three references against a query genome.

`exon_fasta_to_twobit.py`

- `--fatotwobit_binary` silently ignored: line 166 hardcodes
  `f"faToTwoBit {tmp_path} {output}"` instead of using `self.fa2twobit`,
  so the `--fatotwobit_binary` CLI flag has no effect at the SLEASY step.
  Fixed to `f"{self.fa2twobit} {tmp_path} {output}"`.
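
For illustration, the corrected command construction can be sketched as below. This is a minimal sketch, not the actual TOGA2 code: the helper `build_fa2twobit_cmd` and the example paths are hypothetical; only `self.fa2twobit`, `tmp_path`, and `output` come from the PR.

```python
def build_fa2twobit_cmd(fa2twobit: str, tmp_path: str, output: str) -> str:
    """Hypothetical helper: compose the faToTwoBit call from a configurable binary path.

    `fa2twobit` stands in for self.fa2twobit, i.e. the value passed
    via --fatotwobit_binary, instead of the literal name "faToTwoBit".
    """
    return f"{fa2twobit} {tmp_path} {output}"


# Example paths are made up for the demonstration.
cmd = build_fa2twobit_cmd("/opt/kent/faToTwoBit", "tmp.fa", "exons.2bit")
```

With the hardcoded string, the first token would stay `faToTwoBit` regardless of the flag, so the call would depend on whatever binary of that name is found on `PATH`.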

`integrate.py` (`prepare_sequence_files` / `_read_fasta`)

- gzip read mode: `gzip.open` was called with `"rb"` (bytes), but
  `_read_fasta` uses `str.startswith`, causing
  `TypeError: startswith first arg must be bytes or a tuple of bytes, not str`.
  Fixed to `"rt"`.
- projection lookup missing reference prefix: `self.final_projections`
  stores names as `"{species}.{projection}"` (e.g. `hg38.XM_011510358.3#NBAS#328`),
  but `_read_fasta` looked up the bare projection name without the prefix,
  so nothing ever matched and `protein.fa.gz` / `nucleotide.fa.gz` were
  always empty. Fixed to check `f"{ref}.{proj}"`.
- gzip write mode: the write step used `open()` instead of `gzip.open()`,
  producing plain-text files with a `.fa.gz` extension (invalid gzip).
  Fixed to `gzip.open(..., "wt"/"at")`.
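
The three fixes above can be sketched together as follows. This is a simplified illustration under stated assumptions, not the real `integrate.py` code: `read_fasta` and `write_fasta` are hypothetical stand-ins, the record names are made up, and the FASTA headers are assumed to carry the bare projection name.

```python
import gzip
import os
import tempfile
from typing import Dict, Optional, Set


def read_fasta(path: str, ref: str, final_projections: Set[str]) -> Dict[str, str]:
    """Hypothetical stand-in for _read_fasta: keep only expected projections."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    # "rt" yields str lines, so str.startswith(">") works; "rb" would yield bytes.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                proj = line[1:]
                # Stored names look like "{species}.{projection}", so add the prefix.
                name = proj if f"{ref}.{proj}" in final_projections else None
                if name is not None:
                    records[name] = ""
            elif name is not None:
                records[name] += line
    return records


def write_fasta(path: str, records: Dict[str, str], mode: str = "wt") -> None:
    """Write records through gzip.open so the .fa.gz file is real gzip data."""
    with gzip.open(path, mode) as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n{seq}\n")


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.fa.gz")
    write_fasta(src, {"XM_1#GENE#1": "ATGC", "XM_2#GENE#2": "GGCC"})
    kept = read_fasta(src, "hg38", {"hg38.XM_1#GENE#1"})
    out = os.path.join(tmp, "nucleotide.fa.gz")
    write_fasta(out, kept)
    with open(out, "rb") as raw:
        magic = raw.read(2)  # gzip streams start with bytes 0x1f 0x8b
```

Looking up the bare `proj` in the same set would match nothing, which is how the outputs ended up empty before the fix.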

`constants.py` (Nextflow template)

DSL2 incompatibility: the joblist guard and Channel.fromPath call
were at the top level, which is not valid in Nextflow DSL2. Moved both
inside the workflow {} block.

exon_fasta_to_twobit.py:
- Use self.fa2twobit instead of hardcoded "faToTwoBit" so that --fatotwobit_binary is actually respected at the SLEASY step

integrate.py (prepare_sequence_files / _read_fasta):
- Open gzipped input FASTAs in text mode ("rt") not binary ("rb"); _read_fasta uses str.startswith which fails on bytes
- Check f"{ref}.{proj}" against self.final_projections instead of bare proj; final_projections stores names with the species prefix, so the bare lookup never matched and protein/nucleotide outputs were always empty
- Write integrated FASTAs with gzip.open instead of open so the .fa.gz output files are valid gzip archives

constants.py (Nextflow template):
- Move joblist guard and Channel.fromPath call inside the workflow {} block, required for Nextflow DSL2 compatibility

ReverendCasy · 2026-05-13T21:01:50Z

Thank you very much for your contribution. Sequence extraction for the “integrate” mode has mostly been tested with BigBed input, so reading from FA has not received enough attention so far.

ReverendCasy merged commit be8f28c into hillerlab:develop May 13, 2026
4 checks passed

ReverendCasy added a commit that referenced this pull request May 13, 2026

doubling changes from #33

6688c9b

ReverendCasy added a commit that referenced this pull request May 13, 2026

doubling changes from #33

793b6be

ReverendCasy added a commit that referenced this pull request May 13, 2026

doubling changes from #33

9ce9495

ReverendCasy merged 1 commit into hillerlab:develop from anijudy:fix/integrate-sequence-files
