fix: four bugs in integrate and exon_fasta_to_twobit modules #33

Merged
ReverendCasy merged 1 commit into hillerlab:develop from anijudy:fix/integrate-sequence-files on May 13, 2026

@anijudy commented May 13, 2026

Found and fixed four bugs while running TOGA2 on RHEL8 (glibc 2.28)
with three references against a query genome.

exon_fasta_to_twobit.py

  • --fatotwobit_binary silently ignored: Line 166 hardcodes
    f"faToTwoBit {tmp_path} {output}" instead of using self.fa2twobit,
    so the --fatotwobit_binary CLI flag has no effect at the SLEASY step.
    Fixed to f"{self.fa2twobit} {tmp_path} {output}".
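
The effect of this change can be illustrated with a minimal sketch. It assumes only that the value of --fatotwobit_binary ends up in self.fa2twobit, as the description states; the class name TwoBitConverter, the method build_command, and the example paths are hypothetical and not taken from the TOGA2 code base.

```python
class TwoBitConverter:
    """Hypothetical stand-in for the class in exon_fasta_to_twobit.py."""

    def __init__(self, fa2twobit: str = "faToTwoBit") -> None:
        # Value passed via --fatotwobit_binary; the default relies on PATH lookup.
        self.fa2twobit = fa2twobit

    def build_command(self, tmp_path: str, output: str) -> str:
        # Before the fix the command was f"faToTwoBit {tmp_path} {output}",
        # which ignored self.fa2twobit entirely.
        return f"{self.fa2twobit} {tmp_path} {output}"


# Hypothetical binary location and file names, for illustration only.
conv = TwoBitConverter("/opt/kent/faToTwoBit")
cmd = conv.build_command("exons.fa", "exons.2bit")
print(cmd)
```

With the hardcoded string, a user-supplied binary path would never reach the command line, which matters on systems where the faToTwoBit found on PATH is missing or cannot run.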

integrate.py (prepare_sequence_files / _read_fasta)

  • gzip read mode: gzip.open was called with "rb" (bytes), but
    _read_fasta uses str.startswith, causing
    TypeError: startswith first arg must be bytes or a tuple of bytes, not str.
    Fixed to "rt".
  • projection lookup missing reference prefix: self.final_projections
    stores names as "{species}.{projection}" (e.g. hg38.XM_011510358.3#NBAS#328),
    but _read_fasta looked up the bare projection name without the prefix —
    so nothing ever matched and protein.fa.gz / nucleotide.fa.gz were
    always empty. Fixed to check f"{ref}.{proj}".
  • gzip write mode: the write step used open() instead of gzip.open(),
    producing plain-text files with a .fa.gz extension (invalid gzip).
    Fixed to gzip.open(..., "wt"/"at").
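
The three integrate.py fixes can be sketched together as a small round trip. This is an illustration under stated assumptions, not the actual TOGA2 implementation: read_fasta and write_fasta are hypothetical helpers, the record names and sequences are made up, and the FASTA headers are assumed to carry the bare projection name without the species prefix.

```python
import gzip
import os
import tempfile


def read_fasta(path, ref, final_projections):
    """Yield (name, sequence) for records whose prefixed name is accepted."""
    name, chunks = None, []
    # "rt" yields str lines, so line.startswith(">") works; "rb" would yield bytes.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # Look up "{ref}.{proj}", not the bare projection name.
                if name is not None and f"{ref}.{name}" in final_projections:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the last record if it passes the same check.
    if name is not None and f"{ref}.{name}" in final_projections:
        yield name, "".join(chunks)


def write_fasta(path, records, append=False):
    # gzip.open in "wt"/"at" mode writes a real gzip stream, unlike open().
    with gzip.open(path, "at" if append else "wt") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")


tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "input.fa.gz")
dst = os.path.join(tmp_dir, "protein.fa.gz")
write_fasta(src, [("XM_011510358.3#NBAS#328", "MEA"), ("other#X#1", "MKK")])

final_projections = {"hg38.XM_011510358.3#NBAS#328"}
kept = list(read_fasta(src, "hg38", final_projections))
write_fasta(dst, kept)

# A valid gzip file starts with the magic bytes 0x1f 0x8b.
with open(dst, "rb") as raw:
    magic = raw.read(2)
```

In this sketch, looking up the bare name against a set that stores prefixed names returns no records at all, which mirrors the empty protein.fa.gz / nucleotide.fa.gz symptom described above.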

constants.py (Nextflow template)

  • DSL2 incompatibility: the joblist guard and Channel.fromPath call
    were at the top level, which is not valid in Nextflow DSL2. Moved both
    inside the workflow {} block.

exon_fasta_to_twobit.py:
- Use self.fa2twobit instead of hardcoded "faToTwoBit" so that
  --fatotwobit_binary is actually respected at the SLEASY step

integrate.py (prepare_sequence_files / _read_fasta):
- Open gzipped input FASTAs in text mode ("rt") not binary ("rb");
  _read_fasta uses str.startswith which fails on bytes
- Check f"{ref}.{proj}" against self.final_projections instead of bare
  proj; final_projections stores names with the species prefix, so the
  bare lookup never matched and protein/nucleotide outputs were always empty
- Write integrated FASTAs with gzip.open instead of open so the .fa.gz
  output files are valid gzip archives

constants.py (Nextflow template):
- Move joblist guard and Channel.fromPath call inside the workflow {}
  block, required for Nextflow DSL2 compatibility

@ReverendCasy (Collaborator) commented

Thank you very much for your contribution. Sequence extraction for the “integrate” mode has mostly been tested with BigBed input, so reading from FASTA has not received enough attention so far.

@ReverendCasy ReverendCasy merged commit be8f28c into hillerlab:develop May 13, 2026
4 checks passed
ReverendCasy added a commit that referenced this pull request May 13, 2026