Implement FASTA/FASTQ parser for #60 #97

Closed
crashfrog wants to merge 2 commits into main from worktree-agent-a6f7a8dd
@crashfrog
Member

Summary

Implements the FASTA/FASTQ parser (parse_sequences) for phraya-io, fulfilling all acceptance criteria for issue #60:

  • Parses FASTA files (single/multiple sequences, wrapped lines)
  • Parses FASTQ files (4-line format with quality scores)
  • Auto-detects format via magic bytes ('>' for FASTA, '@' for FASTQ)
  • Supports gzip-compressed files (.fa.gz, .fq.gz)
  • Returns iterator for memory-efficient streaming
  • Validates quality score length matches sequence length
  • Handles empty files gracefully
  • Rejects malformed files with clear ParseError messages
  • Accepts IUPAC ambiguity codes (N, R, Y, S, W, K, M, B, D, H, V)
  • Case-insensitive DNA base validation
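
The detection and validation rules listed above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `Format`, `detect_format`, `is_valid_base`, and `quality_matches` are hypothetical names, and skipping leading whitespace before looking at the first byte is an assumption.

```rust
/// Hypothetical format tag; the PR does not show the real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Fasta,
    Fastq,
}

/// Guess the format from the first non-whitespace byte.
/// Returns `None` for empty or whitespace-only input and for any other byte.
pub fn detect_format(data: &[u8]) -> Option<Format> {
    match data.iter().copied().find(|b| !b.is_ascii_whitespace()) {
        Some(b'>') => Some(Format::Fasta),
        Some(b'@') => Some(Format::Fastq),
        _ => None,
    }
}

/// True for A, C, G, T and the IUPAC ambiguity codes named in the summary,
/// in upper or lower case. Anything else (including U) is rejected here.
pub fn is_valid_base(b: u8) -> bool {
    matches!(
        b.to_ascii_uppercase(),
        b'A' | b'C' | b'G' | b'T' | b'N' | b'R' | b'Y' | b'S'
            | b'W' | b'K' | b'M' | b'B' | b'D' | b'H' | b'V'
    )
}

/// FASTQ requires exactly one quality byte per base.
pub fn quality_matches(bases: &[u8], quality: &[u8]) -> bool {
    bases.len() == quality.len()
}
```

Under these assumptions, `detect_format(b">seq1\nACGT\n")` yields `Some(Format::Fasta)`, and an empty input yields `None`, which a caller could map to an empty iterator.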

Implementation Details

  • Validates entire file upfront (fails fast on errors)
  • Returns Box<dyn Iterator<Item = Sequence>> for consistent interface
  • Detects gzip via both file extension and magic bytes (0x1f 0x8b)
  • Handles whitespace-only lines gracefully
  • Parses sequence ID and optional description
  • Stores quality scores as raw bytes in Sequence type
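
As a rough sketch of how the validate-first design might look for FASTQ, the following collects and checks every record before handing back a boxed iterator. The `Sequence` fields, the `ParseError` shape, and the helpers `looks_gzipped`, `split_header`, and `parse_fastq` are all assumptions for illustration; the real `parse_sequences` signature is not shown in this PR. Decompression itself is omitted because it needs an external crate.

```rust
use std::fmt;

/// Hypothetical record type; field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Sequence {
    pub id: String,
    pub description: Option<String>,
    pub bases: Vec<u8>,
    pub quality: Option<Vec<u8>>, // raw bytes, not decoded Phred scores
}

/// Hypothetical error type carrying a human-readable message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParseError(pub String);

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error: {}", self.0)
    }
}

impl std::error::Error for ParseError {}

/// Gzip streams begin with the magic bytes 0x1f 0x8b.
pub fn looks_gzipped(data: &[u8]) -> bool {
    data.starts_with(&[0x1f, 0x8b])
}

/// Split a header (marker already stripped) into ID and optional description.
pub fn split_header(header: &str) -> (String, Option<String>) {
    match header.trim().split_once(char::is_whitespace) {
        Some((id, rest)) => (id.to_string(), Some(rest.trim().to_string())),
        None => (header.trim().to_string(), None),
    }
}

/// Validate all records first, then return an iterator over them.
pub fn parse_fastq(text: &str) -> Result<Box<dyn Iterator<Item = Sequence>>, ParseError> {
    // Simplification: whitespace-only lines are dropped before grouping.
    let lines: Vec<&str> = text.lines().map(str::trim).filter(|l| !l.is_empty()).collect();
    if lines.len() % 4 != 0 {
        return Err(ParseError(format!(
            "expected a multiple of 4 lines, found {}",
            lines.len()
        )));
    }
    let mut records = Vec::with_capacity(lines.len() / 4);
    for (i, rec) in lines.chunks(4).enumerate() {
        let n = i + 1;
        let header = rec[0]
            .strip_prefix('@')
            .ok_or_else(|| ParseError(format!("record {n}: header must start with '@'")))?;
        if !rec[2].starts_with('+') {
            return Err(ParseError(format!("record {n}: missing '+' separator line")));
        }
        if rec[1].len() != rec[3].len() {
            return Err(ParseError(format!(
                "record {n}: {} bases but {} quality scores",
                rec[1].len(),
                rec[3].len()
            )));
        }
        let (id, description) = split_header(header);
        records.push(Sequence {
            id,
            description,
            bases: rec[1].as_bytes().to_vec(),
            quality: Some(rec[3].as_bytes().to_vec()),
        });
    }
    Ok(Box::new(records.into_iter()))
}
```

One consequence of this sketch is that every record is held in memory before iteration begins, so it fails fast on malformed input at the cost of true streaming. An alternative design would yield `Result<Sequence, ParseError>` items lazily.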

Test Results

  • All 42 acceptance tests passing (14 happy path, 22 edge cases, 6 malformed file rejection tests)
  • No existing tests broken
  • Implementation ready for Phase 1 MVP

🤖 Generated with Claude Code

crashfrog and others added 2 commits May 28, 2026 17:23
Add comprehensive acceptance tests for FASTA/FASTQ parser covering:
- Valid FASTA files (single/multiple sequences, wrapped lines)
- Valid FASTQ files (4-line format with quality scores)
- Auto-detection via magic bytes ('>' for FASTA, '@' for FASTQ)
- Gzip compression support (transparent decompression)
- Iterator-based streaming for memory efficiency
- Quality score validation (length must match sequence length)
- Empty file handling (returns empty iterator)
- Malformed file detection (missing quality, wrong line count, invalid characters)
- Edge cases (long sequences, special characters, IUPAC codes, case insensitivity)
- Real-world formats (NCBI FASTA headers, Illumina FASTQ headers)

All tests currently fail (RED phase) as implementation does not exist yet.

Test stats: 13 passed (error-expecting tests), 30 failed (feature tests)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implement parse_sequences() function that:
- Auto-detects FASTA ('>') vs FASTQ ('@') format via magic bytes
- Supports gzip-compressed files (.gz extension and 0x1f 0x8b magic bytes)
- Validates entire file upfront before returning iterator
- Returns error-checked iterator that yields Sequence objects
- Handles wrapped sequence lines (FASTA) and 4-line FASTQ format
- Validates quality score length matches sequence length (FASTQ)
- Accepts IUPAC ambiguity codes (N, R, Y, S, W, K, M, B, D, H, V)
- Case-insensitive DNA base validation
- Parses sequence ID, optional description, bases, and quality scores
- Gracefully handles empty files, whitespace-only files
- Provides clear error messages for malformed files

All 42 acceptance tests pass (happy path + error cases + edge cases).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@crashfrog
Member Author

Closing stale PR. Work has been superseded or merged via alternative approach.

@crashfrog crashfrog closed this May 31, 2026