Implement FASTA/FASTQ parser for #60 #97
Closed
crashfrog wants to merge 2 commits into
Add comprehensive acceptance tests for FASTA/FASTQ parser covering:
- Valid FASTA files (single/multiple sequences, wrapped lines)
- Valid FASTQ files (4-line format with quality scores)
- Auto-detection via magic bytes ('>' for FASTA, '@' for FASTQ)
- Gzip compression support (transparent decompression)
- Iterator-based streaming for memory efficiency
- Quality score validation (length must match sequence length)
- Empty file handling (returns empty iterator)
- Malformed file detection (missing quality, wrong line count, invalid characters)
- Edge cases (long sequences, special characters, IUPAC codes, case insensitivity)
- Real-world formats (NCBI FASTA headers, Illumina FASTQ headers)
All feature tests currently fail (RED phase) because the implementation does not exist yet.
Test stats: 13 passed (error-expecting tests), 30 failed (feature tests)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
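The auto-detection and transparent gzip handling listed above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `decompress_if_gzip` and `detect_format` are hypothetical, and the sketch assumes the whole input is already in memory as bytes.

```python
import gzip
from typing import Optional

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decompress_if_gzip(data: bytes) -> bytes:
    """Hypothetical helper: decompress only when the gzip magic bytes are present."""
    return gzip.decompress(data) if data[:2] == GZIP_MAGIC else data


def detect_format(data: bytes) -> Optional[str]:
    """Hypothetical helper: return 'fasta', 'fastq', or None for empty input."""
    text = decompress_if_gzip(data).lstrip()
    if not text:
        return None  # empty or whitespace-only input has no format
    first = text[:1]
    if first == b">":
        return "fasta"
    if first == b"@":
        return "fastq"
    raise ValueError(f"unrecognized leading byte {first!r}")
```

Checking the magic bytes rather than only the `.gz` extension means a compressed file with a misleading name would still be decompressed.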
Implement parse_sequences() function that:
- Auto-detects FASTA ('>') vs FASTQ ('@') format via magic bytes
- Supports gzip-compressed files (.gz extension and 0x1f 0x8b magic bytes)
- Validates entire file upfront before returning iterator
- Returns error-checked iterator that yields Sequence objects
- Handles wrapped sequence lines (FASTA) and 4-line FASTQ format
- Validates quality score length matches sequence length (FASTQ)
- Accepts IUPAC ambiguity codes (N, R, Y, S, W, K, M, B, D, H, V)
- Case-insensitive DNA base validation
- Parses sequence ID, optional description, bases, and quality scores
- Gracefully handles empty and whitespace-only files
- Provides clear error messages for malformed files
All 42 acceptance tests pass (happy path + error cases + edge cases).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
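For readers without access to the diff, the behaviour described in this commit might look something like the sketch below. It is an assumption-laden illustration rather than the actual `parse_sequences()` implementation: the `Sequence` field names, the `parse_text` entry point, and the private helpers are all hypothetical, and the sketch works on already-decoded text instead of file paths.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# A, C, G, T plus the IUPAC ambiguity codes named in the commit message.
VALID_BASES = set("ACGTNRYSWKMBDHV")


@dataclass
class Sequence:  # field names are assumptions for this sketch
    id: str
    description: Optional[str]
    bases: str
    quality: Optional[str] = None


def _split_header(header: str) -> Tuple[str, Optional[str]]:
    parts = header[1:].split(None, 1)  # drop the '>' or '@' marker
    if not parts:
        raise ValueError("header has no sequence ID")
    return parts[0], (parts[1] if len(parts) > 1 else None)


def _check_bases(bases: str, seq_id: str) -> None:
    bad = set(bases.upper()) - VALID_BASES  # case-insensitive check
    if bad:
        raise ValueError(f"{seq_id}: invalid characters {sorted(bad)}")


def _parse_fasta(lines: List[str]) -> List[Sequence]:
    records: List[Sequence] = []
    for line in lines:
        if line.startswith(">"):
            seq_id, desc = _split_header(line)
            records.append(Sequence(seq_id, desc, ""))
        elif not records:
            raise ValueError("sequence data before first header")
        else:
            records[-1].bases += line  # join wrapped sequence lines
    for rec in records:
        _check_bases(rec.bases, rec.id)
    return records


def _parse_fastq(lines: List[str]) -> List[Sequence]:
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ line count must be a multiple of 4")
    records = []
    for i in range(0, len(lines), 4):
        header, bases, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        seq_id, desc = _split_header(header)
        _check_bases(bases, seq_id)
        if len(quality) != len(bases):
            raise ValueError(f"{seq_id}: quality length != sequence length")
        records.append(Sequence(seq_id, desc, bases, quality))
    return records


def parse_text(text: str) -> Iterator[Sequence]:
    """Validate the whole input first, then hand back an iterator."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return iter([])  # empty or whitespace-only input
    if lines[0].startswith(">"):
        return iter(_parse_fasta(lines))
    if lines[0].startswith("@"):
        return iter(_parse_fastq(lines))
    raise ValueError("unrecognized format: expected '>' or '@'")
```

Validating everything before returning the iterator means a malformed record near the end of a file is reported immediately, at the cost of holding all records in memory; a purely streaming design would trade that guarantee for lower memory use.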
Closing stale PR. Work has been superseded or merged via alternative approach.
Summary
Implements the FASTA/FASTQ parser (parse_sequences) for phraya-io, fulfilling all acceptance criteria for issue #60:
Implementation Details
Test Results
🤖 Generated with Claude Code