🤖 Written by Claude
Split out from #84 (which keeps the docs + recommendation work).
Problem
RefSeq transcript sequences can differ from the genome (substitutions and non-frameshifting indels). When FastaSeqFetcher builds a transcript sequence by pasting genome exons together, those differences are silently lost, so the sequence a user resolves against is not the real RefSeq transcript. Nothing is currently exposed that would let cdot warn or refuse in that case.
Note the mismatch info is not in the CIGAR/Gap string — Gap only uses M/I/D/F/R, and M ("aligned") already includes mismatches. The real signals live in GFF attributes/notes.
Where the signal is (RefSeq)
- Annotation GFF:
  - exception=annotated by transcript or proteomic data
  - Note=...The RefSeq transcript has N non-frameshifting indel compared to this genomic sequence
  - Note=...The RefSeq protein has N non-frameshifting indel compared to this genomic sequence
- Alignment GFF (cDNA_match features):
  - num_mismatch, pct_identity_gap, pct_coverage
Today cdot only stores the last free-text Note in a note field and parses none of this; FastaSeqFetcher ignores it. Ensembl transcripts always match the genome, so this only affects RefSeq.
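As a rough illustration of what parsing these signals could look like, here is a minimal sketch that works on the attribute column (column 9) of a single GFF3 line. The function names and the returned keys are hypothetical and not part of cdot; the regex only covers the note wording quoted above, so other wordings would need their own handling.

```python
import re
from urllib.parse import unquote

# Matches the note wording quoted above, e.g.
# "The RefSeq transcript has 1 non-frameshifting indel compared to this genomic sequence"
INDEL_NOTE = re.compile(
    r"The RefSeq (transcript|protein) has (\d+) non-frameshifting indels? "
    r"compared to this genomic sequence"
)


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column into a dict of percent-decoded values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def mismatch_signals(column9: str) -> dict:
    """Hypothetical helper: collect the mismatch signals from one feature."""
    attrs = parse_gff3_attributes(column9)
    signals = {
        "exception": attrs.get("exception"),
        "transcript_indels": 0,
        "protein_indels": 0,
    }
    # Unlike keeping only the last free-text Note, scan the whole Note value.
    for kind, count in INDEL_NOTE.findall(attrs.get("Note", "")):
        signals[f"{kind}_indels"] += int(count)
    # These attributes only appear on alignment-GFF cDNA_match features.
    for key in ("num_mismatch", "pct_identity_gap", "pct_coverage"):
        if key in attrs:
            signals[key] = float(attrs[key])
    return signals
```

Keeping the output as a small dict per feature would make it straightforward to merge annotation-GFF and alignment-GFF signals per transcript later, whatever field shape is eventually chosen.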
Scope
- Data/generation: parse exception=, the "N non-frameshifting indel compared to this genomic sequence" notes, and (where available) the alignment-GFF num_mismatch / pct_identity_gap / pct_coverage; store them as structured per-transcript data (distinct from the existing free-text note).
- Schema/format: decide the field shape (e.g. a genome_transcript_mismatch flag and/or pct_identity), bump the schema version, update cdot/models.py + the JSON-schema docs, and keep back-compat loading of older data.
- Client: FastaSeqFetcher warns/refuses on known-mismatch transcripts, configurable (off / warn / raise), defaulting to warn when the field is present.
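The off / warn / raise behaviour could be sketched roughly as follows. This is not FastaSeqFetcher's actual API: MismatchPolicy, GenomeTranscriptMismatchError and check_transcript are hypothetical names, and the sketch assumes the transcript record is a dict carrying the proposed genome_transcript_mismatch flag, which is itself still undecided.

```python
import enum
import warnings


class MismatchPolicy(enum.Enum):
    OFF = "off"
    WARN = "warn"
    RAISE = "raise"


class GenomeTranscriptMismatchError(ValueError):
    """Hypothetical error for transcripts known to differ from the genome."""


def check_transcript(transcript: dict,
                     policy: MismatchPolicy = MismatchPolicy.WARN) -> None:
    """Warn or raise if the record is flagged as differing from the genome."""
    # Older data has no such field: .get() returns None and nothing is reported,
    # which keeps back-compat loading silent.
    if policy is MismatchPolicy.OFF or not transcript.get("genome_transcript_mismatch"):
        return
    message = (
        f"{transcript.get('id', '?')}: RefSeq transcript differs from the genome; "
        "a genome-built sequence is not the real RefSeq sequence"
    )
    if policy is MismatchPolicy.RAISE:
        raise GenomeTranscriptMismatchError(message)
    warnings.warn(message, stacklevel=2)
```

Routing the warn case through the standard warnings module would let callers silence or escalate it with the usual warning filters, independently of the policy setting.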
Optional cheap interim (no data regen): when both a genome-built and a SeqRepo sequence are available for a transcript, compare them and warn on diff.
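A minimal sketch of that interim comparison, assuming both sequences have already been fetched as plain strings (how they are obtained from the genome FASTA and SeqRepo is left out, and warn_if_sequences_differ is a hypothetical name):

```python
import warnings


def warn_if_sequences_differ(transcript_id: str,
                             genome_built_seq: str,
                             seqrepo_seq: str) -> bool:
    """Return True if the sequences match; otherwise warn and return False."""
    a, b = genome_built_seq.upper(), seqrepo_seq.upper()
    if a == b:
        return True
    if len(a) != len(b):
        # A length difference points at an indel or a trailing difference.
        detail = f"lengths differ ({len(a)} vs {len(b)})"
    else:
        n_diff = sum(x != y for x, y in zip(a, b))
        detail = f"{n_diff} substitution(s)"
    warnings.warn(
        f"{transcript_id}: genome-built sequence differs from SeqRepo ({detail})",
        stacklevel=2,
    )
    return False
```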
Acceptance criteria
- FastaSeqFetcher can warn/refuse on mismatched transcripts (configurable)
- (tests/test_data/ already carries the "non-frameshifting indel" notes and exception= attributes)