Capture & surface RefSeq genome/transcript mismatches (warn/refuse in FastaSeqFetcher)

🤖 Written by Claude

Split out from #84 (which keeps the docs + recommendation work).

## Problem

RefSeq transcript sequences can differ from the genome (substitutions and non-frameshifting indels). When `FastaSeqFetcher` builds a transcript sequence by pasting genome exons together, the result silently reflects the genome rather than the transcript at those positions, so the sequence a user resolves against is **not** the real RefSeq transcript. No signal is currently exposed that would let the client warn or refuse in that case.

Note the mismatch info is **not** in the CIGAR/`Gap` string — `Gap` only uses `M/I/D/F/R`, and `M` ("aligned") already includes mismatches. The real signals live in GFF attributes/notes.
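To illustrate why `Gap` alone cannot carry this signal, here is a minimal sketch of a `Gap` parser. The helper names (`parse_gap`, `gap_has_indel`) are hypothetical and not part of cdot; the point is only that a plain `M150` alignment looks identical whether or not the aligned bases match.

```python
import re
from typing import List, Tuple

# Ops used in the Gap attribute: M (aligned), I (insertion), D (deletion),
# F / R (forward / reverse frameshift).
GAP_TOKEN = re.compile(r"([MIDFR])(\d+)")


def parse_gap(gap: str) -> List[Tuple[str, int]]:
    """Split a Gap string such as 'M100 I2 M50' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        match = GAP_TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"Unrecognised Gap token: {token!r}")
        ops.append((match.group(1), int(match.group(2))))
    return ops


def gap_has_indel(gap: str) -> bool:
    """True if any op other than M is present (hypothetical helper)."""
    return any(op != "M" for op, _ in parse_gap(gap))


# An all-M alignment reports no indel, even if some of the 150 aligned
# bases are substitutions: that information is simply not in the string.
print(gap_has_indel("M150"))          # False
print(gap_has_indel("M100 D1 M49"))   # True
```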

## Where the signal is (RefSeq)

Annotation GFF:
- `exception=annotated by transcript or proteomic data`
- `Note=...The RefSeq transcript has N non-frameshifting indel compared to this genomic sequence`
- `Note=...The RefSeq protein has N non-frameshifting indel compared to this genomic sequence`

Alignment GFF (`cDNA_match` features):
- `num_mismatch`, `pct_identity_gap`, `pct_coverage`

Today cdot only stores the last free-text `Note` in a `note` field and parses none of this; `FastaSeqFetcher` ignores it. Ensembl transcripts always match the genome, so this only affects RefSeq.
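As a rough sketch of what parsing these attributes could look like (not cdot's actual generation code), the snippet below reads a GFF3 attribute column and pulls out the signals listed above. The function names and the output dict keys are assumptions made for illustration, and the regex assumes the note wording quoted in this issue, optionally pluralised.

```python
import re
from typing import Dict
from urllib.parse import unquote

# Matches the note wording quoted above; "indels" is accepted as well.
INDEL_NOTE = re.compile(
    r"The RefSeq (transcript|protein) has (\d+) non-frameshifting indels? "
    r"compared to this genomic sequence"
)


def parse_attributes(column9: str) -> Dict[str, str]:
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def mismatch_signal(attrs: Dict[str, str]) -> dict:
    """Collect the mismatch-related attributes into one structured dict."""
    indels = {kind: int(n) for kind, n in INDEL_NOTE.findall(attrs.get("Note", ""))}
    signal = {
        "exception": attrs.get("exception"),
        "transcript_indels": indels.get("transcript", 0),
        "protein_indels": indels.get("protein", 0),
    }
    # Alignment-GFF attributes, only present on cDNA_match features.
    if "num_mismatch" in attrs:
        signal["num_mismatch"] = int(attrs["num_mismatch"])
    for key in ("pct_identity_gap", "pct_coverage"):
        if key in attrs:
            signal[key] = float(attrs[key])
    return signal


# Made-up attribute string for illustration only.
example = (
    "exception=annotated by transcript or proteomic data;"
    "Note=The RefSeq transcript has 1 non-frameshifting indel "
    "compared to this genomic sequence"
)
print(mismatch_signal(parse_attributes(example)))
```

Unlike the current behaviour of keeping only the last `Note`, this keeps every matching count, so a transcript note and a protein note on the same feature are both retained.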

## Scope

1. **Data/generation**: parse `exception=`, the "N non-frameshifting indel compared to this genomic sequence" notes, and (where available) the alignment-GFF `num_mismatch` / `pct_identity_gap` / `pct_coverage`; store the result as structured per-transcript data, distinct from the existing free-text `note`.
2. **Schema/format**: decide the field shape (e.g. a `genome_transcript_mismatch` flag and/or `pct_identity`), bump the schema version, update `cdot/models.py` and the JSON-schema docs, and keep back-compat loading of older data.
3. **Client**: `FastaSeqFetcher` warns or refuses on known-mismatch transcripts, configurable (`off` / `warn` / `raise`), defaulting to `warn` when the field is present.
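The client-side policy in item 3 could be sketched roughly as follows. This is an illustration under assumptions, not a proposed `FastaSeqFetcher` API: `MismatchPolicy`, `GenomeTranscriptMismatchError`, and `check_transcript` are hypothetical names, the transcript is assumed to be a plain dict with an `id` key, and `genome_transcript_mismatch` is the example field name from item 2.

```python
import warnings
from enum import Enum


class MismatchPolicy(Enum):
    OFF = "off"
    WARN = "warn"
    RAISE = "raise"


class GenomeTranscriptMismatchError(Exception):
    """Raised when a known-mismatch transcript is requested under RAISE."""


def check_transcript(transcript: dict,
                     policy: MismatchPolicy = MismatchPolicy.WARN) -> bool:
    """Apply the policy; return True if the transcript is flagged."""
    # Older data has no such field: .get() keeps it loadable and treats
    # the transcript as "unknown", which is never flagged.
    flagged = bool(transcript.get("genome_transcript_mismatch"))
    if not flagged or policy is MismatchPolicy.OFF:
        return False
    message = (f"{transcript.get('id', '<unknown>')}: RefSeq transcript "
               "differs from the genome; genome-built sequence may be wrong")
    if policy is MismatchPolicy.RAISE:
        raise GenomeTranscriptMismatchError(message)
    warnings.warn(message)
    return True


# Old-format record (no field) passes silently; flagged record warns.
check_transcript({"id": "NM_TEST.1"})
check_transcript({"id": "NM_TEST.2", "genome_transcript_mismatch": True})
```

Treating a missing field as "not flagged" is what keeps older data loading unchanged; the cost is that old data never warns, which may be worth stating in the docs.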

Optional cheap interim (no data regen): when both a genome-built and a SeqRepo sequence are available for a transcript, compare them and warn on diff.
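A minimal sketch of that interim comparison, assuming both sequences are already in hand as plain strings (how they are fetched is left out, and `first_difference` / `warn_on_diff` are hypothetical helper names):

```python
import warnings
from typing import Optional


def first_difference(genome_built: str, reference: str) -> Optional[int]:
    """Return the 0-based index of the first differing base, or None."""
    a, b = genome_built.upper(), reference.upper()
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def warn_on_diff(transcript_id: str, genome_built: str, reference: str) -> bool:
    """Warn and return True if the two sequences differ."""
    pos = first_difference(genome_built, reference)
    if pos is None:
        return False
    warnings.warn(
        f"{transcript_id}: genome-built sequence differs from the reference "
        f"sequence (first difference at index {pos})"
    )
    return True


# Toy sequences for illustration only.
print(first_difference("ACGTACGT", "ACGTACGT"))  # None
print(first_difference("ACGTACGT", "ACGAACGT"))  # 3
```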

## Acceptance criteria

- [ ] RefSeq generation captures the genome↔transcript mismatch signal into a structured field
- [ ] Schema version bumped; new field documented; old data still loads
- [ ] `FastaSeqFetcher` can warn/refuse on mismatched transcripts (configurable)
- [ ] Tests cover a known-mismatch RefSeq transcript (a test GFF in `tests/test_data/` already carries the "non-frameshifting indel" notes and `exception=` attributes)

