Capture & surface RefSeq genome/transcript mismatches (warn/refuse in FastaSeqFetcher) #115

Description

@davmlaw

🤖 Written by Claude

Split out from #84 (which keeps the docs + recommendation work).

Problem

RefSeq transcript sequences can differ from the genome (substitutions and non-frameshifting indels). When FastaSeqFetcher builds a transcript sequence by pasting genome exons together, the genome's bases are used at those positions without any notice, so the sequence a user resolves against is not the real RefSeq transcript. Nothing currently exposes a signal that would let the fetcher warn or refuse in that case.

Note the mismatch info is not in the CIGAR/Gap string — Gap only uses M/I/D/F/R, and M ("aligned") already includes mismatches. The real signals live in GFF attributes/notes.
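To illustrate why the Gap string cannot carry this signal, here is a minimal sketch that parses a GFF3 Gap value into operations. The helper names `parse_gap` and `has_indel` are hypothetical and not part of cdot. A substitution inside an aligned block stays within an `M` run, so a Gap made only of `M` operations looks identical whether or not the bases agree.

```python
import re

# GFF3 Gap values are space-separated operations such as "M8 D3 M6 I1 M6".
GAP_OP_RE = re.compile(r"^([MIDFR])(\d+)$")


def parse_gap(gap: str) -> list[tuple[str, int]]:
    """Parse a GFF3 Gap value into (operation, length) pairs (hypothetical helper)."""
    ops = []
    for token in gap.split():
        match = GAP_OP_RE.match(token)
        if match is None:
            raise ValueError(f"Unrecognised Gap operation: {token!r}")
        ops.append((match.group(1), int(match.group(2))))
    return ops


def has_indel(ops: list[tuple[str, int]]) -> bool:
    """True if the alignment contains an insertion or deletion.

    Substitutions are invisible here: they sit inside 'M' runs.
    """
    return any(op in ("I", "D") for op, _ in ops)
```

A Gap of `M100` therefore reports no indel even if some of the 100 aligned bases differ, which is why the GFF attributes and notes are needed.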

Where the signal is (RefSeq)

Annotation GFF:

  • exception=annotated by transcript or proteomic data
  • Note=...The RefSeq transcript has N non-frameshifting indel compared to this genomic sequence
  • Note=...The RefSeq protein has N non-frameshifting indel compared to this genomic sequence

Alignment GFF (cDNA_match features):

  • num_mismatch, pct_identity_gap, pct_coverage

Today cdot only stores the last free-text Note in a note field and parses none of this; FastaSeqFetcher ignores it. Ensembl transcripts always match the genome, so this only affects RefSeq.
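As a rough sketch of what parsing these signals could look like, the snippet below splits a GFF3 attribute column and pulls out the attributes listed above. The function names and the output keys are assumptions made for illustration, not cdot's API or the final field shape. The regex matches only the note wording quoted in this issue; real annotation notes may combine it with other wording that this sketch does not handle.

```python
import re
from urllib.parse import unquote

# Matches the note wording quoted in this issue (hypothetical pattern).
INDEL_NOTE_RE = re.compile(
    r"The RefSeq (transcript|protein) has (\d+) non-frameshifting indels? "
    r"compared to this genomic sequence"
)


def parse_gff3_attributes(column9: str) -> dict[str, list[str]]:
    """Split a GFF3 attribute column into {key: [values]}, keeping every value."""
    attrs: dict[str, list[str]] = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # Commas separate multiple values; literal commas are percent-encoded.
        attrs.setdefault(key, []).extend(unquote(v) for v in value.split(","))
    return attrs


def extract_mismatch_signals(attrs: dict[str, list[str]]) -> dict:
    """Collect genome/transcript mismatch signals into a structured dict."""
    signals: dict = {}
    if "exception" in attrs:
        signals["exception"] = ", ".join(attrs["exception"])
    note = ", ".join(attrs.get("Note", []))
    for kind, count in INDEL_NOTE_RE.findall(note):
        signals[f"{kind}_non_frameshifting_indels"] = int(count)
    if "num_mismatch" in attrs:
        signals["num_mismatch"] = int(attrs["num_mismatch"][0])
    for key in ("pct_identity_gap", "pct_coverage"):
        if key in attrs:
            signals[key] = float(attrs[key][0])
    return signals
```

Keeping every value of a repeated key, rather than only the last, avoids the loss described above where only the final free-text Note survives.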

Scope

  1. Data/generation — parse exception=, the "N non-frameshifting indel compared to this genomic sequence" notes, and (where available) the alignment-GFF num_mismatch / pct_identity_gap / pct_coverage; store the results as structured per-transcript data (distinct from the existing free-text note).
  2. Schema/format — decide the field shape (e.g. a genome_transcript_mismatch flag and/or pct_identity), bump the schema version, update cdot/models.py + the JSON-schema docs, and keep back-compat loading of older data.
  3. Client — FastaSeqFetcher warns/refuses on known-mismatch transcripts, configurable (off / warn / raise), defaulting to warn when the field is present.
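The configurable behaviour in item 3 could be sketched roughly as follows. Everything here is an assumption for illustration: `MismatchPolicy`, `TranscriptGenomeMismatchError`, and `check_transcript_mismatch` are hypothetical names, the transcript is modelled as a plain dict with an `id` key, and `genome_transcript_mismatch` is just the example field name floated in item 2. How this would hook into FastaSeqFetcher is left open.

```python
import warnings
from enum import Enum


class MismatchPolicy(Enum):
    OFF = "off"
    WARN = "warn"
    RAISE = "raise"


class TranscriptGenomeMismatchError(ValueError):
    """Raised when a transcript is known to differ from the genome (hypothetical)."""


def check_transcript_mismatch(
    transcript: dict, policy: MismatchPolicy = MismatchPolicy.WARN
) -> None:
    """Warn or raise if the transcript data flags a genome/transcript mismatch."""
    if policy is MismatchPolicy.OFF:
        return
    # .get() keeps older data loadable: a missing field means "unknown", not an error.
    if not transcript.get("genome_transcript_mismatch"):
        return
    message = (
        f"{transcript.get('id', '<unknown>')}: RefSeq transcript differs from the "
        "genome; a sequence built from genome exons will not match it"
    )
    if policy is MismatchPolicy.RAISE:
        raise TranscriptGenomeMismatchError(message)
    warnings.warn(message, UserWarning, stacklevel=2)
```

Treating an absent field as "no known mismatch" is one way to satisfy the back-compat requirement, at the cost of staying silent on data generated before the schema bump.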

Optional cheap interim (no data regen): when both a genome-built and a SeqRepo sequence are available for a transcript, compare them and warn on diff.
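A minimal sketch of that interim check, assuming both sequences have already been fetched as plain strings. The function names are hypothetical, and how the genome-built and SeqRepo sequences are obtained is outside this example.

```python
import warnings
from typing import Optional


def first_difference(seq_a: str, seq_b: str) -> Optional[int]:
    """Return the 0-based index of the first difference, or None if identical."""
    if seq_a == seq_b:
        return None
    for index, (base_a, base_b) in enumerate(zip(seq_a, seq_b)):
        if base_a != base_b:
            return index
    # One sequence is a prefix of the other; they diverge where the shorter ends.
    return min(len(seq_a), len(seq_b))


def warn_on_sequence_diff(transcript_id: str, genome_built: str, reference: str) -> bool:
    """Warn and return True if the genome-built sequence differs from the reference."""
    position = first_difference(genome_built.upper(), reference.upper())
    if position is None:
        return False
    warnings.warn(
        f"{transcript_id}: genome-built sequence differs from the reference "
        f"transcript sequence (first difference at index {position}; "
        f"lengths {len(genome_built)} vs {len(reference)})",
        UserWarning,
        stacklevel=2,
    )
    return True
```

The comparison is case-insensitive so that soft-masked (lowercase) genome bases do not trigger a false warning.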

Acceptance criteria

  • RefSeq generation captures the genome↔transcript mismatch signal into a structured field
  • Schema version bumped; new field documented; old data still loads
  • FastaSeqFetcher can warn/refuse on mismatched transcripts (configurable)
  • Tests cover a known-mismatch RefSeq transcript (a test GFF in tests/test_data/ already carries the "non-frameshifting indel" notes and exception= attributes)
