Reduce extractor type-mislabelling via MB-aware validation pass and MB-anchored prompt examples #103

@rafacm

Description

Context

PR #100's Django Reinhardt cutover left 57/85 entities in wikidata_status=pending. Beyond the Whisper-hallucination class (tracked in #102), a meaningful fraction were correctly transcribed but mis-labelled by the extractor — for example, "Carnegie Hall" was extracted under entity_type city despite being a venue (which exists in musicbrainz.place).

Once the type is wrong, foreground MB resolution looks in the wrong table, finds nothing, and the entity lands without an MBID; the background Wikidata enrichment is then also working with the wrong P31 class, so it cannot recover either. Better grounding at extraction time fixes this at the source for the entire pipeline.

The MB DB is already loaded locally and is itself the canonical source of truth we trust. We can use it both as a grounding signal in the extraction prompt and as a post-extraction validation gate.

Proposals

1. MB-aware type validation pass (highest leverage)

After the LLM extraction step but before resolution, run a cross-table MB lookup: for each (name, entity_type) extracted, check whether the name exists under the expected MB table or under a different MB table.

Three branches:

| Outcome | Action |
| --- | --- |
| Exact match in expected table | Pass through unchanged. |
| No match in expected table; exact match in exactly one other MB-backed table | Reclassify to that table's entity_type. ("Carnegie Hall" → drop from city, attach as music_venue.) |
| Match in multiple other tables, or no match anywhere | Either re-prompt the extractor with the conflict surfaced, or pass through unchanged for the resolver to handle. |

This costs only Postgres queries against the already-loaded MB DB. It also self-reinforces: the extractor's mistakes get corrected by the canonical source, and the corrected entities_json flows into resolution and embedding.

Where it lands: a new pipeline step between extract and resolve, or a section at the end of episodes/extractor.py.
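
A minimal sketch of this pass, assuming Django's default connection can reach the loaded MB schema; the MB_LOOKUP_SQL mapping, function names, and module placement are illustrative, not existing project code:

```python
from django.db import connection

# Hypothetical mapping from extractor entity_type to an exact-name lookup
# against the corresponding MB table. One entry per MB-backed type.
MB_LOOKUP_SQL = {
    "music_venue": "SELECT 1 FROM musicbrainz.place WHERE name = %s LIMIT 1",
    "musician": (
        "SELECT 1 FROM musicbrainz.artist a"
        " JOIN musicbrainz.artist_type t ON a.type = t.id"
        " WHERE t.name = 'Person' AND a.name = %s LIMIT 1"
    ),
    "city": "SELECT 1 FROM musicbrainz.area WHERE name = %s LIMIT 1",
    # ...remaining MB-backed entity_types
}

def mb_types_matching(name: str) -> list[str]:
    """Return every entity_type whose MB table has an exact-name match."""
    hits = []
    with connection.cursor() as cur:
        for entity_type, sql in MB_LOOKUP_SQL.items():
            cur.execute(sql, [name])
            if cur.fetchone():
                hits.append(entity_type)
    return hits

def validate_entity_type(name: str, entity_type: str) -> str:
    """Apply the three-branch rule from the table above."""
    hits = mb_types_matching(name)
    if entity_type in hits:
        return entity_type  # exact match in expected table: pass through
    if len(hits) == 1:
        return hits[0]      # unambiguous match in one other table: reclassify
    return entity_type      # ambiguous or no match: leave for the resolver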

Tests: "Carnegie Hall as city" → reclassified to music_venue, "Miles Davis as city" → no match in area, exact match in artist/Person → reclassified to musician, "Liberschi" → no MB match anywhere → passed through (no info gained).

2. MB-anchored prompt examples (cheap, complementary)

episodes/initial_entity_types.yaml already has an examples field for each type (used inside the extraction system prompt). Today these are hand-written — small set, possibly stale.

Generate per-type examples directly from MB at app startup or via a one-shot management command:

  • music_venue — sample top-N from musicbrainz.place (jazz-relevant or just popular).
  • musician — sample from musicbrainz.artist filtered by type=Person.
  • etc.

The extractor sees realistic, MB-faithful examples instead of curated ones. The classification distribution shifts toward labels that actually exist in MB.

Where it lands: a new manage.py refresh_entity_type_examples command that pulls a sample from each MB table and updates EntityType.examples. Run it after every MB dump re-import.
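
A sketch of that command, assuming EntityType exposes name and examples fields; the sampling queries are placeholders (a real top-N/popularity criterion is TBD):

```python
from django.core.management.base import BaseCommand
from django.db import connection

from episodes.models import EntityType  # assumed model location

# Placeholder sampling queries; "popular" or "jazz-relevant" filtering
# would need e.g. MB ratings or relationship counts.
SAMPLE_SQL = {
    "music_venue": "SELECT name FROM musicbrainz.place LIMIT %s",
    "musician": (
        "SELECT a.name FROM musicbrainz.artist a"
        " JOIN musicbrainz.artist_type t ON a.type = t.id"
        " WHERE t.name = 'Person' LIMIT %s"
    ),
    # ...remaining MB-backed entity_types
}

class Command(BaseCommand):
    help = "Refresh EntityType.examples from the local MusicBrainz dump."

    def handle(self, *args, **options):
        for type_name, sql in SAMPLE_SQL.items():
            with connection.cursor() as cur:
                cur.execute(sql, [10])
                examples = [row[0] for row in cur.fetchall()]
            # Overwrites on each run, so re-running after a re-import is safe.
            EntityType.objects.filter(name=type_name).update(examples=examples)
            self.stdout.write(f"{type_name}: refreshed with {len(examples)} examples")
```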

3. Reasoning prompt for ambiguous classifications (medium)

For each extracted entity, ask the extractor to briefly justify why the name fits the type. A short chain-of-thought catches mis-labels in non-thinking models. It adds output tokens and can be turned off later for cost.

Where it lands: extend RESOLUTION_RESPONSE_SCHEMA (and the corresponding extraction schema) with a justification: str field per entity, log it, and use it for debugging the next round of mis-labels.
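
An illustrative per-entity fragment only; the real RESOLUTION_RESPONSE_SCHEMA shape may differ:

```python
# Field added to each entity object in the extraction/resolution schemas.
JUSTIFICATION_FIELD = {
    "justification": {
        "type": "string",
        "description": (
            "One sentence on why this name fits the chosen entity_type, "
            "e.g. 'Carnegie Hall is a concert venue, not a city.'"
        ),
    },
}
```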

4. Confidence scoring (cheap, complementary)

Have the extractor emit confidence: low|medium|high per entity. Drop low before resolution, or surface them in the admin for review. Pure prompt engineering + schema update.
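
A minimal pre-resolution filter sketch, assuming extracted entities are dicts carrying the new confidence field:

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

def drop_low_confidence(entities: list[dict], floor: str = "medium") -> list[dict]:
    """Keep only entities at or above the confidence floor before resolution."""
    return [
        e for e in entities
        # Missing or unrecognized confidence is treated as medium.
        if CONFIDENCE_RANK.get(e.get("confidence"), 1) >= CONFIDENCE_RANK[floor]
    ]
```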

5. (Skip for now) Per-type extraction passes

Running 14 separate narrow extraction calls ("find only musicians", "find only venues") would improve precision but blow up LLM costs ~14×. Not justifiable until simpler steps are exhausted.

Suggested implementation order

  1. MB-aware type validation pass (#3 in extraction priority list, biggest payoff). Wire it as a post-extraction filter that runs cross-table MB lookups via the existing episodes/musicbrainz.py plumbing. Add comprehensive tests with hand-rolled fixtures.
  2. manage.py refresh_entity_type_examples: pull MB samples to populate EntityType.examples. One-shot, idempotent, run after MB re-imports.
  3. Re-ingest the Django Reinhardt episode and measure: how many of the 57 PENDING entities resolve after these two changes? Compare against the baseline.
  4. Add reasoning + confidence fields only if recall is still lagging.

Out of scope

Linked PR: #100.
