Reduce extractor type-mislabelling via MB-aware validation pass and MB-anchored prompt examples #103

@rafacm

Description

Context

PR #100's Django Reinhardt cutover left 57/85 entities in wikidata_status=pending. Beyond the Whisper-hallucination class (tracked in #102), a meaningful fraction were correctly transcribed but mis-labelled by the extractor — for example, "Carnegie Hall" was extracted under entity_type city despite being a venue (which exists in musicbrainz.place).

Once the type is wrong, foreground MB resolution looks in the wrong table, finds nothing, and the entity lands without an MBID; the background Wikidata enrichment is then also working with the wrong P31 class, so it cannot recover either. Better grounding at extraction time fixes this at the source for the entire pipeline.

The MB DB is already loaded locally and is itself the canonical source of truth we trust. We can use it both as a grounding signal in the extraction prompt and as a post-extraction validation gate.

Proposals

1. MB-aware type validation pass (highest leverage)

After the LLM extraction step but before resolution, run a cross-table MB lookup: for each (name, entity_type) extracted, check whether the name exists under the expected MB table or under a different MB table.

Three branches:

| Outcome | Action |
| --- | --- |
| Exact match in expected table | Pass through unchanged. |
| No match in expected table; exact match in exactly one other MB-backed table | Reclassify to that table's entity_type. ("Carnegie Hall" → drop from city, attach as music_venue.) |
| Match in multiple other tables, or no match anywhere | Either re-prompt the extractor with the conflict surfaced, or pass through unchanged for the resolver to handle. |

This costs only Postgres queries against the already-loaded MB DB. It also self-reinforces: the extractor's mistakes get corrected by the canonical source, and the corrected entities_json flows into resolution and embedding.

Where it lands: a new pipeline step between extract and resolve, or a section at the end of episodes/extractor.py.
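
A minimal sketch of this pass, assuming Django's default connection can reach the loaded MB schema; the MB_LOOKUP_SQL mapping, function names, and module placement are illustrative, not existing project code:

```python
from django.db import connection

# Hypothetical mapping from extractor entity_type to an exact-name lookup
# against the corresponding MB table. One entry per MB-backed type.
MB_LOOKUP_SQL = {
    "music_venue": "SELECT 1 FROM musicbrainz.place WHERE name = %s LIMIT 1",
    "musician": (
        "SELECT 1 FROM musicbrainz.artist a"
        " JOIN musicbrainz.artist_type t ON a.type = t.id"
        " WHERE t.name = 'Person' AND a.name = %s LIMIT 1"
    ),
    "city": "SELECT 1 FROM musicbrainz.area WHERE name = %s LIMIT 1",
    # ...remaining MB-backed entity_types
}

def mb_types_matching(name: str) -> list[str]:
    """Return every entity_type whose MB table has an exact-name match."""
    hits = []
    with connection.cursor() as cur:
        for entity_type, sql in MB_LOOKUP_SQL.items():
            cur.execute(sql, [name])
            if cur.fetchone():
                hits.append(entity_type)
    return hits

def validate_entity_type(name: str, entity_type: str) -> str:
    """Apply the three-branch rule from the table above."""
    hits = mb_types_matching(name)
    if entity_type in hits:
        return entity_type  # exact match in expected table: pass through
    if len(hits) == 1:
        return hits[0]      # unambiguous match in one other table: reclassify
    return entity_type      # ambiguous or no match: leave for the resolver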

Tests: "Carnegie Hall as city" → reclassified to music_venue, "Miles Davis as city" → no match in area, exact match in artist/Person → reclassified to musician, "Liberschi" → no MB match anywhere → passed through (no info gained).

2. MB-anchored prompt examples (cheap, complementary)

episodes/initial_entity_types.yaml already has an examples field for each type (used inside the extraction system prompt). Today these are hand-written — small set, possibly stale.

Generate per-type examples directly from MB at app startup or via a one-shot management command:

  • music_venue — sample top-N from musicbrainz.place (jazz-relevant or just popular).
  • musician — sample from musicbrainz.artist filtered by type=Person.
  • etc.

The extractor sees realistic, MB-faithful examples instead of curated ones. The classification distribution shifts toward labels that actually exist in MB.

Where it lands: a new manage.py refresh_entity_type_examples command that pulls a sample from each MB table and updates EntityType.examples. Run it after every MB dump re-import.
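
A sketch of that command, assuming EntityType exposes name and examples fields; the sampling queries are placeholders (a real top-N/popularity criterion is TBD):

```python
from django.core.management.base import BaseCommand
from django.db import connection

from episodes.models import EntityType  # assumed model location

# Placeholder sampling queries; "popular" or "jazz-relevant" filtering
# would need e.g. MB ratings or relationship counts.
SAMPLE_SQL = {
    "music_venue": "SELECT name FROM musicbrainz.place LIMIT %s",
    "musician": (
        "SELECT a.name FROM musicbrainz.artist a"
        " JOIN musicbrainz.artist_type t ON a.type = t.id"
        " WHERE t.name = 'Person' LIMIT %s"
    ),
    # ...remaining MB-backed entity_types
}

class Command(BaseCommand):
    help = "Refresh EntityType.examples from the local MusicBrainz dump."

    def handle(self, *args, **options):
        for type_name, sql in SAMPLE_SQL.items():
            with connection.cursor() as cur:
                cur.execute(sql, [10])
                examples = [row[0] for row in cur.fetchall()]
            # Overwrites on each run, so re-running after a re-import is safe.
            EntityType.objects.filter(name=type_name).update(examples=examples)
            self.stdout.write(f"{type_name}: refreshed with {len(examples)} examples")
```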

3. Reasoning prompt for ambiguous classifications (medium)

For each extracted entity, ask the extractor to briefly justify why the name fits the type. A short chain-of-thought catches mis-labels in non-thinking models. It adds output tokens and can be turned off later for cost.

Where it lands: extend RESOLUTION_RESPONSE_SCHEMA (and the corresponding extraction schema) with a justification: str field per entity, log it, and use it for debugging the next round of mis-labels.
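
An illustrative per-entity fragment only; the real RESOLUTION_RESPONSE_SCHEMA shape may differ:

```python
# Field added to each entity object in the extraction/resolution schemas.
JUSTIFICATION_FIELD = {
    "justification": {
        "type": "string",
        "description": (
            "One sentence on why this name fits the chosen entity_type, "
            "e.g. 'Carnegie Hall is a concert venue, not a city.'"
        ),
    },
}
```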

4. Confidence scoring (cheap, complementary)

Have the extractor emit confidence: low|medium|high per entity. Drop low before resolution, or surface them in the admin for review. Pure prompt engineering + schema update.
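
A minimal pre-resolution filter sketch, assuming extracted entities are dicts carrying the new confidence field:

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

def drop_low_confidence(entities: list[dict], floor: str = "medium") -> list[dict]:
    """Keep only entities at or above the confidence floor before resolution."""
    return [
        e for e in entities
        # Missing or unrecognized confidence is treated as medium.
        if CONFIDENCE_RANK.get(e.get("confidence"), 1) >= CONFIDENCE_RANK[floor]
    ]
```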

5. (Skip for now) Per-type extraction passes

Running 14 separate narrow extraction calls ("find only musicians", "find only venues") would improve precision but blow up LLM costs ~14×. Not justifiable until simpler steps are exhausted.

Suggested implementation order

  1. MB-aware type validation pass (#3 in extraction priority list, biggest payoff). Wire it as a post-extraction filter that runs cross-table MB lookups via the existing episodes/musicbrainz.py plumbing. Add comprehensive tests with hand-rolled fixtures.
  2. manage.py refresh_entity_type_examples: pull MB samples to populate EntityType.examples. One-shot, idempotent, run after MB re-imports.
  3. Re-ingest the Django Reinhardt episode and measure: how many of the 57 PENDING entities resolve after these two changes? Compare against the baseline.
  4. Add reasoning + confidence fields only if recall is still lagging.

Out of scope

Linked PR: #100.
