
Wikidata enrichment enqueued with string field names instead of entity IDs #138

@rafacm

Description

Symptom

After processing an episode, two enrich_entity_wikidata workflows show as ERROR in the Episode admin's "View workflow steps" page (/admin/episodes/episode/<id>/dbos-steps/), with no step records:

episode-1-run-1-10  ERROR  wikidata_enrichment   2026-05-04 14:25
episode-1-run-1-11  ERROR  wikidata_enrichment   2026-05-04 14:25

The main pipeline workflow (episode-1-run-1) succeeds — the episode reaches ready. Only the downstream Wikidata enrichments fail.

The Django/uvicorn log shows two stack traces, both from Entity.objects.get(pk=entity_id) in episodes/enrichment.py:_fetch_entity:

ValueError: Field 'id' expected a number but got 'status'.
ValueError: Field 'id' expected a number but got 'episode_id'.
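The error originates in Django's integer primary-key coercion: an `AutoField` lookup ultimately calls `int()` on the value, so a field-name string blows up before any query runs. A minimal Django-free sketch of the failure mode (`coerce_pk` is illustrative, not Django code; only the message format mimics Django's):

```python
# Stand-in for Django's integer-pk coercion: int() rejects the string,
# producing the same shape of ValueError as the tracebacks above.
def coerce_pk(value):
    try:
        return int(value)
    except (TypeError, ValueError):
        raise ValueError(f"Field 'id' expected a number but got {value!r}.")
```

So any non-integer that reaches `Entity.objects.get(pk=entity_id)` fails this way; the interesting question is how the strings got enqueued at all.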

Smoking gun

The resolve_step admin row shows that the resolver returned literal field-name strings as entity IDs:

ResolveStepOutput(
  episode_id=1,
  step_name=Episode.Status.RESOLVING,
  entity_ids_to_enrich=('status', 'episode_id')   ← bug
)

Confirmed at the queue level by decoding dbos.workflow_status.inputs (base64 pickle):

# episode-1-run-1-10
{'args': ('status',), 'kwargs': {}}

# episode-1-run-1-11
{'args': ('episode_id',), 'kwargs': {}}

So the strings were already what the resolver enqueued — this is a bug inside resolver.resolve_entities(), not in the workflow plumbing or in DBOS replay/deserialization.

Steps to reproduce

  1. Fresh dev DB:
    uv run python manage.py dbreset --yes
    uv run python manage.py migrate
    uv run python manage.py load_entity_types
  2. Start an ASGI worker:
    uv run uvicorn ragtime.asgi:application --host 127.0.0.1 --port 8000
  3. From a separate terminal, submit any episode whose extracted text plausibly produces words like "status" or "episode" (the ARD URL we used reproduces it):
    uv run python manage.py submit_episode "https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/"
  4. Wait for the pipeline to complete. Check /admin/episodes/episode/<id>/dbos-steps/ — two enrich_entity_wikidata workflows should appear with status ERROR.

Reproducible on main (the resolver/enrichment code paths are unchanged by recent PRs #129/#131).

Diagnostic queries

Run against the dev DB ($RAGTIME_DB_USER should be set to your configured database user):

1. Verify EntityType keys are clean — should be the 14 canonical jazz types, no status / episode_id:

docker exec ragtime-postgres-1 psql -U "$RAGTIME_DB_USER" -d ragtime -c "
  SELECT key, name, is_active FROM episodes_entitytype ORDER BY key;
"

2. Verify chunk.entities_json top-level keys are clean — same expectation:

docker exec ragtime-postgres-1 psql -U "$RAGTIME_DB_USER" -d ragtime -c "
  SELECT DISTINCT jsonb_object_keys(entities_json::jsonb) AS type_key
  FROM episodes_chunk
  WHERE episode_id = 1 AND entities_json IS NOT NULL
  ORDER BY type_key;
"

3. List entities actually persisted for the episode — does any have a name that matches the bad strings? Are PKs all integers as expected?

docker exec ragtime-postgres-1 psql -U "$RAGTIME_DB_USER" -d ragtime -c "
  SELECT e.id, e.name, et.key AS type_key, e.wikidata_status, e.wikidata_attempts
  FROM episodes_entity e
  JOIN episodes_entitytype et ON et.id = e.entity_type_id
  WHERE e.id IN (
    SELECT DISTINCT entity_id FROM episodes_entitymention WHERE episode_id = 1
  )
  ORDER BY e.id;
"

4. Inspect the failed enrichment workflows' pickled inputs — confirms what was on the wire:

docker exec ragtime-postgres-1 psql -U "$RAGTIME_DB_USER" -d ragtime -c "
  SELECT workflow_uuid, status, substring(inputs from 1 for 200) AS inputs_preview
  FROM dbos.workflow_status
  WHERE name LIKE '%enrich_entity_wikidata%'
  ORDER BY created_at DESC LIMIT 5;
"

The inputs column is base64-encoded pickle. Decode in Python:

import base64, pickle
pickle.loads(base64.b64decode("<inputs string>"))
# → {'args': ('status',), 'kwargs': {}}

5. Inspect the chunk JSON content for entity names (not just type keys) — the names that flow into unique_names in the resolver:

docker exec ragtime-postgres-1 psql -U "$RAGTIME_DB_USER" -d ragtime -c "
  SELECT index, jsonb_pretty(entities_json::jsonb)
  FROM episodes_chunk
  WHERE episode_id = 1 AND entities_json IS NOT NULL
  ORDER BY index LIMIT 1;
"

Code-level analysis

The only code path that populates entities_to_enrich is in episodes/resolver.py:

# resolver.py:274
entities_to_enrich: set[int] = set()

# resolver.py:284 — the only .add() call
def _maybe_enqueue(entity: Entity) -> None:
    if (...):
        entities_to_enrich.add(entity.pk)

_maybe_enqueue is called from three places, each of which assigns entity from either:

  • Entity.objects.get_or_create(...) (via _get_or_create_entity)
  • existing_by_id[matched_id]
  • existing_by_mbid[mbid]

All three should yield Entity instances with integer pk. The bug must therefore be one of:

  1. Some other call site we haven't found that mutates entities_to_enrich (e.g. .update() with an iterable of strings).
  2. An unexpected object passed to _maybe_enqueue that has .wikidata_id / .wikidata_status / .wikidata_attempts attributes (so the guard succeeds) but whose .pk returns a string. No model in the codebase obviously fits that shape.
  3. An LLM resolution response leaking — e.g. match["matched_entity_id"] returns the string "status", which then ends up assigned to entity somehow. The resolution schema constrains it to ["integer", "null"], but a non-strict provider might let strings through.
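If hypothesis 3 is in play, enforcing the ["integer", "null"] constraint in code at the point where the LLM response is ingested would surface it immediately. A sketch under that assumption (`parse_matched_entity_id` is hypothetical, not an existing resolver helper):

```python
# Hypothetical guard for hypothesis 3: a non-strict provider may return
# strings where the schema says ["integer", "null"], so check the type
# before the value can ever be treated as an entity ID.
def parse_matched_entity_id(raw):
    # bool is a subclass of int in Python, so exclude it explicitly.
    if raw is None or (isinstance(raw, int) and not isinstance(raw, bool)):
        return raw
    raise TypeError(f"matched_entity_id must be int or null, got {raw!r}")
```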

The two strings (status, episode_id) match Django Episode-model and EntityMention-FK field names respectively. That co-occurrence suggests something is iterating over a Django model's field-name introspection (_meta.fields, __dict__, etc.) rather than over actual entity IDs — but I haven't located that path. Worth scanning for any code that hands entity._meta or similar to the resolver.
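To narrow the call site without guessing, temporary instrumentation inside _maybe_enqueue could log a stack trace whenever a non-int pk slips in. A minimal sketch (logger name and wrapper are illustrative, not the real resolver code):

```python
import logging

log = logging.getLogger("resolver.debug")

def add_entity_pk(entities_to_enrich, entity):
    # Log offending values with a stack trace so the calling path is
    # identifiable, then add unconditionally to preserve current behavior.
    pk = entity.pk
    if not isinstance(pk, int):
        log.error(
            "non-int pk %r (%s) from %r",
            pk, type(pk).__name__, entity, stack_info=True,
        )
    entities_to_enrich.add(pk)
```

One run against the reproduction URL with this in place should point straight at whichever of the three call sites (or an unknown fourth) supplies the strings.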

Impact

  • Per occurrence: two Entity rows whose enrichment is permanently stuck — wikidata_status='pending', wikidata_attempts=1 after the failed run. manage.py enrich_entities will retry them, but it'll re-fetch with the same string IDs and fail the same way (the workflow re-uses the persisted bad args via DBOS workflow recovery semantics).
  • Per episode: the main pipeline succeeds and the episode reaches ready, so the user-facing impact is missing Wikidata IDs on a couple of entities. Search-time hydration in vector_store.search_chunks() simply returns None for those entities' Q-IDs.
  • Pre-existing: unchanged by recent PRs #129 (Make StepFailed pickles portable across processes, closes #110) and #131 (DBOS enqueue-only clients: fix dispatcher race in management commands). The bug likely affects every episode that exercises the affected resolver code path.

Acceptance criteria

  • Identify the call site that produces string values in entities_to_enrich. (Most likely a small targeted change once found.)
  • Add a unit test against resolver.resolve_entities() with a fixture that triggers the path — assert the returned list contains only integers (all(isinstance(x, int) for x in ids)).
  • Re-run the reproduction steps above; both enrich_entity_wikidata workflows succeed (or short-circuit cleanly per the resolver-level idempotency rules).
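The integer-only assertion from the second criterion can be factored into a small helper for the unit test. A sketch (FakeOutput stands in for the real ResolveStepOutput; the actual test would run resolver.resolve_entities() against an episode fixture instead):

```python
from dataclasses import dataclass

# Stand-in for ResolveStepOutput, just enough shape for the assertion.
@dataclass
class FakeOutput:
    entity_ids_to_enrich: tuple

def assert_only_int_ids(output):
    # Reject anything that isn't a plain int (bools are ints in Python,
    # so exclude them too); report the offenders in the failure message.
    bad = [x for x in output.entity_ids_to_enrich
           if not isinstance(x, int) or isinstance(x, bool)]
    assert not bad, f"non-integer entity IDs enqueued: {bad!r}"
```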

Out of scope

  • Driver-level hardening of _fetch_entity to log + skip on non-int input. Defense-in-depth, but the right fix is to stop the bad data at the source.
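For reference only (this issue keeps it out of scope), the log-and-skip guard would be roughly the following; the function name and logger are illustrative:

```python
import logging

log = logging.getLogger("enrichment")

def safe_entity_id(entity_id):
    # Return entity_id if it is a usable integer PK, else log and return
    # None; callers would skip enrichment on None instead of crashing the
    # workflow. Illustrative only — the real fix belongs in the resolver.
    if isinstance(entity_id, int) and not isinstance(entity_id, bool):
        return entity_id
    log.warning("skipping enrichment for non-int entity_id %r", entity_id)
    return None
```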
