Validate 0.530 gains on more LoCoMo samples (currently sample-0 only) #20

Description

@StigNorland

Every result from the 2026-06-18 session (graph 0.286→0.530) is on sample 0 (conv-26) only, so the gains may be overfit to a single conversation.

Task: run the best config on samples 1–9 (replace N in the command below with the sample index) and confirm the gains hold.

bash -ic '.venv/bin/python src/run_locomo.py --sample N --modes graph,baseline --typed --workers 8 --out results/typed_sN.json'
  • Check that graph still beats baseline in every category, especially cat2 (temporal) and cat3 (inference), where the sample-0 gains were largest.
  • Run a 50-QA subset (--max-qa 50) first for a fast trend, then the full set.
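
A minimal sketch of how the sweep above could be scripted, assuming the flags behave exactly as in the command in this issue. The `_qa50` output suffix, the helper names `build_command` and `regressions`, and the idea of feeding per-category scores as plain dicts are all hypothetical; the actual layout of the results JSON is not shown here, so extracting scores from it is left out. The sketch also drops the `bash -ic` wrapper and calls the venv interpreter directly, which may not suit every shell setup.

```python
import shlex

PYTHON = ".venv/bin/python"  # interpreter path taken from the issue's command
SCRIPT = "src/run_locomo.py"


def build_command(sample, max_qa=None):
    """Return the argv list for one LoCoMo sample run."""
    # Hypothetical naming so the quick pass does not overwrite the full run.
    suffix = f"_qa{max_qa}" if max_qa is not None else ""
    cmd = [
        PYTHON, SCRIPT,
        "--sample", str(sample),
        "--modes", "graph,baseline",
        "--typed",
        "--workers", "8",
        "--out", f"results/typed_s{sample}{suffix}.json",
    ]
    if max_qa is not None:
        cmd += ["--max-qa", str(max_qa)]
    return cmd


def regressions(graph, baseline):
    """Return the categories where graph does not beat baseline.

    Both arguments are plain dicts mapping a category name to a score
    (an assumed shape, not the script's actual output format).
    """
    return sorted(cat for cat in baseline if graph.get(cat, 0.0) <= baseline[cat])


# Dry run: print the quick 50-QA pass first, then the full pass, for samples 1-9.
for max_qa in (50, None):
    for sample in range(1, 10):
        print(shlex.join(build_command(sample, max_qa)))
```

To execute rather than print, each list could be passed to `subprocess.run(cmd, check=True)`. An empty list from `regressions` for a sample would mean graph beat baseline in every category supplied.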

This gates trusting the 0.530 headline. See memory dgm-handoff-2026-06-18. Refs #18.
