Every result from the 2026-06-18 session (graph 0.286→0.530) is on sample 0 (conv-26) only — risk of overfitting to one conversation.
Task: run the best config on samples 1–9 and confirm the gains hold.
bash -ic '.venv/bin/python src/run_locomo.py --sample N --modes graph,baseline --typed --workers 8 --out results/typed_sN.json'
- Check graph still beats baseline per-category, especially cat2 (temporal) and cat3 (inference) where sample-0 gains were largest.
- 50-QA (--max-qa 50) first for a fast trend, then full.
This gates trusting the 0.530 headline. See memory dgm-handoff-2026-06-18. Refs #18.
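A minimal sketch of how the sweep over samples 1–9 could be scripted, assuming the flags behave exactly as in the command above. The helper names (`build_command`, `plan`, `SAMPLES`) are hypothetical and not part of the repo, and the `bash -ic` wrapper is omitted. By default it only prints the commands as a dry run and executes nothing.

```python
import shlex

# Samples 1-9; sample 0 (conv-26) is already covered by the 2026-06-18 session.
SAMPLES = range(1, 10)


def build_command(sample, max_qa=None):
    """Return the argv list for one run_locomo.py invocation (hypothetical helper)."""
    cmd = [
        ".venv/bin/python", "src/run_locomo.py",
        "--sample", str(sample),
        "--modes", "graph,baseline",
        "--typed",
        "--workers", "8",
    ]
    if max_qa is not None:
        # Fast-trend pass: cap the number of QA pairs.
        cmd += ["--max-qa", str(max_qa)]
    cmd += ["--out", f"results/typed_s{sample}.json"]
    return cmd


def plan(max_qa=None):
    """Build one command per sample; nothing is executed here."""
    return [build_command(n, max_qa) for n in SAMPLES]


if __name__ == "__main__":
    # Dry run: print the 50-QA commands so they can be reviewed or pasted.
    for cmd in plan(max_qa=50):
        print(shlex.join(cmd))
```

Calling `plan()` with no argument yields the full-run commands. Both passes write to the same `results/typed_sN.json` path as written, so the 50-QA output would need to be copied aside before the full run if both are to be kept.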