Summary
The `eval_accuracy` dimension scorer gives passing scores (≥ 70) to conversations planted with two specific accuracy failure types:
- Overgeneralized claims (`AccuracyLabel.precision = "overgeneralized"`): the agent states something that goes beyond what the KB chunk supports — e.g. presenting a scoped rule as universally true.
- Contradicted claims (`AccuracyLabel.status = "contradicted"`): the agent states something the KB explicitly contradicts.
Both failure types should produce a failing accuracy score. Instead, the scorer appears to evaluate whether the claim is plausible, not whether it is faithful to the specific KB chunk, and frequently awards passing scores.
Expected behavior
Conversations planted with `AccuracyLabel.precision = "overgeneralized"` or `AccuracyLabel.status = "contradicted"` should receive a failing accuracy score (< 70).
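The expected behavior can be written as a small regression check. This is a minimal sketch, not the project's actual test code: the `AccuracyLabel` dataclass below is a hypothetical stand-in that models only the two fields named in this report, and `expects_failure` and `missed_failures` are illustrative helper names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PASS_THRESHOLD = 70  # per this report: a score >= 70 is a pass


@dataclass
class AccuracyLabel:
    # Hypothetical stand-in for the real AccuracyLabel; only the two
    # fields mentioned in this report are modeled here.
    precision: Optional[str] = None
    status: Optional[str] = None


def expects_failure(label: AccuracyLabel) -> bool:
    """True if the planted label is one of the two failure types in this report."""
    return label.precision == "overgeneralized" or label.status == "contradicted"


def missed_failures(
    scored: List[Tuple[AccuracyLabel, float]]
) -> List[Tuple[AccuracyLabel, float]]:
    """Return (label, score) pairs where a planted failure got a passing score."""
    return [
        (label, score)
        for label, score in scored
        if expects_failure(label) and score >= PASS_THRESHOLD
    ]
```

With a fixed scorer, `missed_failures` should return an empty list for any batch of planted conversations; each entry it returns is one instance of the bug described above.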
Impact
Accuracy evaluations report an inflated false-positive count, because planted failures are scored as passes. Any eval run that measures detection rates for these two failure types will report a lower detection rate than a faithful scorer would.
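To quantify the impact for a given run, the detection rate can be computed per planted failure type. The sketch below assumes results are available as `(failure_type, score)` pairs; that shape, the `detection_rates` helper, and the sample scores are all hypothetical and not taken from a real run.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PASS_THRESHOLD = 70  # per this report: a score >= 70 is a pass


def detection_rates(results: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Fraction of planted failures scored below the pass threshold, per type."""
    detected: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for failure_type, score in results:
        total[failure_type] += 1
        if score < PASS_THRESHOLD:  # a failing score means the failure was detected
            detected[failure_type] += 1
    return {t: detected[t] / total[t] for t in total}


# Made-up scores for illustration only.
sample = [
    ("overgeneralized", 85),
    ("overgeneralized", 40),
    ("contradicted", 75),
    ("exact", 20),
]
rates = detection_rates(sample)
# rates == {"overgeneralized": 0.5, "contradicted": 0.0, "exact": 1.0}
```

A rate well below 1.0 for `overgeneralized` or `contradicted`, alongside a high rate for `exact`, would match the pattern described in this report.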
Notes
Observed with Claude Haiku 4.5. The `exact` precision mode is not affected: failures planted via `AccuracyLabel.precision = "exact"` are correctly detected. Users relying on overgeneralized or contradicted planting for accuracy evaluation should treat their false-positive (FP) counts as unreliable until this is resolved.