`eval_accuracy` scorer gives passing scores to planted accuracy failures (overgeneralized and contradicted claims)

## Summary

The `eval_accuracy` dimension scorer gives passing scores (≥ 70) to conversations planted with two specific accuracy failure types:

- **Overgeneralized claims** (`AccuracyLabel.precision = "overgeneralized"`): the agent states something that goes beyond what the KB chunk supports, e.g. presenting a scoped rule as universally true.
- **Contradicted claims** (`AccuracyLabel.status = "contradicted"`): the agent states something the KB explicitly contradicts.

Both failure types should produce a failing accuracy score. Instead, the scorer appears to evaluate whether the claim is *plausible*, not whether it is *faithful to the specific KB chunk*, and frequently awards passing scores.

## Expected behavior

Conversations planted with `AccuracyLabel.precision = "overgeneralized"` or `AccuracyLabel.status = "contradicted"` should receive a failing accuracy score (< 70).
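To make the expectation concrete, here is a minimal sketch of the check a regression test could apply. The `AccuracyLabel` dataclass below is a hypothetical stand-in (the real class may have different fields and types), and `should_fail` and `violates_expectation` are illustrative helper names, not part of the project. The only facts taken from this issue are the two label values and the pass threshold of 70.

```python
from dataclasses import dataclass
from typing import Optional

PASS_THRESHOLD = 70  # per this issue: scores >= 70 pass, scores < 70 fail


@dataclass
class AccuracyLabel:
    # Hypothetical stand-in for the project's AccuracyLabel.
    precision: Optional[str] = None
    status: Optional[str] = None


def should_fail(label: AccuracyLabel) -> bool:
    """True if the label is one of the two failure types this issue covers."""
    return label.precision == "overgeneralized" or label.status == "contradicted"


def violates_expectation(label: AccuracyLabel, score: float) -> bool:
    """True when a planted failure nevertheless received a passing score."""
    return should_fail(label) and score >= PASS_THRESHOLD
```

A test suite could assert that `violates_expectation` is false for every planted conversation; the behavior reported here would make that assertion fail.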

## Impact

False-positive counts in accuracy evaluations are inflated. Any eval run that measures detection rates for these failure types will report lower accuracy than the true figure.
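As a rough illustration of how this shows up in a detection-rate measurement, the sketch below computes the fraction of planted failures that the scorer actually fails. The function name `detection_rate` and the scores in the usage note are made up for illustration; only the threshold of 70 comes from this issue.

```python
PASS_THRESHOLD = 70  # per this issue: scores >= 70 pass


def detection_rate(scores_for_planted_failures):
    """Fraction of planted failures that received a failing score (< 70)."""
    scores = list(scores_for_planted_failures)
    if not scores:
        raise ValueError("need at least one planted failure")
    detected = sum(1 for score in scores if score < PASS_THRESHOLD)
    return detected / len(scores)
```

For example, with illustrative scores `[85, 72, 40, 65]` for four planted failures, two score below 70, so `detection_rate` returns `0.5`; the two passing scores are the cases this issue describes.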

## Notes

Observed with Claude Haiku 4.5. The `exact` precision mode is not affected: failures planted via `AccuracyLabel.precision = "exact"` are correctly detected. Users relying on overgeneralized or contradicted planting for accuracy evaluation should treat their FP counts as unreliable until this is resolved.

`eval_accuracy` scorer gives passing scores to planted accuracy failures (overgeneralized and contradicted claims) #12
