
investigate: qualityScore calculation — arbitrary thresholds, binary metrics, no semantic validation #1

@Acharnite

Summary

The three quality scoring functions in src/eval/quality.ts (scoreCompression, scoreSummary, scoreContextRelevance) use heuristic-based scoring with arbitrary thresholds and binary on/off checks. They lack semantic validation, gradual penalty scaling, and negative scoring for genuinely poor content.

Current scoring breakdown

scoreCompression(obs) — rates observation compression quality (0-100)

| Criterion | Points | Threshold |
|---|---|---|
| facts present | +25 | any facts |
| facts >= 3 | +10 | ≥3 facts |
| narrative length | +20 | ≥20 chars |
| narrative longer | +5 | ≥50 chars |
| title length | +15 | 5-120 chars |
| concepts present | +15 | any concepts |
| importance set | +10 | 1-10 range |
| Max | 100 | |

scoreSummary(summary) — rates session summary quality (0-100)

| Criterion | Points | Threshold |
|---|---|---|
| title length | +20 | ≥5 chars |
| narrative length | +25 | ≥20 chars |
| narrative longer | +5 | ≥100 chars |
| keyDecisions present | +20 | any entries |
| filesModified present | +15 | any entries |
| concepts present | +15 | any entries |
| Anti-pattern penalty | -80 | hardcoded pattern list |
| Max | 100 | |

scoreContextRelevance(context, project) — rates retrieval quality (0-100)

| Criterion | Points | Threshold |
|---|---|---|
| context non-empty | +20 | >0 chars |
| project match | +20 | case-insensitive includes() |
| XML content | +15 | contains < |
| XML sections >=2 | +15 | <tag> count |
| XML sections >=4 | +10 | <tag> count |
| length >=100 | +10 | 100 chars |
| length >=500 | +10 | 500 chars |
| Max | 100 | |

Identified issues

1. Arbitrary thresholds

No data-driven basis for any threshold. Why 20/50/100 chars for narrative length? Why 5-120 for title? Why ≥3 facts?

2. Binary on/off (not gradual)

In scoreCompression, having ≥3 facts earns the same +10 whether there are 3 facts or 50. In scoreSummary, a 20-character narrative earns +25 while a 19-character one earns 0. There are no diminishing returns or saturation curves.
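As a rough illustration of what gradual scoring could look like, here is a minimal sketch that replaces a hard cutoff with a saturating curve. The helper name `saturatingScore`, the exponential form, and the half-point values are all assumptions for discussion; none of them come from src/eval/quality.ts or from measured data.

```typescript
// Hypothetical helper: maps a raw count or length onto [0, maxPoints)
// with diminishing returns, instead of a single on/off threshold.
// `halfPoint` is the input value at which half of maxPoints is awarded.
function saturatingScore(value: number, maxPoints: number, halfPoint: number): number {
  if (value <= 0 || halfPoint <= 0) return 0;
  // 1 - 2^(-value/halfPoint): 0 at value = 0, 0.5 at value = halfPoint, approaches 1.
  return maxPoints * (1 - Math.pow(2, -value / halfPoint));
}

// Illustrative only: the half-point of 40 chars is a placeholder, not a tuned value.
const narrativePoints = (narrative: string): number =>
  saturatingScore(narrative.trim().length, 25, 40);
```

With a curve like this, a 19-character and a 20-character narrative receive almost the same score, and very long narratives approach the cap rather than jumping to it.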

3. No negative scoring for bad content

Only the anti-pattern penalty (-80) reduces score. A summary with all fields present but completely incoherent can still score 80+.
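To make this concrete, the sketch below re-implements only the additive rows of the scoreSummary table above. It is a reconstruction from the table, not the code in src/eval/quality.ts: the `SessionSummary` shape and the function name `scoreSummaryFromTable` are assumptions, and the anti-pattern penalty is left out because its pattern list is not shown in this issue.

```typescript
// Shape assumed from the field names in the scoreSummary table.
interface SessionSummary {
  title: string;
  narrative: string;
  keyDecisions: string[];
  filesModified: string[];
  concepts: string[];
}

// Reconstruction of the table rows only; NOT the actual implementation.
// The -80 anti-pattern penalty is intentionally omitted.
function scoreSummaryFromTable(s: SessionSummary): number {
  let score = 0;
  if (s.title.length >= 5) score += 20;
  if (s.narrative.length >= 20) score += 25;
  if (s.narrative.length >= 100) score += 5;
  if (s.keyDecisions.length > 0) score += 20;
  if (s.filesModified.length > 0) score += 15;
  if (s.concepts.length > 0) score += 15;
  return Math.min(100, score);
}

// Every field is populated, but none of it means anything.
const incoherent: SessionSummary = {
  title: "asdf qwer",
  narrative: "lorem ".repeat(20), // 120 characters of filler
  keyDecisions: ["???"],
  filesModified: ["n/a"],
  concepts: ["thing"],
};
```

Under these assumptions, `scoreSummaryFromTable(incoherent)` returns 100, since every presence and length check passes regardless of content.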

4. scoreContextRelevance XML counting is fragile

Uses context.includes("<") and regex-based XML tag counting. Code containing generics or markup (<T>, </div>) scores higher than dense plain text. There is no semantic relevance check, only structural heuristics.
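The fragility is easy to reproduce with any simple tag-matching regex. The sketch below uses an assumed stand-in pattern (the actual regex in quality.ts is not quoted in this issue and may differ) and the hypothetical name `countTagLike`.

```typescript
// Assumed stand-in for the tag counter; the real regex in quality.ts may differ.
function countTagLike(text: string): number {
  const matches = text.match(/<\/?[A-Za-z][^<>]*>/g);
  return matches ? matches.length : 0;
}

// TypeScript generics look like XML tags to a pattern of this kind.
const genericCode =
  "function id<T>(x: T): T { return x; } const m = new Map<string, number>();";

// Dense, relevant prose contains no angle brackets at all.
const plainText =
  "The auth module retries failed token refreshes three times before logging the user out.";
```

Here `countTagLike(genericCode)` is 2 and `countTagLike(plainText)` is 0, so under the table's rules the code snippet would collect both the "XML content" and "XML sections >=2" points while the prose collects neither.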

5. No cross-field validation

Fields are scored independently, so contradictions (title says "error", narrative says "success") carry no penalty.

6. Anti-pattern detection is hardcoded and limited

Only 5 hardcoded string patterns in scoreSummary. isPromptLeakage() in summarize.ts has a superset of patterns but is used to reject before scoring, not to penalize within the scoring function itself.
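One possible shape for unifying the two lists is a single shared pattern table with two thin callers, one that rejects and one that penalizes. This is only a sketch: the patterns are placeholders rather than the real lists, and `SHARED_ANTI_PATTERNS`, `matchAntiPatterns`, `looksLikeLeakage`, and `antiPatternPenalty` are hypothetical names, not existing exports.

```typescript
// Placeholder patterns for illustration; the real lists live in
// scoreSummary and isPromptLeakage() and are not reproduced here.
const SHARED_ANTI_PATTERNS: readonly RegExp[] = [
  /you are an? (ai|assistant)/i,
  /as an ai language model/i,
  /system prompt/i,
];

// Returns the patterns that match, so callers can decide to reject or penalize.
// The regexes carry no `g` flag, so `test` has no state between calls.
function matchAntiPatterns(text: string): RegExp[] {
  return SHARED_ANTI_PATTERNS.filter((p) => p.test(text));
}

// Hypothetical gate-style caller (the role isPromptLeakage() plays before scoring).
function looksLikeLeakage(text: string): boolean {
  return matchAntiPatterns(text).length > 0;
}

// Hypothetical scoring-style caller: a per-match penalty capped at 80,
// the size of the current flat penalty.
function antiPatternPenalty(text: string, perMatch = 40): number {
  return Math.min(80, matchAntiPatterns(text).length * perMatch);
}
```

Keeping one list means a pattern added for rejection is automatically visible to scoring, and the per-match penalty is one way to grade severity instead of applying a flat -80.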

Suggested investigation areas

  1. Are the current thresholds backed by real data analysis?
  2. Should scoring be gradual (curve-based) rather than binary?
  3. Should bad content be penalized as strongly as good content is rewarded?
  4. Can scoreContextRelevance use embedding similarity instead of XML tag counting?
  5. Should the anti-pattern detection be unified between scoreSummary and isPromptLeakage?
  6. Could the scoring inform compression decisions (e.g., dropping low-scoring observations)?
  7. How do the current scores correlate with downstream task quality?
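For item 7, one simple starting point would be a rank correlation between the scores and whatever downstream quality measure is available. The sketch below computes Spearman's rho as the Pearson correlation of average ranks. The function names and the idea of a numeric per-item "outcome" are assumptions; this issue does not define a downstream metric.

```typescript
// Average ranks (1-based); tied values share the mean of their positions.
function ranks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const out = new Array<number>(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) out[order[k].i] = avg;
    start = end + 1;
  }
  return out;
}

// Pearson correlation; returns NaN when either input has zero variance.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxx === 0 || syy === 0 ? NaN : sxy / Math.sqrt(sxx * syy);
}

// Spearman's rho = Pearson correlation of the ranks.
// `outcomes` is a hypothetical per-item downstream quality measure.
function spearman(scores: number[], outcomes: number[]): number {
  if (scores.length !== outcomes.length || scores.length < 2) {
    throw new Error("need two equal-length samples with at least 2 points");
  }
  return pearson(ranks(scores), ranks(outcomes));
}
```

A rho near 0 on real data would suggest the current thresholds carry little signal about downstream quality. Rank correlation is used here because the scores are stepwise rather than continuous.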

Context

Found while working on summary quality improvements in PR rohitg00#813 (knowledge graph fixes). The anti-pattern penalty in scoreSummary was added as a hotfix, but a systematic review of all three scoring functions would be valuable before further quality work.
