Summary
The three quality scoring functions in src/eval/quality.ts (scoreCompression, scoreSummary, scoreContextRelevance) use heuristic-based scoring with arbitrary thresholds and binary on/off checks. They lack semantic validation, gradual penalty scaling, and negative scoring for genuinely poor content.
Current scoring breakdown
scoreCompression(obs) — rates observation compression quality (0-100)
| Criterion | Points | Threshold |
|---|---|---|
| facts present | +25 | any facts |
| facts >= 3 | +10 | ≥3 facts |
| narrative length | +20 | ≥20 chars |
| narrative longer | +5 | ≥50 chars |
| title length | +15 | 5-120 chars |
| concepts present | +15 | any concepts |
| importance set | +10 | 1-10 range |
| Max | 100 | |
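To make the table concrete, here is a minimal sketch of what the scoreCompression rows amount to when read as code. It is reconstructed only from the table above, not copied from src/eval/quality.ts; the `ObservationLike` type, its field names, and the function name `scoreCompressionSketch` are assumptions for illustration.

```typescript
// Hypothetical shape; the real observation type in the repo may differ.
interface ObservationLike {
  title?: string;
  narrative?: string;
  facts?: string[];
  concepts?: string[];
  importance?: number;
}

// Sketch of the table rows as independent threshold checks (assumed logic).
function scoreCompressionSketch(obs: ObservationLike): number {
  let score = 0;

  const facts = obs.facts ?? [];
  if (facts.length > 0) score += 25; // facts present
  if (facts.length >= 3) score += 10; // facts >= 3

  const narrativeLength = (obs.narrative ?? "").length;
  if (narrativeLength >= 20) score += 20; // narrative length
  if (narrativeLength >= 50) score += 5; // narrative longer

  const titleLength = (obs.title ?? "").length;
  if (titleLength >= 5 && titleLength <= 120) score += 15; // title length

  if ((obs.concepts ?? []).length > 0) score += 15; // concepts present

  const importance = obs.importance;
  if (importance !== undefined && importance >= 1 && importance <= 10) {
    score += 10; // importance set
  }

  return score;
}

// Filler content that satisfies every threshold without saying anything.
const fillerObservation: ObservationLike = {
  title: "aaaaa",
  narrative: "x".repeat(50),
  facts: ["a", "b", "c"],
  concepts: ["z"],
  importance: 1,
};
console.log(scoreCompressionSketch(fillerObservation));
```

Under these assumptions the filler observation reaches the maximum, because every row checks presence or length and none checks meaning.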
scoreSummary(summary) — rates session summary quality (0-100)
| Criterion | Points | Threshold |
|---|---|---|
| title length | +20 | ≥5 chars |
| narrative length | +25 | ≥20 chars |
| narrative longer | +5 | ≥100 chars |
| keyDecisions present | +20 | any entries |
| filesModified present | +15 | any entries |
| concepts present | +15 | any entries |
| Anti-pattern penalty | -80 | hardcoded pattern list |
| Max | 100 | |
scoreContextRelevance(context, project) — rates retrieval quality (0-100)
| Criterion | Points | Threshold |
|---|---|---|
| context non-empty | +20 | >0 chars |
| project match | +20 | case-insensitive includes() |
| XML content | +15 | contains < |
| XML sections >=2 | +15 | <tag> count |
| XML sections >=4 | +10 | <tag> count |
| length >=100 | +10 | 100 chars |
| length >=500 | +10 | 500 chars |
| Max | 100 | |
Identified issues
1. Arbitrary thresholds
No data-driven basis for any threshold. Why 20/50/100 chars for narrative length? Why 5-120 for title? Why ≥3 facts?
2. Binary on/off (not gradual)
Having ≥3 facts gives the same +10 as having 50 facts. In scoreSummary, a 20-char narrative earns +25 while a 19-char narrative earns 0. There are no diminishing returns or saturation curves.
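One possible shape for gradual scoring is a saturating curve. The sketch below compares a binary threshold (modeled on the scoreSummary "narrative length" row) with a hypothetical `saturatingScore` helper. Neither function exists in the codebase, and the `halfPoint` parameter is an assumed tuning knob that would itself need a data-driven value.

```typescript
// Binary check in the style of the current tables: all or nothing.
function binaryScore(value: number, points: number, threshold: number): number {
  return value >= threshold ? points : 0;
}

// Hypothetical alternative: value / (value + halfPoint) rises smoothly from 0,
// reaches half of maxPoints at value === halfPoint, and approaches maxPoints
// asymptotically, which gives diminishing returns for large values.
function saturatingScore(value: number, maxPoints: number, halfPoint: number): number {
  if (value <= 0 || halfPoint <= 0) return 0;
  return maxPoints * (value / (value + halfPoint));
}

// Compare both approaches around the 20-char cliff and far beyond it.
for (const narrativeLength of [0, 19, 20, 100, 1000]) {
  console.log(
    narrativeLength,
    binaryScore(narrativeLength, 25, 20),
    saturatingScore(narrativeLength, 25, 20).toFixed(1),
  );
}
```

With the binary check, 19 and 20 characters differ by 25 points; with the curve they differ by a fraction of a point, and very long narratives never exceed the cap. Other curves (logarithmic, piecewise linear) would serve the same purpose.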
3. No negative scoring for bad content
Only the anti-pattern penalty (-80) in scoreSummary reduces a score; the other two functions can only add points. A summary with all fields present but completely incoherent can still score 80+.
4. scoreContextRelevance XML counting is fragile
Uses context.includes("<") and regex-based XML tag counting, so any text containing angle brackets earns structure points. Code with generics (<T>) or markup (</div>) scores higher than dense plain text. There is no semantic relevance check, only structural heuristics.
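The fragility can be reproduced with a small sketch of the three XML rows from the scoreContextRelevance table. The function name `structuralPoints` and the tag regex are assumptions; the actual regex in src/eval/quality.ts may differ, so the exact numbers here are illustrative only.

```typescript
// Hypothetical reconstruction of the "XML content" and "XML sections" rows.
function structuralPoints(context: string): number {
  let points = 0;
  if (context.includes("<")) points += 15; // XML content

  // Assumed tag pattern: "<", a letter, anything up to the next ">".
  const tagCount = (context.match(/<[A-Za-z][^>]*>/g) ?? []).length;
  if (tagCount >= 2) points += 15; // XML sections >= 2
  if (tagCount >= 4) points += 10; // XML sections >= 4

  return points;
}

// Source code with generics and JSX, unrelated to any retrieval structure.
const codeSnippet =
  "function id<T>(x: T): T { return x; } const a: Array<number> = []; const b = <div><span>x</span></div>;";

// Dense, relevant plain text with no angle brackets.
const plainText =
  "The auth module retries token refresh three times, then falls back to a full re-login.";

console.log(structuralPoints(codeSnippet), structuralPoints(plainText));
```

Under these assumptions the code snippet collects all 40 structural points while the plain-text sentence collects none, even though only the latter carries retrievable information.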
5. No cross-field validation
Fields are scored independently, so contradictions go unpenalized (for example, a title that says "error" and a narrative that says "success").
6. Anti-pattern detection is hardcoded and limited
Only 5 hardcoded string patterns in scoreSummary. isPromptLeakage() in summarize.ts has a superset of patterns but is used to reject before scoring, not to penalize within the scoring function itself.
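One way the two code paths could share a single source of truth is sketched below. The pattern list is a placeholder: the real patterns in scoreSummary and isPromptLeakage() are not reproduced here, and the names `SHARED_ANTI_PATTERNS`, `countAntiPatternHits`, `isLeakage`, and `antiPatternPenalty`, along with the `perHit` and `cap` values, are all hypothetical.

```typescript
// Placeholder patterns for illustration only; not the project's actual list.
// No "g" flag, so RegExp.prototype.test stays stateless between calls.
const SHARED_ANTI_PATTERNS: readonly RegExp[] = [
  /as an ai language model/i,
  /you are a helpful assistant/i,
];

// Count how many distinct patterns appear in the text.
function countAntiPatternHits(text: string): number {
  return SHARED_ANTI_PATTERNS.filter((pattern) => pattern.test(text)).length;
}

// Gate: reject before scoring, the role the issue describes for isPromptLeakage().
function isLeakage(text: string): boolean {
  return countAntiPatternHits(text) > 0;
}

// Penalty: scales with the number of hits and is capped, instead of a flat -80.
function antiPatternPenalty(text: string, perHit = 40, cap = 80): number {
  return Math.min(cap, countAntiPatternHits(text) * perHit);
}

console.log(isLeakage("As an AI language model, I cannot summarize this."));
console.log(antiPatternPenalty("Refactored the retry logic in the auth module."));
```

Because both functions read the same list, adding a pattern updates the gate and the penalty together, and the scoring function no longer lags behind the rejection check.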
Suggested investigation areas
- Are the current thresholds backed by real data analysis?
- Should scoring be gradual (curve-based) rather than binary?
- Should bad content receive negative scores symmetric to the positive scores given to good content?
- Can scoreContextRelevance use embedding similarity instead of XML tag counting?
- Should the anti-pattern detection be unified between scoreSummary and isPromptLeakage?
- Could the scoring inform compression decisions (e.g., dropping low-scoring observations)?
- How do the current scores correlate with downstream task quality?
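For the last question, a first pass could be a plain correlation between heuristic scores and some downstream quality label. The sketch below computes a Pearson coefficient; the `pearson` helper is hypothetical, and the sample arrays are made-up toy values that only show the call shape, not measurements from this project.

```typescript
// Pearson correlation coefficient between two equally long samples.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("pearson needs two samples of equal length >= 2");
  }
  const n = xs.length;
  const meanX = xs.reduce((sum, x) => sum + x, 0) / n;
  const meanY = ys.reduce((sum, y) => sum + y, 0) / n;

  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }

  // A constant sample has no defined correlation.
  if (varianceX === 0 || varianceY === 0) return NaN;
  return covariance / Math.sqrt(varianceX * varianceY);
}

// Toy values only: heuristic scores vs. an assumed 1-5 human rating.
const heuristicScores = [100, 85, 60, 100, 40];
const humanRatings = [2, 4, 3, 5, 1];
console.log(pearson(heuristicScores, humanRatings).toFixed(2));
```

A coefficient near zero on real data would suggest the thresholds are not tracking quality. A rank-based measure such as Spearman may be a better fit if the relationship is monotonic but not linear.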
Context
Found while working on summary quality improvements in PR rohitg00#813 (knowledge graph fixes). The anti-pattern penalty in scoreSummary was added as a hotfix, but a systematic review of all three scoring functions would be valuable before further quality work.