investigate: qualityScore calculation — arbitrary thresholds, binary metrics, no semantic validation

## Summary

The three quality scoring functions in `src/eval/quality.ts` (`scoreCompression`, `scoreSummary`, `scoreContextRelevance`) use heuristic-based scoring with arbitrary thresholds and binary on/off checks. They lack semantic validation, gradual penalty scaling, and negative scoring for genuinely poor content.

## Current scoring breakdown

### `scoreCompression(obs)` — rates observation compression quality (0-100)

| Criterion | Points | Threshold |
|-----------|--------|-----------|
| facts present | +25 | any facts |
| facts >= 3 | +10 | ≥3 facts |
| narrative length | +20 | ≥20 chars |
| narrative longer | +5 | ≥50 chars |
| title length | +15 | 5-120 chars |
| concepts present | +15 | any concepts |
| importance set | +10 | 1-10 range |
| **Max** | **100** | |
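
Read as code, the table above amounts to roughly the following. This is a reconstruction from the table only, not the source of `src/eval/quality.ts`: the `Observation` shape, the inclusive bounds, and the name `scoreCompressionSketch` are all assumptions made for illustration.

```typescript
// Hypothetical observation shape; field names are inferred from the table,
// not copied from quality.ts.
interface Observation {
  title?: string;
  narrative?: string;
  facts?: string[];
  concepts?: string[];
  importance?: number;
}

// Reconstruction of the table: every criterion is an all-or-nothing step.
function scoreCompressionSketch(obs: Observation): number {
  const facts = obs.facts ?? [];
  const narrative = obs.narrative ?? "";
  const title = obs.title ?? "";
  const concepts = obs.concepts ?? [];

  let score = 0;
  if (facts.length > 0) score += 25; // facts present
  if (facts.length >= 3) score += 10; // facts >= 3
  if (narrative.length >= 20) score += 20; // narrative length
  if (narrative.length >= 50) score += 5; // narrative longer
  if (title.length >= 5 && title.length <= 120) score += 15; // title length
  if (concepts.length > 0) score += 15; // concepts present
  if (obs.importance !== undefined && obs.importance >= 1 && obs.importance <= 10) {
    score += 10; // importance set
  }
  return score;
}
```

Written this way, it is easy to see that a fully populated observation reaches 100 regardless of what the fields actually say.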

### `scoreSummary(summary)` — rates session summary quality (0-100)

| Criterion | Points | Threshold |
|-----------|--------|-----------|
| title length | +20 | ≥5 chars |
| narrative length | +25 | ≥20 chars |
| narrative longer | +5 | ≥100 chars |
| keyDecisions present | +20 | any entries |
| filesModified present | +15 | any entries |
| concepts present | +15 | any entries |
| Anti-pattern penalty | **-80** | hardcoded pattern list |
| **Max** | **100** | |

### `scoreContextRelevance(context, project)` — rates retrieval quality (0-100)

| Criterion | Points | Threshold |
|-----------|--------|-----------|
| context non-empty | +20 | >0 chars |
| project match | +20 | case-insensitive includes() |
| XML content | +15 | contains `<` |
| XML sections >=2 | +15 | `<tag>` count |
| XML sections >=4 | +10 | `<tag>` count |
| length >=100 | +10 | 100 chars |
| length >=500 | +10 | 500 chars |
| **Max** | **100** | |

## Identified issues

### 1. Arbitrary thresholds
No data-driven basis is documented for any threshold. Why 20/50/100 chars for narrative length? Why 5-120 for title? Why ≥3 facts?

### 2. Binary on/off (not gradual)
Every criterion is a step function. In `scoreCompression`, 3 facts earn the same +10 bonus as 50 facts. In `scoreSummary`, a 20-character narrative earns +25 while a 19-character one earns 0. There are no diminishing returns or saturation curves.
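
One possible shape for gradual scoring is a saturating curve. The sketch below is illustrative only: the helper name `saturating` and the `halfPoint` parameter are assumptions, and any concrete `halfPoint` value would still need to be chosen from data.

```typescript
// Saturating curve: 0 at value 0, approaching maxPoints as value grows.
// halfPoint is the value that earns exactly half of maxPoints.
// Both the function and its parameters are hypothetical, not part of quality.ts.
function saturating(value: number, maxPoints: number, halfPoint: number): number {
  if (value <= 0) return 0;
  return maxPoints * (value / (value + halfPoint));
}

// With halfPoint = 20, a 19-char and a 20-char narrative score almost the same,
// instead of jumping from 0 to the full award at the threshold.
const at19 = saturating(19, 25, 20);
const at20 = saturating(20, 25, 20); // exactly half of 25
console.log(at19.toFixed(2), at20.toFixed(2));
```

A curve like this removes the cliff at the threshold and gives diminishing returns for very long narratives or very large fact counts.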

### 3. No negative scoring for bad content
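Only the anti-pattern penalty (-80) in `scoreSummary` reduces a score; the tables for `scoreCompression` and `scoreContextRelevance` list no penalties at all. A summary with all fields present but completely incoherent can still score 80+.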

### 4. `scoreContextRelevance` XML counting is fragile
Uses `context.includes("<")` and regex-based XML tag counting. Code containing generics or markup (`<T>`, `</div>`) can score higher than dense plain text. There is no semantic relevance check, only structural heuristics.
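
The failure mode can be reproduced with a small stand-in. The regex in `countTagsSketch` is an assumption (the actual pattern in `quality.ts` may differ), and `structuralPoints` covers only the three XML rows of the table.

```typescript
// Hypothetical stand-in for the tag counter; the real regex may differ.
function countTagsSketch(context: string): number {
  return (context.match(/<[A-Za-z][^>]*>/g) ?? []).length;
}

// Structural points only, per the table:
// +15 for containing "<", +15 for >= 2 tags, +10 for >= 4 tags.
function structuralPoints(context: string): number {
  const tags = countTagsSketch(context);
  let points = 0;
  if (context.includes("<")) points += 15;
  if (tags >= 2) points += 15;
  if (tags >= 4) points += 10;
  return points;
}

const genericsCode =
  "function id<T>(x: T): T { return x; }\nconst m = new Map<string, number>();";
const plainText =
  "The auth module retries failed token refreshes three times before logging the user out.";

console.log(structuralPoints(genericsCode)); // 30: <T> and <string, number> count as tags
console.log(structuralPoints(plainText)); // 0
```

Under these assumptions, two generic type parameters are worth 30 points while a relevant plain-text sentence earns nothing from the structural rows.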

### 5. No cross-field validation
Fields are scored independently, so contradictions go unpenalized (for example, a title that says "error" alongside a narrative that says "success").

### 6. Anti-pattern detection is hardcoded and limited
Only 5 hardcoded string patterns in `scoreSummary`. `isPromptLeakage()` in `summarize.ts` has a superset of patterns but is used to reject *before* scoring, not to penalize within the scoring function itself.
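
One way the two checks could share a single source of truth is sketched below. The pattern list is entirely hypothetical (the real patterns in `scoreSummary` and `isPromptLeakage()` are not reproduced here), as are the names `LEAKAGE_PATTERNS`, `isLeakageSketch`, and `antiPatternPenalty`.

```typescript
// Hypothetical shared pattern list, for illustration only.
// No /g flag, so RegExp.test() carries no lastIndex state between calls.
const LEAKAGE_PATTERNS: RegExp[] = [
  /as an ai language model/i,
  /you are a helpful assistant/i,
  /\bsystem prompt\b/i,
];

function matchedPatterns(text: string, patterns: RegExp[] = LEAKAGE_PATTERNS): RegExp[] {
  return patterns.filter((p) => p.test(text));
}

// Gate: reject before scoring (the role isPromptLeakage() plays today).
function isLeakageSketch(text: string): boolean {
  return matchedPatterns(text).length > 0;
}

// Penalty: the same list applied inside the scoring function.
// The default of 80 mirrors the -80 row in the scoreSummary table.
function antiPatternPenalty(text: string, penalty = 80): number {
  return matchedPatterns(text).length > 0 ? penalty : 0;
}
```

With a shared list, a pattern added for rejection is automatically penalized in scoring as well, so the two code paths cannot drift apart.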

## Suggested investigation areas

1. Are the current thresholds backed by real data analysis?
2. Should scoring be gradual (curve-based) rather than binary?
3. Should bad content be penalized symmetrically to how good content is rewarded?
4. Can `scoreContextRelevance` use embedding similarity instead of XML tag counting?
5. Should the anti-pattern detection be unified between `scoreSummary` and `isPromptLeakage`?
6. Could the scoring inform compression decisions (e.g., dropping low-scoring observations)?
7. How do the current scores correlate with downstream task quality?
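
For item 4, an embedding-based score could be as simple as cosine similarity between two vectors. The sketch below assumes the caller already has embedding vectors for the retrieved context and for the project or query; how those vectors are produced is out of scope, and both function names are hypothetical.

```typescript
// Cosine similarity between two equal-length vectors, in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Map similarity onto the existing 0-100 scale, clamping negatives to 0.
function relevanceFromEmbeddings(contextVec: number[], queryVec: number[]): number {
  return Math.round(Math.max(0, cosineSimilarity(contextVec, queryVec)) * 100);
}
```

Whether such a score tracks downstream task quality better than tag counting is exactly what item 7 would need to measure.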

## Context

Found while working on summary quality improvements in PR #813 (knowledge graph fixes). The anti-pattern penalty in `scoreSummary` was added as a hotfix, but a systematic review of all three scoring functions would be valuable before further quality work.
