docs: THE-EVIDENCE-GAP.md to main #44

Merged: ericckzhou merged 1 commit into main from dev on May 25, 2026

@ericckzhou
Owner

Summary

Promotes the categorical wedge doc from dev to main.

PR #43 shipped `docs/THE-EVIDENCE-GAP.md` — a ~1000-word forwardable framing of why capability scores and reliability evidence answer different questions, with every claim backed by verified session ids in the bundled case-study replay stores.

Capability scores survive. The evidence that produced them usually does not.

What ships

Single dev commit forward: `4103328` (PR #43 merged into dev).

| File | Change |
|---|---|
| `docs/THE-EVIDENCE-GAP.md` | NEW (~1000 words, 5 sections) |
| `README.md` | +1 cross-link in "Why this matters" section, +1 entry in "Further reading" |
| `docs/EVIDENCE.md` | +1 callout in §1 pointing back at the wedge doc |
| `docs/case-studies/README.md` | +1 cross-link in recursion-principle blockquote |

4 files, +114 lines, 0 deletions.

Version

No version bump. Docs-only. Remains v0.4.0.

Risk

None. Pure documentation addition + four 1-line cross-link additions. No code, no schema, no behavior change.

Post-merge

Reset dev to match main per CLAUDE.md workflow.

New ~1000-word doc framing the gap capability scores leave open:
they preserve the score, not the evidence that produced it. For
stochastic systems, that asymmetry is operationally load-bearing —
production failures answer the question the score discards.
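
The asymmetry can be illustrated with a minimal sketch. The `Sample` record, its field names, and the session ids below are hypothetical and are not taken from FalsifyAI or the bundled replay stores. The point is only that a score collapses runs into one number, while evidence keeps each run addressable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical eval run; field names are illustrative only."""
    session_id: str
    passed: bool
    output: str


def capability_score(samples: list[Sample]) -> float:
    """Collapse runs into a pass rate. Session ids and outputs are discarded."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s.passed for s in samples) / len(samples)


def reliability_evidence(samples: list[Sample]) -> dict[str, Sample]:
    """Keep every run addressable by session id so it can be inspected later."""
    return {s.session_id: s for s in samples}


# Made-up runs for illustration.
samples = [
    Sample("sess-a", True, "ok"),
    Sample("sess-b", False, "timeout"),
    Sample("sess-c", True, "ok"),
    Sample("sess-d", True, "ok"),
]
score = capability_score(samples)
evidence = reliability_evidence(samples)
failed = [sid for sid, s in evidence.items() if not s.passed]
```

The float in `score` cannot say which run failed or why. The `evidence` mapping can, because each session is still retrievable by id.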

Structure (5 sections):
1. The shape of a capability score (what HumanEval / MMLU / G-Eval /
   pass@k preserve and what they discard)
2. Three production failure modes capability scores don't catch —
   each pointing at a real session id in the bundled case-study
   replay store (silent migration regression; stable verdict /
   shifted behavior; faded evidence)
3. What reliability evidence preserves
4. Why this is a category gap, not a feature gap — the temporal
   crystallization ("Capability scores survive. The evidence that
   produced them usually does not.") plus the SBOM/SARIF/Sigstore/
   OpenTelemetry/FalsifyAI evidence-layer table
5. What this is not — explicit scope: NOT a critique of capability
   scoring; not "throw away your evals"; orthogonal question, not
   replacement

Discipline:
- Every "X catches Y" claim points at a verified session id in the
  bundled replay store. No synthesized examples.
- Phrasing: consistently "reliability evidence", never "reliability
  eval" (the latter sounds like another benchmark category;
  evidence reinforces the preservation thesis).
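
One way the first rule could be checked mechanically is sketched below. This is an assumption-heavy illustration, not the project's tooling: the `sess-<8 hex digits>` id format, the `unverified_ids` helper, and the in-memory set standing in for the replay store are all hypothetical.

```python
import re

# Hypothetical id format; the real replay store's id scheme is not shown in this PR.
SESSION_ID = re.compile(r"\bsess-[0-9a-f]{8}\b")


def unverified_ids(doc_text: str, store_ids: set[str]) -> list[str]:
    """Return session ids cited in doc_text that are absent from store_ids.

    Ids are reported once each, in the order they first appear.
    """
    seen: set[str] = set()
    missing: list[str] = []
    for sid in SESSION_ID.findall(doc_text):
        if sid not in seen and sid not in store_ids:
            missing.append(sid)
        seen.add(sid)
    return missing


# Made-up document text and store contents for illustration.
doc = "Regression caught in sess-0a1b2c3d; drift visible in sess-deadbeef."
store = {"sess-0a1b2c3d"}
missing = unverified_ids(doc, store)
```

An empty result would mean every cited session id resolves to a stored session. Any id in `missing` marks a claim that the store cannot back.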

Cross-links added:
- README.md "Why this matters" section gains one sentence pointing
  at the categorical framing
- README.md "Further reading" gains the doc above EVIDENCE.md
- EVIDENCE.md §1 gains a callout pointing back at the wedge doc
- case-studies/README.md recursion-principle blockquote gains a
  link to the framing the case studies operationalize

Job of this doc: be forwardable. ~1000 words optimizes for clarity,
memorability, and category articulation over comprehensiveness.
Authoritative depth remains in EVIDENCE.md (~3000 words) and
ARCHITECTURE.md.
ericckzhou merged commit 3b6624b into main on May 25, 2026
2 checks passed