docs: THE-EVIDENCE-GAP.md to main #44
Merged
New ~1000-word doc framing the gap capability scores leave open:
they preserve the score, not the evidence that produced it. For
stochastic systems, that asymmetry is operationally load-bearing —
production failures answer the question the score discards.
Structure (5 sections):
1. The shape of a capability score (what HumanEval / MMLU / G-Eval /
pass@k preserve and what they discard)
2. Three production failure modes capability scores don't catch —
each pointing at a real session id in the bundled case-study
replay store (silent migration regression; stable verdict /
shifted behavior; faded evidence)
3. What reliability evidence preserves
4. Why this is a category gap, not a feature gap — the temporal
crystallization ("Capability scores survive. The evidence that
produced them usually does not.") plus the SBOM/SARIF/Sigstore/
OpenTelemetry/FalsifyAI evidence-layer table
5. What this is not — explicit scope: NOT a critique of capability
scoring; not "throw away your evals"; orthogonal question, not
replacement
Discipline:
- Every "X catches Y" claim points at a verified session id in the
bundled replay store. No synthesized examples.
- Phrasing: consistently "reliability evidence", never "reliability
eval" (the latter sounds like another benchmark category;
evidence reinforces the preservation thesis).
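The two discipline rules above lend themselves to a mechanical check. The following is a minimal sketch only, not the project's actual tooling: the `session-<8 hex chars>` id format, the `check_doc` helper, and the idea of passing the replay store's ids in as a plain set are all assumptions made for illustration, since this PR does not describe the store's layout.

```python
import re

# ASSUMPTION: hypothetical id format; the real replay store may use another.
SESSION_ID = re.compile(r"\bsession-[0-9a-f]{8}\b")

# Matches "reliability eval" / "reliability evals", but not
# "reliability evidence" or "reliability evaluation".
BANNED_PHRASE = re.compile(r"reliability\s+evals?\b", re.IGNORECASE)


def check_doc(text: str, known_ids: set[str]) -> list[str]:
    """Return a list of human-readable problems found in the doc text.

    `known_ids` stands in for the session ids present in the bundled
    replay store (hypothetical: how they are loaded is out of scope here).
    """
    problems = []

    # Rule 1: every cited session id must exist in the replay store.
    cited = set(SESSION_ID.findall(text))
    for sid in sorted(cited - known_ids):
        problems.append(f"unknown session id: {sid}")

    # Rule 2: the phrase "reliability eval" must not appear.
    for match in BANNED_PHRASE.finditer(text):
        line_no = text.count("\n", 0, match.start()) + 1
        phrase = " ".join(match.group(0).split())
        problems.append(
            f"line {line_no}: use 'reliability evidence', not '{phrase}'"
        )

    return problems
```

A check like this could run in CI against the doc text, failing the build whenever `check_doc` returns a non-empty list.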
Cross-links added:
- README.md "Why this matters" section gains one sentence pointing
at the categorical framing
- README.md "Further reading" gains the doc, listed above EVIDENCE.md
- EVIDENCE.md §1 gains a callout pointing back at the wedge doc
- case-studies/README.md recursion-principle blockquote gains a
link to the framing the case studies operationalize
Job of this doc: be forwardable. ~1000 words optimizes for clarity,
memorability, and category articulation over comprehensiveness.
Authoritative depth remains in EVIDENCE.md (~3000 words) and
ARCHITECTURE.md.
Summary
Promotes the categorical wedge doc from dev to main.
PR #43 shipped `docs/THE-EVIDENCE-GAP.md` — a ~1000-word forwardable framing of why capability scores and reliability evidence answer different questions, with every claim backed by verified session ids in the bundled case-study replay stores.
What ships
Single dev commit forward: `4103328` (PR #43 merged into dev).
4 files, +114 lines, 0 deletions.
Version
No version bump. Docs-only. Remains v0.4.0.
Risk
None. Pure documentation addition + four 1-line cross-link additions. No code, no schema, no behavior change.
Post-merge
Reset dev to match main per CLAUDE.md workflow.