docs: THE-EVIDENCE-GAP.md to main by ericckzhou · Pull Request #44 · ericckzhou/falsifyai

ericckzhou · 2026-05-25T10:55:53Z

Summary

Promotes the categorical wedge doc from dev to main.

PR #43 shipped `docs/THE-EVIDENCE-GAP.md` — a ~1000-word forwardable framing of why capability scores and reliability evidence answer different questions, with every claim backed by verified session ids in the bundled case-study replay stores.

Capability scores survive. The evidence that produced them usually does not.

What ships

Single dev commit forward: `4103328` (PR #43 merged into dev).

| File | Change |
| --- | --- |
| `docs/THE-EVIDENCE-GAP.md` | NEW (~1000 words, 5 sections) |
| `README.md` | +1 cross-link in "Why this matters" section, +1 entry in "Further reading" |
| `docs/EVIDENCE.md` | +1 callout in §1 pointing back at the wedge doc |
| `docs/case-studies/README.md` | +1 cross-link in recursion-principle blockquote |

4 files, +114 lines, 0 deletions.

Version

No version bump. Docs-only. Remains v0.4.0.

Risk

None. Pure documentation addition + four 1-line cross-link additions. No code, no schema, no behavior change.

Post-merge

Reset dev to match main per CLAUDE.md workflow.
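
A minimal sketch of what that reset could look like, for readers who do not have CLAUDE.md at hand. The helper name `reset_dev_to_main` and the default remote name `origin` are assumptions made for illustration; the workflow in CLAUDE.md may prescribe different steps.

```shell
# Hypothetical helper (not taken from CLAUDE.md): point local and remote
# dev at the current tip of main. The remote name defaults to "origin",
# which is an assumption.
reset_dev_to_main() {
  remote="${1:-origin}"
  # Refresh remote-tracking refs so main and dev are current.
  git fetch "$remote" || return 1
  git checkout dev || return 1
  # Discard dev's own history and move it to main's tip.
  git reset --hard "$remote/main" || return 1
  # --force-with-lease refuses to overwrite remote commits not yet fetched.
  git push --force-with-lease "$remote" dev
}
```

Running `reset_dev_to_main` from a clean working tree is the safest use, since `git reset --hard` discards uncommitted changes on dev.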

New ~1000-word doc framing the gap capability scores leave open: they preserve the score, not the evidence that produced it. For stochastic systems, that asymmetry is operationally load-bearing: production failures answer the question the score discards.

Structure (5 sections):

1. The shape of a capability score (what HumanEval / MMLU / G-Eval / pass@k preserve and what they discard)
2. Three production failure modes capability scores don't catch, each pointing at a real session id in the bundled case-study replay store (silent migration regression; stable verdict / shifted behavior; faded evidence)
3. What reliability evidence preserves
4. Why this is a category gap, not a feature gap: the temporal crystallization ("Capability scores survive. The evidence that produced them usually does not.") plus the SBOM/SARIF/Sigstore/OpenTelemetry/FalsifyAI evidence-layer table
5. What this is not (explicit scope): NOT a critique of capability scoring; not "throw away your evals"; orthogonal question, not replacement

Discipline:

- Every "X catches Y" claim points at a verified session id in the bundled replay store. No synthesized examples.
- Phrasing: consistently "reliability evidence", never "reliability eval" (the latter sounds like another benchmark category; evidence reinforces the preservation thesis).

Cross-links added:

- README.md "Why this matters" section gains one sentence pointing at the categorical framing
- README.md "Further reading" gains the doc above EVIDENCE.md
- EVIDENCE.md §1 gains a callout pointing back at the wedge doc
- case-studies/README.md recursion-principle blockquote gains a link to the framing the case studies operationalize

Job of this doc: be forwardable. ~1000 words optimizes for clarity, memorability, and category articulation over comprehensiveness. Authoritative depth remains in EVIDENCE.md (~3000 words) and ARCHITECTURE.md.

ericckzhou merged commit 3b6624b into main May 25, 2026
2 checks passed

ericckzhou merged 1 commit into main from dev
