release: docs + CS-02 bundle (post-v0.4.0) by ericckzhou · Pull Request #41 · ericckzhou/falsifyai

ericckzhou · 2026-05-24T19:24:40Z

Summary

Post-v0.4.0 documentation and evidence work accumulated on `dev`. No code changes, no version bump — purely docs, metadata, and a new case-study replay bundle.

What ships

| PR | Change |
|---|---|
| (direct) | `chore(meta)`: sync `pyproject.toml` + `CITATION.cff` keywords with GitHub repository topics |
| #38 | `docs(case-study)`: bundle CS-02 ReplayStore (96 KB, 2 sessions, SHA256 verified) + narrative restructure |
| #39 | `docs(readme)`: initial utility-first restructure, workflow before philosophy |
| #40 | `docs(readme)`: utility-first v2, guided realization structure with verbatim-captured terminal output |

Diff footprint

- `README.md`: substantial restructure (utility-first, output-first 5-min proof captured verbatim from bundled case-study replay store)
- `docs/case-studies/02-resolver-arbitration-boundary-shift.md`: restructured around boundary-allocation effect
- `docs/case-studies/data/case-study-02.db`: NEW bundled replay artifact (96 KB)
- `docs/case-studies/data/README.md`: extended with CS-02 provenance section
- `pyproject.toml` + `CITATION.cff`: keyword sync with GitHub topics

Verification

- `falsifyai verify --all` on `docs/case-studies/data/case-study-replays.db`: 64/64 checks passing
- `falsifyai verify --all` on `docs/case-studies/data/case-study-02.db`: 16/16 checks passing
- All README terminal snippets captured verbatim from bundled stores (no synthesized output)

No version bump

This is docs-only. Version remains 0.4.0.

Post-merge

Per CLAUDE.md branch workflow, dev will be reset to match main after this squash-merges, to prevent ghost-history conflicts on the next dev→main promotion.

🤖 Generated with Claude Code

Missed in the v0.3.0 release prep (PR-29 added the case study; PR-33 bumped the roadmap entry but not the case studies table). Ships with the next dev → main promotion.

Third and final item in the locked artifact-infrastructure track. Closes the sequence verify -> export --bundle -> embedded CLI invocation.

After PR-35, the artifact can answer four questions without external bookkeeping:
- what happened (existing case results + verdict)
- how it was evaluated (existing materialized spec + invariants)
- what was exported (existing bundle manifest + bundle_id)
- what exact command produced it (new cli_invocation)

That's procedural provenance: a real capability increase on top of v0.3.0, even though implementation size is small.

Semantic boundary (load-bearing): cli_invocation records *what command produced the artifact*, NOT a guarantee that re-running it will produce identical outputs. Replay determinism still lives in materialized_hash (preserved perturbation evidence) and the bundle's bundle_id (preserved manifest + file hashes). The bundle's auto-generated README carries an explicit disclaimer.

Capture contract (deliberately narrow):
- argv: normalized invocation tokens. argv[0] canonicalized to 'falsifyai' regardless of entry path (entry-point launcher, python -m, direct script). Subsequent tokens preserved verbatim.
- falsifyai_version: runtime package version at capture time.
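The capture contract above can be sketched in a few lines. This is an illustration only, not the project's implementation: the `CliInvocation` field names follow the commit message, while `capture_invocation`, `render_command`, and the example tokens (`run`, the spec path) are hypothetical.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass
from typing import Sequence

# Canonical program name that argv[0] is normalized to, per the contract.
CANONICAL_PROGRAM = "falsifyai"


@dataclass(frozen=True)
class CliInvocation:
    """Hypothetical shape: only the two fields named in the contract."""

    argv: tuple[str, ...]
    falsifyai_version: str


def capture_invocation(raw_argv: Sequence[str], version: str) -> CliInvocation:
    """Canonicalize argv[0]; keep every later token exactly as received.

    Nothing else is read: no environment variables, no working directory,
    no hostname. The caller passes in the version string explicitly.
    """
    rest = tuple(raw_argv[1:])  # empty when raw_argv has zero or one token
    return CliInvocation(argv=(CANONICAL_PROGRAM, *rest), falsifyai_version=version)


def render_command(invocation: CliInvocation) -> str:
    """Render argv as a single shell-quoted string for human readers."""
    return shlex.join(invocation.argv)


if __name__ == "__main__":
    # Simulates a `python -m` style entry path; the tokens are made up.
    inv = capture_invocation(
        ["/site-packages/falsifyai/__main__.py", "run", "specs/spec v1.yaml"],
        version="0.4.0",
    )
    print(render_command(inv))
```

Passing the version in as an argument keeps the sketch free of package-metadata lookups; a real capture point would presumably read it from the installed package.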
Explicitly NOT captured by design:
- environment variables (secret-leakage surface)
- API keys (no auth-bearing CLI flag today; future flags MUST redact)
- current working directory (specs are referenced by path in argv)
- hostname / username / machine identifiers (operator identity belongs at the commit / export layer, not the artifact)
- shell history / pre-shell-expansion argv (unavailable by construction)
- file contents (spec YAML lives in MaterializedSpec already)

Capture point:
- cmd_run only (single capture point at function entry, before any subcommand-internal mutation)
- Read-only consumer surfaces (replay, inspect, history, diff, verify, export) NEVER stamp invocation (preservation discipline)

Backward compatibility:
- Default value is None (explicit absence, not a sentinel)
- Deserializer treats missing JSON key as None
- Pre-PR-35 artifacts load cleanly; verify does NOT gain a 9th check

Bundle wire-up:
- PR-32 already added a conditional render path; PR-35 makes it light up
- Rendering improved: shlex.join argv + falsifyai_version + the determinism-vs-provenance disclaimer

Architectural assertion (Tier-4 discipline):
- test_cli_invocation_model_does_not_import_resolver: importing CliInvocation alone never loads verdict.resolver
- test_replay_models_module_does_not_import_resolver: broader assertion that the preservation layer stays independent of the interpretation layer (future field additions can't accidentally cross the boundary)

Folded co-edits (per plan §G):
- README compliance-routing callout above 'Status and roadmap' pointing procurement readers at docs/COMPLIANCE.md (candidate #5 fold)
- One-line note in 'What's in the evidence?' section about the new field
- Architectural-test-pattern-as-template observation lives in the walkthrough rather than a separate doc file (candidate #6 fold)

Docs:
- CHANGELOG [Unreleased]: PR-35 entry + capture contract summary
- docs/EVIDENCE.md §6: mark artifact-infrastructure track 3-of-3 shipped; note signing is the remaining locked deferral
- docs/COMPLIANCE.md §2 Annex IV mapping: new 'Command that produced this evidence' row mapping to cli_invocation.argv + cli_invocation.falsifyai_version

Tests: +24 unit (model+capture+architectural) + 5 integration = 29 new. Suite: 519 -> 541 passing. Ruff + format clean.

Out of scope (explicit non-goals):
- Cryptographic signing (signature_slots remain reserved)
- Bundle import / round-trip
- --json output mode
- Verify check for invocation presence (would break pre-PR-35 artifacts)
- Capture from non-cmd_run commands
- Capture from a hypothetical programmatic falsifyai.runtime.run() API

Version: this is a MINOR bump (new feature, new public dataclass, new persisted schema field). Release-prep PR after merge should target v0.4.0 ('Artifact-infrastructure track complete'), not v0.3.1.

Three stale references found by audit after the artifact-infrastructure track closed with PR-35:
- README.md 'Coming next' section: was 'persisted CLI-invocation field (next)'; the track is now complete. Replaced with 'track complete' framing + the 'driven by external pressure' next-pull dynamic (regulatory signing, second case study, falsifyai import, etc.).
- docs/case-studies/02 opening: claimed specs were 'planned follow-up work', but PR-30 actually shipped them. Updated to link the existing specs/ directory.
- docs/case-studies/README.md index row for CS-02: said 'CLI formalization forthcoming'; specs already exist. Updated tools-used label.

Release-state metadata (status banner, current-release entry, RELEASE.md illustrative examples) intentionally NOT changed here; those bump in the v0.4.0 release-prep cycle, not on dev between releases.

Artifact-infrastructure track complete (3 of 3 locked items shipped). v0.4.0 adds persisted cli_invocation on ReplayArtifact via PR-35.

Version bumps (sources of truth, all aligned):
- pyproject.toml: 0.3.0 -> 0.4.0
- falsifyai/__init__.py: 0.3.0 -> 0.4.0
- tests/unit/test_version.py: rename + bump asserted value
- CITATION.cff version: 0.3.0 -> 0.4.0
- uv.lock editable-install entry: 0.3.0 -> 0.4.0 (synced in same commit, not as a trailing fix)

Release-state metadata (drift-prone files per RELEASE.md §7):
- README.md status banner: new wave: artifact-infrastructure track complete; cli_invocation + four-question artifact summary
- README.md "Status and roadmap": 0.4.0 (current release) section promoted; 0.3.0 demoted to compact historical entry
- README.md "Coming next": cleaned up the leftover diff-sharpening "Shipped in v0.3.0" bullet; reduced to a single "driven by external pressure" paragraph
- CHANGELOG.md: [Unreleased] placeholder preserved on top; PR-35 entries promoted into new [0.4.0] - 2026-05-24 section; [0.4.0] link reference added
- docs/RELEASE.md: MINOR example (0.2.0->0.3.0 -> 0.3.0->0.4.0); tag commands v0.4.0; GitHub Release example v0.4.0 + thematic name; post-release dev-marker example 0.4.0->0.5.0
- docs/case-studies/01-invisible-character-substitution.md: pip install command bumped to 0.4.0 (2 occurrences)

Bonus audit fixes, stale 'vNEXT' placeholders left over from v0.3.0 release-prep (caught by the comprehensive grep this cycle):
- docs/COMPLIANCE.md:138, 165: 'shipped in vNEXT' -> 'shipped in v0.3.0' (these always meant v0.3.0; vNEXT was the pre-release placeholder I forgot to bump)
- docs/COMPLIANCE.md:12: 'As of v0.3.0...' -> 'As of v0.4.0...'; also names the new procedural-provenance capability that v0.4.0 adds
- docs/COMPLIANCE.md:56: cli_invocation row: 'since v0.3.0+PR-35' -> 'since v0.4.0' (since the PR ships in 0.4.0, not 0.3.0)
- docs/EVIDENCE.md:328: '3-of-3 shipped as of v0.3.0 + PR-35' -> 'as of v0.4.0' (cleaner once PR-35 has a release version of its own)

Verified: 510 + 31 = 541 tests pass. Ruff + format clean.

Preserved as intentional historical references (NOT changed):
- CHANGELOG.md [0.3.0] entry body: historical, immutable
- CHANGELOG.md [0.3.0] link reference: always present
- README byte-identical-to-v0.2.0 fixture claim: load-bearing fixture
- README.md "0.3.0 - Artifact-infrastructure track (2 of 3)" roadmap entry: historical
- tests/integration/test_diff_end_to_end.py v0.2.0_baseline.txt fixture refs: load-bearing fixture
- tests/unit/test_bundle_writer.py _FIXED_VERSION = "0.3.0.test": test sentinel
- tests/unit/test_cli_invocation_model.py falsifyai_version="0.3.0" test-data values: test fixtures, not release-state metadata
- .github/workflows/publish.yml GITHUB_REF format comment: illustrative

What v0.4.0 ships (since v0.3.0):
- PR-35: feat(replay): persist cli_invocation on ReplayArtifact
- Post-PR-35 docs audit: update stale 'next' / 'forthcoming' framing
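Keeping several version "sources of truth" aligned is easy to automate. The sketch below is one hedged way to do it: the file names come from the commit message, but the regular expressions assume conventional layouts (`version = "..."` in `pyproject.toml`, `__version__ = "..."` in `__init__.py`, a top-level `version:` key in `CITATION.cff`), which this page does not confirm. Both function names are hypothetical.

```python
from __future__ import annotations

import re

# Assumed per-file patterns; each captures the version string in group 1.
# The CITATION.cff pattern anchors at line start so `cff-version:` is skipped.
VERSION_PATTERNS = {
    "pyproject.toml": r'(?m)^version\s*=\s*"([^"]+)"',
    "falsifyai/__init__.py": r'(?m)^__version__\s*=\s*"([^"]+)"',
    "CITATION.cff": r'(?m)^version:\s*"?([^"\s]+)"?',
}


def extract_versions(file_texts: dict[str, str]) -> dict[str, str | None]:
    """Return the version found in each known file, or None when absent."""
    found: dict[str, str | None] = {}
    for name, pattern in VERSION_PATTERNS.items():
        match = re.search(pattern, file_texts.get(name, ""))
        found[name] = match.group(1) if match else None
    return found


def find_mismatches(versions: dict[str, str | None], expected: str) -> list[str]:
    """List the files whose version differs from (or lacks) the expected one."""
    return sorted(name for name, value in versions.items() if value != expected)


if __name__ == "__main__":
    # Inline sample contents stand in for the real files.
    sample = {
        "pyproject.toml": '[project]\nname = "falsifyai"\nversion = "0.4.0"\n',
        "falsifyai/__init__.py": '__version__ = "0.4.0"\n',
        "CITATION.cff": "cff-version: 1.2.0\nversion: 0.3.0\n",
    }
    print(find_mismatches(extract_versions(sample), expected="0.4.0"))
```

Taking file contents as a dictionary rather than reading from disk keeps the check trivial to unit-test; a thin wrapper could load the real files.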

…topics

GitHub repo description and topics were updated post-v0.4.0 to reflect the artifact-infrastructure positioning. Keyword fields in pyproject.toml and CITATION.cff were left behind; this commit aligns them so PyPI search and the CITATION record surface the same terms as the repo.

Removed: 'robustness' (pyproject), 'adversarial-robustness' (CITATION). Perturbation families are deliberately non-adversarial; the term was discovery-misleading.

Added: ai-evaluation, llm-evaluation, ai-safety, reliability-testing, evaluation-validity, replayable-evidence, evidence-infrastructure, provenance, model-migration, eu-ai-act (pyproject); LLM-evaluation, model-migration, regression-testing, differential-testing, evaluation-validity, provenance, EU-AI-Act (CITATION).

GitHub topic edits applied via gh repo edit (already live).

Three things ship together:

1. Bundled ReplayStore at docs/case-studies/data/case-study-02.db (96 KB, SHA256 eba7d89db5f961951bba712ae2ba473e1d74452e515959fdcc90b7964c2f3f7b). Two sessions produced by running the v1 and v2 specs against Claude Sonnet 4.6 via the Anthropic provider on 2026-05-24. Both sessions verify cleanly (16/16 integrity checks); diff reports 1 unchanged exit 0. The bundled run confirms the central CS-02 claim (boundary-allocation effect) at the verdict layer.

2. Extended data/README.md with a CS-02 bundle provenance section following the same shape as the existing CS-01 entry. Documents the honest divergence from the spec README's pre-run prediction: predicted STABLE, actual FRAGILE. semantic_equivalence cosine scores of 0.69-0.76 on typo-noised variants fall below the 0.80 default threshold. The 0.80 threshold is well-tuned for short factual responses but too strict for long structured design responses. Recording the divergence per CASE-STUDY-METHODOLOGY §3 ("a boring result from a mundane real revision is more philosophically valuable than a dramatic result from a constructed one"; a wrong-but-honest prediction is stronger evidence than a quietly-re-tuned one).

3. Narrative restructure of 02-resolver-arbitration-boundary-shift.md. The intellectually-strong-but-too-careful prior version buried the finding under chronology, methodology justification, and reconstruction disclaimers. Restructured per user critique:
   - Opens with the finding (boundary-allocation effect), not chronology. 4-line lead.
   - Critical delta TABLE added before any responses, surfacing the V1-vs-V2 architectural-layer movement explicitly so a reader sees the asymmetry before reading the full evidence.
   - "Pass/fail evaluators miss this; preserved inspectable evidence does not" moved near the top as the emotional center.
   - "Key observation" section moved ABOVE the Evidence section, telling the reader what to look for before the long quoted outputs.
   - Section renamed Responses -> Evidence (aligns with project identity: preserved comparative material, not chat logs).
   - Bold "Boundary-allocation marker" callouts inline after each full response to surface the in-resolver vs out-of-resolver distinction without making the reader hunt.
   - "What this does NOT support" rewritten as "Scope": more confident, less defensive, same restraint.
   - Phenomenon named ONCE as "boundary-allocation effect" and reused consistently (was drifting across "boundary shift", "architectural pressure", "interpretive structure", "consumer surface").
   - Ending tightened to land on the boundary-allocation takeaway, reproduction details compressed.
   - Chronology + archaeological-retrieval justification + restraint discipline moved to a Methodology appendix at the end.

The substance is preserved end-to-end; only the structure, ordering, and naming consistency changed. The bundled run's empirical findings are integrated honestly throughout.

Suite: 541 passing. Ruff clean.
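A reader who wants to confirm that a downloaded copy of the bundled store matches the SHA256 quoted in this commit can do so with the standard library alone. The sketch below assumes the path and digest given above; `sha256_of` and `matches_expected` are hypothetical helper names, and this check is independent of whatever `falsifyai verify` does internally.

```python
import hashlib
from pathlib import Path

# Digest and path as quoted in the commit message.
EXPECTED_SHA256 = "eba7d89db5f961951bba712ae2ba473e1d74452e515959fdcc90b7964c2f3f7b"
BUNDLE_PATH = Path("docs/case-studies/data/case-study-02.db")


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large stores are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare digests case-insensitively; hex casing varies between tools."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    if BUNDLE_PATH.exists():
        print("match" if matches_expected(BUNDLE_PATH, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{BUNDLE_PATH} not found; run from the repository root")
```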

Rewrites the README around the soul sentence "AI evals are anecdotes unless the evidence survives the run." Reorganizes so a reader sees the tool in action before any philosophy.

Structural changes:
- Hero radically compressed; tagline + short code block + emotional callout
- 5-minute proof moved up; workflow sequence Run → Run → Diff → Inspect → THEN show the YAML spec
- New lifecycle ASCII diagram between proof and core idea
- "The core idea" compressed to 3-bullet pattern
- New "Typical uses" section with 5 concrete use cases
- "What kind of tool is this?" (SBOM / SARIF / OpenTelemetry table) moved from near-top to after case studies; peer comparison is for readers who already understand the utility
- Architecture compressed to plain ASCII (Mermaid removed) + short prose
- "What FalsifyAI is not" preserved but tighter
- Status banner, compliance callout, "Status and roadmap", and license preserved verbatim

Rationale: prior README opened with worldview and ontology; readers had to read three sections before seeing what the tool actually does. New order: utility → mechanism → applications → comparison → philosophy.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Restructures around four sections that progressively reveal the project: operational → mechanism → architecture → institutional. The reader gradually discovers (1) this solves a real problem, (2) the workflow is compelling, (3) the artifact is the key abstraction, (4) this is actually evidence infrastructure.

Section 1: Operational wedge
- Hero (radically shorter; tagline + 3-line code block + one callout)
- "Why this matters": emotional problem statement (NEW); a missed customer-flagged regression and how stochastic drift discards the evidence pass/fail evaluators need a week later
- "Typical uses": 5 concrete applications, no philosophy
- "The 5-minute proof": OUTPUT FIRST. Every snippet captured verbatim from the bundled case-study replay store (real session ids, real verdicts, real perturbation strings). Sequence: baseline run output → candidate run output → diff (exit 5) → inspect (with the actual U+202F-bearing model outputs preserved) → THEN the spec. The "preserved" list lives at the tail of the proof rather than as a separate section (compressed per pushback).

Section 2: Working mental model
- "Core concepts": 3-bullet pattern compressed
- "Case studies": moved UP; now the strongest evidence section
- "CLI reference" with CI integration sub-section
- "What FalsifyAI is not": moved up to scope-stabilize before the worldview section (reader has just seen artifacts; risk of projecting "observability platform" / "governance suite")

Section 3: Architectural worldview (delayed until reader wants it)
- "What kind of tool is this?" (SBOM/SARIF/OpenTelemetry table): category crystallization, only after operational flow lands
- "Architecture" compressed ~50%; ASCII diagram + 2 paragraphs; depth delegated to docs/ARCHITECTURE.md (already exists)
- "Resolver predictability": preserved as differentiator, tighter

Section 4: Institutional / long-term
- Compliance callout
- Status and roadmap (preserved verbatim)
- Further reading (NEW)
- Local development + License

Removed sections (depth lives in dedicated docs):
- "The lifecycle" ASCII (superseded by Section 1 flow + ARCHITECTURE.md data flow)
- "What's in the evidence" H2 (compressed into proof tail; full protocol semantics in EVIDENCE.md)
- "Examples" table (case studies serve this role better; examples/ directory remains discoverable)
- "Writing your own spec" minimal example (5-min proof already shows a YAML spec; minimal spec remains in tests/fixtures/)

Methodological discipline preserved: NO synthesized terminal output. Every code block in the proof is captured verbatim from running falsifyai against docs/case-studies/data/case-study-replays.db. The U+202F invisible-character evidence is the actual model output, not a representative example. The README itself now embodies the project's preservation discipline.

Preserved verbatim: status banner, "Coming next" roadmap framing, compliance callout, version metadata, link references, license.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

ericckzhou · 2026-05-24T19:28:44Z

Superseded by clean release branch PR (avoids ghost-history conflict from v0.4.0 squash merge). Dev work fully preserved on PRs #38, #39, #40.

ericckzhou and others added 8 commits May 24, 2026 07:30

docs(readme): add case study 02 to case studies table

799ba0c


ericckzhou closed this May 24, 2026

ericckzhou mentioned this pull request May 24, 2026

release: docs + CS-02 bundle (post-v0.4.0) #42

Merged

ericckzhou wants to merge 8 commits into main from dev
