release: docs + CS-02 bundle (post-v0.4.0) #41
Closed
ericckzhou wants to merge 8 commits into
Missed in the v0.3.0 release prep (PR-29 added the case study; PR-33 bumped the roadmap entry but not the case studies table). Ships with the next dev → main promotion.
Third and final item in the locked artifact-infrastructure track. Closes
the sequence verify -> export --bundle -> embedded CLI invocation. After
PR-35, the artifact can answer four questions without external bookkeeping:
- what happened (existing case results + verdict)
- how it was evaluated (existing materialized spec + invariants)
- what was exported (existing bundle manifest + bundle_id)
- what exact command produced it (new cli_invocation)
That's procedural provenance — a real capability increase on top of v0.3.0,
even though implementation size is small.
Semantic boundary (load-bearing):
cli_invocation records *what command produced the artifact*, NOT a
guarantee that re-running it will produce identical outputs. Replay
determinism still lives in materialized_hash (preserved perturbation
evidence) and the bundle's bundle_id (preserved manifest + file hashes).
The bundle's auto-generated README carries an explicit disclaimer.
Capture contract (deliberately narrow):
- argv: normalized invocation tokens. argv[0] canonicalized to
'falsifyai' regardless of entry path (entry-point launcher,
python -m, direct script). Subsequent tokens preserved verbatim.
- falsifyai_version: runtime package version at capture time.
Explicitly NOT captured by design:
- environment variables (secret-leakage surface)
- API keys (no auth-bearing CLI flag today; future flags MUST redact)
- current working directory (specs are referenced by path in argv)
- hostname / username / machine identifiers (operator identity belongs
at the commit / export layer, not the artifact)
- shell history / pre-shell-expansion argv (unavailable by construction)
- file contents (spec YAML lives in MaterializedSpec already)
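The capture contract above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `capture_invocation` helper, the field types, and the example tokens are assumptions; only the `CliInvocation` name, the two fields, and the `argv[0]` canonicalization come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CliInvocation:
    """Illustrative stand-in for the persisted field; the real class may differ."""

    argv: tuple[str, ...]
    falsifyai_version: str


def capture_invocation(raw_argv: list[str], version: str) -> CliInvocation:
    """Hypothetical helper: normalize argv[0], keep every later token verbatim.

    Nothing else is read here: no environment variables, no working
    directory, no hostname or username.
    """
    # argv[0] differs by entry path (launcher script, python -m, direct
    # script), so it is replaced with one canonical program name.
    return CliInvocation(
        argv=("falsifyai", *raw_argv[1:]),
        falsifyai_version=version,
    )


# Example tokens are made up for illustration.
inv = capture_invocation(["/opt/venv/bin/falsifyai", "run", "spec.yaml"], "0.4.0")
```

Keeping the function a pure mapping from `(argv, version)` to a frozen value is one way to make the "explicitly NOT captured" list easy to audit: anything not passed in cannot leak into the artifact.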
Capture point:
- cmd_run only (single capture point at function entry, before any
subcommand-internal mutation)
- Read-only consumer surfaces (replay, inspect, history, diff, verify,
export) NEVER stamp invocation (preservation discipline)
Backward compatibility:
- Default value is None (explicit absence, not a sentinel)
- Deserializer treats missing JSON key as None
- Pre-PR-35 artifacts load cleanly; verify does NOT gain a 9th check
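A minimal sketch of the backward-compatibility rule, assuming the artifact is stored as a JSON object with an optional `cli_invocation` key. The `load_cli_invocation` function and the plain-dict return shape are hypothetical; the real deserializer presumably builds a dataclass.

```python
from __future__ import annotations

from typing import Any, Optional


def load_cli_invocation(payload: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Hypothetical loader: a missing or null key means 'no invocation recorded'."""
    raw = payload.get("cli_invocation")
    if raw is None:
        # Older artifacts have no such key; absence is represented as None,
        # not as an empty or placeholder invocation.
        return None
    return {
        "argv": list(raw["argv"]),
        "falsifyai_version": raw["falsifyai_version"],
    }


old_artifact = {"verdict": "STABLE"}  # illustrative pre-PR-35 shape
new_artifact = {
    "verdict": "STABLE",
    "cli_invocation": {"argv": ["falsifyai", "run", "spec.yaml"], "falsifyai_version": "0.4.0"},
}
```

Because absence maps to `None` rather than an error, a presence check in `verify` is unnecessary, which matches the decision not to add a ninth check.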
Bundle wire-up:
- PR-32 already added a conditional render path; PR-35 makes it light up
- Rendering improved: shlex.join argv + falsifyai_version + the
determinism-vs-provenance disclaimer
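For illustration, the rendering described above might look like the sketch below. `shlex.join` is the standard-library call named in the commit message; the `render_invocation` function, the label text, and the disclaimer wording are assumptions, not the bundle README's actual output.

```python
import shlex
from typing import Sequence


def render_invocation(argv: Sequence[str], falsifyai_version: str) -> str:
    """Hypothetical renderer for the bundle README's invocation section."""
    lines = [
        # shlex.join quotes tokens containing spaces or shell metacharacters,
        # so the rendered command can be pasted back into a POSIX shell.
        f"Command: {shlex.join(argv)}",
        f"falsifyai version: {falsifyai_version}",
        "Note: this records what command produced the artifact. "
        "It is not a guarantee that re-running it produces identical outputs.",
    ]
    return "\n".join(lines)


text = render_invocation(["falsifyai", "run", "my spec.yaml"], "0.4.0")
```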
Architectural assertion (Tier-4 discipline):
- test_cli_invocation_model_does_not_import_resolver: importing
CliInvocation alone never loads verdict.resolver
- test_replay_models_module_does_not_import_resolver: broader assertion
that the preservation layer stays independent of the interpretation
layer (future field additions can't accidentally cross the boundary)
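One way to write this kind of architectural assertion is to import the module in a fresh interpreter and inspect `sys.modules`. The sketch below is an assumption about the technique, not the project's test code; the `import_loads_module` helper is hypothetical, and the dotted module paths in the final comment are guesses based on the names above.

```python
import subprocess
import sys


def import_loads_module(target: str, forbidden: str) -> bool:
    """Return True if importing `target` in a fresh interpreter loads `forbidden`."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({target!r})\n"
        f"print({forbidden!r} in sys.modules)\n"
    )
    # A subprocess is used so that modules already imported by the test
    # runner cannot mask or fake the result.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


# In the project this might read (module paths assumed, not confirmed):
#   assert not import_loads_module("falsifyai.replay.models",
#                                  "falsifyai.verdict.resolver")
```

Running the check in a subprocess trades some speed for isolation; an in-process check would depend on import order across the whole test session.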
Folded co-edits (per plan §G):
- README compliance-routing callout above 'Status and roadmap' pointing
procurement readers at docs/COMPLIANCE.md (candidate #5 fold)
- One-line note in 'What's in the evidence?' section about the new field
- Architectural-test-pattern-as-template observation lives in the
walkthrough rather than a separate doc file (candidate #6 fold)
Docs:
- CHANGELOG [Unreleased]: PR-35 entry + capture contract summary
- docs/EVIDENCE.md §6: mark artifact-infrastructure track 3-of-3 shipped;
note signing is the remaining locked deferral
- docs/COMPLIANCE.md §2 Annex IV mapping: new 'Command that produced
this evidence' row mapping to cli_invocation.argv +
cli_invocation.falsifyai_version
Tests: +24 unit (model+capture+architectural) + 5 integration = 29 new.
Suite: 519 -> 541 passing. Ruff + format clean.
Out of scope (explicit non-goals):
- Cryptographic signing (signature_slots remain reserved)
- Bundle import / round-trip
- --json output mode
- Verify check for invocation presence (would break pre-PR-35 artifacts)
- Capture from non-cmd_run commands
- Capture from a hypothetical programmatic falsifyai.runtime.run() API
Version: this is a MINOR bump (new feature, new public dataclass, new
persisted schema field). Release-prep PR after merge should target v0.4.0
('Artifact-infrastructure track complete'), not v0.3.1.
Three stale references found by audit after the artifact-infrastructure track closed with PR-35:
- README.md 'Coming next' section: was 'persisted CLI-invocation field (next)'; the track is now complete. Replaced with 'track complete' framing + the 'driven by external pressure' next-pull dynamic (regulatory signing, second case study, falsifyai import, etc.).
- docs/case-studies/02 opening: claimed specs were 'planned follow-up work', but PR-30 actually shipped them. Updated to link the existing specs/ directory.
- docs/case-studies/README.md index row for CS-02: said 'CLI formalization forthcoming'; specs already exist. Updated tools-used label.
Release-state metadata (status banner, current-release entry, RELEASE.md illustrative examples) intentionally NOT changed here; those bump in the v0.4.0 release-prep cycle, not on dev between releases.
Artifact-infrastructure track complete (3 of 3 locked items shipped).
v0.4.0 adds persisted cli_invocation on ReplayArtifact via PR-35.
Version bumps (sources of truth, all aligned):
pyproject.toml 0.3.0 -> 0.4.0
falsifyai/__init__.py 0.3.0 -> 0.4.0
tests/unit/test_version.py rename + bump asserted value
CITATION.cff version 0.3.0 -> 0.4.0
uv.lock editable-install entry 0.3.0 -> 0.4.0
(synced in same commit, not as a trailing fix)
Release-state metadata (drift-prone files per RELEASE.md §7):
README.md status banner - new wave: artifact-infrastructure
track complete; cli_invocation +
four-question artifact summary
README.md "Status and roadmap" - 0.4.0 (current release) section
promoted; 0.3.0 demoted to compact
historical entry
README.md "Coming next" - cleaned up the leftover diff-sharpening
"Shipped in v0.3.0" bullet; reduced to
a single "driven by external pressure"
paragraph
CHANGELOG.md - [Unreleased] placeholder preserved on
top; PR-35 entries promoted into new
[0.4.0] - 2026-05-24 section;
[0.4.0] link reference added
docs/RELEASE.md - MINOR example (0.2.0->0.3.0 ->
0.3.0->0.4.0); tag commands v0.4.0;
GitHub Release example v0.4.0 +
thematic name; post-release dev-marker
example 0.4.0->0.5.0
docs/case-studies/01-invisible-character-substitution.md
- pip install command bumped to 0.4.0
(2 occurrences)
Bonus audit fixes - stale 'vNEXT' placeholders left over from v0.3.0
release-prep (caught by the comprehensive grep this cycle):
docs/COMPLIANCE.md:138, 165 - 'shipped in vNEXT' -> 'shipped in v0.3.0'
(these always meant v0.3.0; vNEXT was the
pre-release placeholder I forgot to bump)
docs/COMPLIANCE.md:12 - 'As of v0.3.0...' -> 'As of v0.4.0...';
also names the new procedural-provenance
capability that v0.4.0 adds
docs/COMPLIANCE.md:56 - cli_invocation row: 'since v0.3.0+PR-35' ->
'since v0.4.0' (since the PR ships in 0.4.0,
not 0.3.0)
docs/EVIDENCE.md:328 - '3-of-3 shipped as of v0.3.0 + PR-35' ->
'as of v0.4.0' (cleaner once PR-35 has a
release version of its own)
Verified: 510 + 31 = 541 tests pass. Ruff + format clean.
Preserved as intentional historical references (NOT changed):
CHANGELOG.md [0.3.0] entry body - historical, immutable
CHANGELOG.md [0.3.0] link reference - always present
README byte-identical-to-v0.2.0 fixture claim - load-bearing fixture
README.md "0.3.0 - Artifact-infrastructure track
(2 of 3)" roadmap entry - historical
tests/integration/test_diff_end_to_end.py
v0.2.0_baseline.txt fixture refs - load-bearing fixture
tests/unit/test_bundle_writer.py
_FIXED_VERSION = "0.3.0.test" - test sentinel
tests/unit/test_cli_invocation_model.py
falsifyai_version="0.3.0" test-data values - test fixtures, not
release-state metadata
.github/workflows/publish.yml GITHUB_REF
format comment - illustrative
What v0.4.0 ships (since v0.3.0):
PR-35 - feat(replay): persist cli_invocation on ReplayArtifact
Post-PR-35 docs audit - update stale 'next' / 'forthcoming' framing
…topics
GitHub repo description and topics were updated post-v0.4.0 to reflect the artifact-infrastructure positioning. Keyword fields in pyproject.toml and CITATION.cff were left behind; this commit aligns them so PyPI search and the CITATION record surface the same terms as the repo.
Removed:
- 'robustness' (pyproject), 'adversarial-robustness' (CITATION): perturbation families are deliberately non-adversarial; the term was discovery-misleading.
Added:
- pyproject: ai-evaluation, llm-evaluation, ai-safety, reliability-testing, evaluation-validity, replayable-evidence, evidence-infrastructure, provenance, model-migration, eu-ai-act
- CITATION: LLM-evaluation, model-migration, regression-testing, differential-testing, evaluation-validity, provenance, EU-AI-Act
GitHub topic edits applied via gh repo edit (already live).
Three things ship together:
1. Bundled ReplayStore at docs/case-studies/data/case-study-02.db
(96 KB, SHA256 eba7d89db5f961951bba712ae2ba473e1d74452e515959fdcc90b7964c2f3f7b).
Two sessions produced by running the v1 and v2 specs against Claude
Sonnet 4.6 via the Anthropic provider on 2026-05-24. Both sessions
verify cleanly (16/16 integrity checks); diff reports 1 unchanged
(exit 0). The bundled run confirms the central CS-02 claim
(boundary-allocation effect) at the verdict layer.
2. Extended data/README.md with a CS-02 bundle provenance section
following the same shape as the existing CS-01 entry. Documents
the honest divergence from the spec README's pre-run prediction:
predicted STABLE, actual FRAGILE — semantic_equivalence cosine
scores 0.69-0.76 on typo-noised variants fall below the 0.80
default threshold. The 0.80 threshold is well-tuned for short
factual responses but too strict for long structured design
responses. Recording the divergence per CASE-STUDY-METHODOLOGY §3
("a boring result from a mundane real revision is more
philosophically valuable than a dramatic result from a constructed
one"; a wrong-but-honest prediction is stronger evidence than a
quietly-re-tuned one).
3. Narrative restructure of 02-resolver-arbitration-boundary-shift.md.
The intellectually-strong-but-too-careful prior version buried the
finding under chronology, methodology justification, and
reconstruction disclaimers. Restructured per user critique:
- Opens with the finding (boundary-allocation effect), not
chronology. 4-line lead.
- Critical delta TABLE added before any responses, surfacing
the V1-vs-V2 architectural-layer movement explicitly so a
reader sees the asymmetry before reading the full evidence.
- "Pass/fail evaluators miss this; preserved inspectable
evidence does not" moved near the top as the emotional center.
- "Key observation" section moved ABOVE the Evidence section,
telling the reader what to look for before the long quoted
outputs.
- Section renamed Responses -> Evidence (aligns with project
identity: preserved comparative material, not chat logs).
- Bold "Boundary-allocation marker" callouts inline after each
full response to surface the in-resolver vs out-of-resolver
distinction without making the reader hunt.
- "What this does NOT support" rewritten as "Scope" — more
confident, less defensive, same restraint.
- Phenomenon named ONCE as "boundary-allocation effect" and
reused consistently (was drifting across "boundary shift",
"architectural pressure", "interpretive structure",
"consumer surface").
- Ending tightened to land on the boundary-allocation takeaway,
reproduction details compressed.
- Chronology + archaeological-retrieval justification + restraint
discipline moved to a Methodology appendix at the end.
The substance is preserved end-to-end; only the structure, ordering,
and naming consistency changed. The bundled run's empirical findings
are integrated honestly throughout.
Suite: 541 passing. Ruff clean.
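A reader who wants to confirm they have the same bundled store can compare its SHA256 against the digest quoted in item 1. The sketch below uses only the standard library; the `sha256_of` and `matches_published_digest` helpers are hypothetical, and the path in the final comment assumes the file sits where item 1 says.

```python
import hashlib
from pathlib import Path

# Digest quoted in the commit message for case-study-02.db.
EXPECTED_SHA256 = "eba7d89db5f961951bba712ae2ba473e1d74452e515959fdcc90b7964c2f3f7b"


def sha256_of(path: "str | Path", chunk_size: int = 65536) -> str:
    """Stream the file through SHA256 so large files are not read at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: "str | Path", expected: str = EXPECTED_SHA256) -> bool:
    """Compare case-insensitively, since hex digests are sometimes upper-cased."""
    return sha256_of(path) == expected.lower()


# Usage, assuming the repository layout described in item 1:
#   matches_published_digest("docs/case-studies/data/case-study-02.db")
```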
Rewrites the README around the soul sentence "AI evals are anecdotes unless the evidence survives the run." Reorganizes so a reader sees the tool in action before any philosophy.
Structural changes:
- Hero radically compressed; tagline + short code block + emotional callout
- 5-minute proof moved up; workflow sequence Run → Run → Diff → Inspect → THEN show the YAML spec
- New lifecycle ASCII diagram between proof and core idea
- "The core idea" compressed to 3-bullet pattern
- New "Typical uses" section with 5 concrete use cases
- "What kind of tool is this?" (SBOM / SARIF / OpenTelemetry table) moved from near-top to after case studies; peer comparison is for readers who already understand the utility
- Architecture compressed to plain ASCII (Mermaid removed) + short prose
- "What FalsifyAI is not" preserved but tighter
- Status banner, compliance callout, "Status and roadmap", and license preserved verbatim
Rationale: prior README opened with worldview and ontology; readers had to read three sections before seeing what the tool actually does. New order: utility → mechanism → applications → comparison → philosophy.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Restructures around four sections that progressively reveal the project: operational → mechanism → architecture → institutional. The reader gradually discovers (1) this solves a real problem, (2) the workflow is compelling, (3) the artifact is the key abstraction, (4) this is actually evidence infrastructure.
Section 1: Operational wedge
- Hero (radically shorter; tagline + 3-line code block + one callout)
- "Why this matters": emotional problem statement (NEW); a missed customer-flagged regression and how stochastic drift discards the evidence pass/fail evaluators need a week later
- "Typical uses": 5 concrete applications, no philosophy
- "The 5-minute proof": OUTPUT FIRST. Every snippet captured verbatim from the bundled case-study replay store (real session ids, real verdicts, real perturbation strings). Sequence: baseline run output → candidate run output → diff (exit 5) → inspect (with the actual U+202F-bearing model outputs preserved) → THEN the spec. The "preserved" list lives at the tail of the proof rather than as a separate section (compressed per pushback).
Section 2: Working mental model
- "Core concepts": 3-bullet pattern compressed
- "Case studies": moved UP; now the strongest evidence section
- "CLI reference" with CI integration sub-section
- "What FalsifyAI is not": moved up to scope-stabilize before the worldview section (reader has just seen artifacts; risk of projecting "observability platform" / "governance suite")
Section 3: Architectural worldview (delayed until reader wants it)
- "What kind of tool is this?" (SBOM/SARIF/OpenTelemetry table): category crystallization, only after operational flow lands
- "Architecture" compressed ~50%; ASCII diagram + 2 paragraphs; depth delegated to docs/ARCHITECTURE.md (already exists)
- "Resolver predictability": preserved as differentiator, tighter
Section 4: Institutional / long-term
- Compliance callout
- Status and roadmap (preserved verbatim)
- Further reading (NEW)
- Local development + License
Removed sections (depth lives in dedicated docs):
- "The lifecycle" ASCII (superseded by Section 1 flow + ARCHITECTURE.md data flow)
- "What's in the evidence" H2 (compressed into proof tail; full protocol semantics in EVIDENCE.md)
- "Examples" table (case studies serve this role better; examples/ directory remains discoverable)
- "Writing your own spec" minimal example (5-min proof already shows a YAML spec; minimal spec remains in tests/fixtures/)
Methodological discipline preserved: NO synthesized terminal output. Every code block in the proof is captured verbatim from running falsifyai against docs/case-studies/data/case-study-replays.db. The U+202F invisible-character evidence is the actual model output, not a representative example. The README itself now embodies the project's preservation discipline.
Preserved verbatim: status banner, "Coming next" roadmap framing, compliance callout, version metadata, link references, license.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Post-v0.4.0 documentation and evidence work accumulated on `dev`. No code changes, no version bump — purely docs, metadata, and a new case-study replay bundle.
What ships
Diff footprint
Verification
No version bump
This is docs-only. Version remains 0.4.0.
Post-merge
Per CLAUDE.md branch workflow, dev will be reset to match main after this squash-merges, to prevent ghost-history conflicts on the next dev→main promotion.
🤖 Generated with Claude Code