release: docs + CS-02 bundle (post-v0.4.0) #41

Closed
ericckzhou wants to merge 8 commits into main from dev

Conversation

@ericckzhou
Owner

Summary

Post-v0.4.0 documentation and evidence work accumulated on `dev`. No code changes, no version bump — purely docs, metadata, and a new case-study replay bundle.

What ships

| PR | Change |
| --- | --- |
| (direct) | `chore(meta)`: sync `pyproject.toml` + `CITATION.cff` keywords with GitHub repository topics |
| #38 | `docs(case-study)`: bundle CS-02 ReplayStore (96 KB, 2 sessions, SHA256 verified) + narrative restructure |
| #39 | `docs(readme)`: initial utility-first restructure (workflow before philosophy) |
| #40 | `docs(readme)`: utility-first v2, guided realization structure with verbatim-captured terminal output |

Diff footprint

  • `README.md` — substantial restructure (utility-first, output-first 5-min proof captured verbatim from bundled case-study replay store)
  • `docs/case-studies/02-resolver-arbitration-boundary-shift.md` — restructured around boundary-allocation effect
  • `docs/case-studies/data/case-study-02.db` — NEW bundled replay artifact (96 KB)
  • `docs/case-studies/data/README.md` — extended with CS-02 provenance section
  • `pyproject.toml` + `CITATION.cff` — keyword sync with GitHub topics

Verification

  • `falsifyai verify --all` on `docs/case-studies/data/case-study-replays.db`: 64/64 checks passing
  • `falsifyai verify --all` on `docs/case-studies/data/case-study-02.db`: 16/16 checks passing
  • All README terminal snippets captured verbatim from bundled stores (no synthesized output)
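Independently of `falsifyai verify`, the SHA256 digest quoted for `case-study-02.db` in the CS-02 commit message can be checked with the standard library alone. The following is a minimal sketch, not part of the project: the helper names `sha256_of` and `matches_digest` are hypothetical, and the only value taken from this PR is the digest string.

```python
import hashlib
from pathlib import Path

# Digest quoted in the CS-02 commit message for docs/case-studies/data/case-study-02.db.
EXPECTED_SHA256 = "eba7d89db5f961951bba712ae2ba473e1d74452e515959fdcc90b7964c2f3f7b"


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `matches_digest(Path("docs/case-studies/data/case-study-02.db"), EXPECTED_SHA256)` from a checkout would be one way to confirm that the bundled file is the one described here.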

No version bump

This is docs-only. Version remains 0.4.0.

Post-merge

Per CLAUDE.md branch workflow, dev will be reset to match main after this squash-merges, to prevent ghost-history conflicts on the next dev→main promotion.

🤖 Generated with Claude Code

ericckzhou and others added 8 commits May 24, 2026 07:30
Missed in the v0.3.0 release prep (PR-29 added the case study; PR-33
bumped the roadmap entry but not the case studies table). Ships with
the next dev → main promotion.

Third and final item in the locked artifact-infrastructure track. Closes
the sequence verify -> export --bundle -> embedded CLI invocation. After
PR-35, the artifact can answer four questions without external bookkeeping:

  - what happened (existing case results + verdict)
  - how it was evaluated (existing materialized spec + invariants)
  - what was exported (existing bundle manifest + bundle_id)
  - what exact command produced it (new cli_invocation)

That's procedural provenance — a real capability increase on top of v0.3.0,
even though implementation size is small.

Semantic boundary (load-bearing):
cli_invocation records *what command produced the artifact*, NOT a
guarantee that re-running it will produce identical outputs. Replay
determinism still lives in materialized_hash (preserved perturbation
evidence) and the bundle's bundle_id (preserved manifest + file hashes).
The bundle's auto-generated README carries an explicit disclaimer.

Capture contract (deliberately narrow):
  - argv: normalized invocation tokens. argv[0] canonicalized to
    'falsifyai' regardless of entry path (entry-point launcher,
    python -m, direct script). Subsequent tokens preserved verbatim.
  - falsifyai_version: runtime package version at capture time.
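The capture contract above can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: `CliInvocation`, `argv`, and `falsifyai_version` are names from this commit message, while the function `capture_invocation` and the choice of a frozen dataclass with a tuple field are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CliInvocation:
    """Procedural provenance only: which command produced the artifact."""

    argv: tuple[str, ...]
    falsifyai_version: str


def capture_invocation(raw_argv: list[str], version: str) -> CliInvocation:
    """Normalize argv[0] to 'falsifyai' and keep the remaining tokens verbatim.

    Hypothetical helper: nothing else (environment, cwd, hostname) is read.
    """
    normalized = ("falsifyai", *raw_argv[1:])
    return CliInvocation(argv=normalized, falsifyai_version=version)
```

Under this sketch, an entry-point launcher path and a `python -m` module path both collapse to the same first token, so two artifacts produced by the same command compare equal on `argv` regardless of how the CLI was started.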

Explicitly NOT captured by design:
  - environment variables (secret-leakage surface)
  - API keys (no auth-bearing CLI flag today; future flags MUST redact)
  - current working directory (specs are referenced by path in argv)
  - hostname / username / machine identifiers (operator identity belongs
    at the commit / export layer, not the artifact)
  - shell history / pre-shell-expansion argv (unavailable by construction)
  - file contents (spec YAML lives in MaterializedSpec already)

Capture point:
  - cmd_run only (single capture point at function entry, before any
    subcommand-internal mutation)
  - Read-only consumer surfaces (replay, inspect, history, diff, verify,
    export) NEVER stamp invocation (preservation discipline)

Backward compatibility:
  - Default value is None (explicit absence, not a sentinel)
  - Deserializer treats missing JSON key as None
  - Pre-PR-35 artifacts load cleanly; verify does NOT gain a 9th check
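The backward-compatibility rule (missing key means None) might look like this in a deserializer. A minimal sketch only: the function name `cli_invocation_from_json` and the plain-dict return type are assumptions, and the surrounding artifact layout is not taken from the project.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def cli_invocation_from_json(payload: str) -> Optional[dict[str, Any]]:
    """Read the optional cli_invocation field from a serialized artifact.

    A missing key and an explicit JSON null are both treated as None,
    so artifacts written before the field existed still load.
    """
    data = json.loads(payload)
    raw = data.get("cli_invocation")
    if raw is None:
        return None
    return {
        "argv": list(raw["argv"]),
        "falsifyai_version": raw["falsifyai_version"],
    }
```

Using None rather than an empty placeholder keeps "this artifact predates the field" distinguishable from "this artifact recorded an empty command".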

Bundle wire-up:
  - PR-32 already added a conditional render path; PR-35 makes it light up
  - Rendering improved: shlex.join argv + falsifyai_version + the
    determinism-vs-provenance disclaimer
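For illustration, rendering the recorded command with `shlex.join` could look like the sketch below. The function name `render_invocation` and the exact wording of the rendered lines are hypothetical; only the use of `shlex.join`, the version field, and the presence of a disclaimer come from this commit message.

```python
import shlex
from typing import Sequence


def render_invocation(argv: Sequence[str], falsifyai_version: str) -> str:
    """Render the recorded command as copy-pasteable shell text plus a disclaimer."""
    lines = [
        f"Command: {shlex.join(argv)}",
        f"falsifyai version: {falsifyai_version}",
        "Note: this records which command produced the artifact; "
        "it does not guarantee that re-running it yields identical outputs.",
    ]
    return "\n".join(lines)
```

`shlex.join` quotes tokens that contain spaces or shell metacharacters, so a spec path such as `my spec.yaml` is rendered as `'my spec.yaml'` rather than as two separate words.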

Architectural assertion (Tier-4 discipline):
  - test_cli_invocation_model_does_not_import_resolver: importing
    CliInvocation alone never loads verdict.resolver
  - test_replay_models_module_does_not_import_resolver: broader assertion
    that the preservation layer stays independent of the interpretation
    layer (future field additions can't accidentally cross the boundary)
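One way to write an import-boundary assertion of this kind is to import the target in a fresh interpreter and inspect `sys.modules`. The sketch below is an assumption about the pattern, not the project's test code; the helper name `import_loads_module` is hypothetical, and the usage note reuses the module names only as the commit message describes them.

```python
import subprocess
import sys


def import_loads_module(target: str, forbidden: str) -> bool:
    """Return True if importing `target` in a clean interpreter also loads `forbidden`."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({target!r})\n"
        f"print({forbidden!r} in sys.modules)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Only the last line is ours; earlier lines could come from import side effects.
    return result.stdout.strip().splitlines()[-1] == "True"
```

A test in this style would assert that `import_loads_module(<replay models module>, <resolver module>)` is False. Running the import in a subprocess matters because the test runner's own `sys.modules` has usually loaded the forbidden module already.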

Folded co-edits (per plan §G):
  - README compliance-routing callout above 'Status and roadmap' pointing
    procurement readers at docs/COMPLIANCE.md (candidate #5 fold)
  - One-line note in 'What's in the evidence?' section about the new field
  - Architectural-test-pattern-as-template observation lives in the
    walkthrough rather than a separate doc file (candidate #6 fold)

Docs:
  - CHANGELOG [Unreleased]: PR-35 entry + capture contract summary
  - docs/EVIDENCE.md §6: mark artifact-infrastructure track 3-of-3 shipped;
    note signing is the remaining locked deferral
  - docs/COMPLIANCE.md §2 Annex IV mapping: new 'Command that produced
    this evidence' row mapping to cli_invocation.argv +
    cli_invocation.falsifyai_version

Tests: +24 unit (model+capture+architectural) + 5 integration = 29 new.
Suite: 519 -> 541 passing. Ruff + format clean.

Out of scope (explicit non-goals):
  - Cryptographic signing (signature_slots remain reserved)
  - Bundle import / round-trip
  - --json output mode
  - Verify check for invocation presence (would break pre-PR-35 artifacts)
  - Capture from non-cmd_run commands
  - Capture from a hypothetical programmatic falsifyai.runtime.run() API

Version: this is a MINOR bump (new feature, new public dataclass, new
persisted schema field). Release-prep PR after merge should target v0.4.0
('Artifact-infrastructure track complete'), not v0.3.1.

Three stale references found by audit after the artifact-infrastructure
track closed with PR-35:

- README.md 'Coming next' section — was 'persisted CLI-invocation field
  (next)'; the track is now complete. Replaced with 'track complete'
  framing + the 'driven by external pressure' next-pull dynamic
  (regulatory signing, second case study, falsifyai import, etc.).

- docs/case-studies/02 opening — claimed specs were 'planned follow-up
  work', but PR-30 actually shipped them. Updated to link the existing
  specs/ directory.

- docs/case-studies/README.md index row for CS-02 — said 'CLI formalization
  forthcoming'; specs already exist. Updated tools-used label.

Release-state metadata (status banner, current-release entry, RELEASE.md
illustrative examples) intentionally NOT changed here — those bump in
the v0.4.0 release-prep cycle, not on dev between releases.

Artifact-infrastructure track complete (3 of 3 locked items shipped).
v0.4.0 adds persisted cli_invocation on ReplayArtifact via PR-35.

Version bumps (sources of truth, all aligned):
  pyproject.toml             0.3.0 -> 0.4.0
  falsifyai/__init__.py      0.3.0 -> 0.4.0
  tests/unit/test_version.py rename + bump asserted value
  CITATION.cff               version 0.3.0 -> 0.4.0
  uv.lock                    editable-install entry 0.3.0 -> 0.4.0
                             (synced in same commit, not as a trailing fix)
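A check that the version sources stay aligned could be sketched as below. The regular expressions are assumptions about how each file spells its version (they are not read from the repository), and `extract_versions` / `misaligned` are hypothetical helper names.

```python
from __future__ import annotations

import re

# Assumed line shapes for each source of truth; adjust to the real files.
VERSION_PATTERNS = {
    "pyproject.toml": r'^version\s*=\s*"([^"]+)"',
    "falsifyai/__init__.py": r'^__version__\s*=\s*"([^"]+)"',
    "CITATION.cff": r'^version:\s*"?([^"\s]+)"?',
}


def extract_versions(files: dict[str, str]) -> dict[str, str]:
    """Map each file name to the version string found in its text."""
    found = {}
    for name, text in files.items():
        match = re.search(VERSION_PATTERNS[name], text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"no version found in {name}")
        found[name] = match.group(1)
    return found


def misaligned(files: dict[str, str], expected: str) -> list[str]:
    """Return the names of files whose version differs from `expected`."""
    versions = extract_versions(files)
    return sorted(name for name, value in versions.items() if value != expected)
```

The `^version:` anchor is deliberate: with `re.MULTILINE` it matches only at the start of a line, so a `cff-version:` line in a citation file is not mistaken for the release version.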

Release-state metadata (drift-prone files per RELEASE.md §7):
  README.md status banner          - new wave: artifact-infrastructure
                                     track complete; cli_invocation +
                                     four-question artifact summary
  README.md "Status and roadmap"   - 0.4.0 (current release) section
                                     promoted; 0.3.0 demoted to compact
                                     historical entry
  README.md "Coming next"          - cleaned up the leftover diff-sharpening
                                     "Shipped in v0.3.0" bullet; reduced to
                                     a single "driven by external pressure"
                                     paragraph
  CHANGELOG.md                     - [Unreleased] placeholder preserved on
                                     top; PR-35 entries promoted into new
                                     [0.4.0] - 2026-05-24 section;
                                     [0.4.0] link reference added
  docs/RELEASE.md                  - MINOR example (0.2.0->0.3.0 ->
                                     0.3.0->0.4.0); tag commands v0.4.0;
                                     GitHub Release example v0.4.0 +
                                     thematic name; post-release dev-marker
                                     example 0.4.0->0.5.0
  docs/case-studies/01-invisible-character-substitution.md
                                   - pip install command bumped to 0.4.0
                                     (2 occurrences)

Bonus audit fixes - stale 'vNEXT' placeholders left over from v0.3.0
release-prep (caught by the comprehensive grep this cycle):
  docs/COMPLIANCE.md:138, 165 - 'shipped in vNEXT' -> 'shipped in v0.3.0'
                                (these always meant v0.3.0; vNEXT was the
                                pre-release placeholder I forgot to bump)
  docs/COMPLIANCE.md:12       - 'As of v0.3.0...' -> 'As of v0.4.0...';
                                also names the new procedural-provenance
                                capability that v0.4.0 adds
  docs/COMPLIANCE.md:56       - cli_invocation row: 'since v0.3.0+PR-35' ->
                                'since v0.4.0' (since the PR ships in 0.4.0,
                                not 0.3.0)
  docs/EVIDENCE.md:328        - '3-of-3 shipped as of v0.3.0 + PR-35' ->
                                'as of v0.4.0' (cleaner once PR-35 has a
                                release version of its own)
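The kind of grep that surfaces leftover placeholders like these can be approximated in a few lines. A minimal sketch, assuming Markdown files under a docs root; the function name `find_placeholders` and its defaults are hypothetical.

```python
from __future__ import annotations

import re
from pathlib import Path


def find_placeholders(
    root: Path,
    pattern: str = r"\bvNEXT\b",
    suffixes: tuple[str, ...] = (".md",),
) -> list[tuple[str, int, str]]:
    """Return (relative path, 1-based line number, stripped line) for each hit."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Reporting file and line number in the same `path:line` shape used above makes each hit easy to map back to a one-line fix.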

Verified: 510 + 31 = 541 tests pass. Ruff + format clean.

Preserved as intentional historical references (NOT changed):
  CHANGELOG.md [0.3.0] entry body                  - historical, immutable
  CHANGELOG.md [0.3.0] link reference              - always present
  README byte-identical-to-v0.2.0 fixture claim    - load-bearing fixture
  README.md "0.3.0 - Artifact-infrastructure track
    (2 of 3)" roadmap entry                        - historical
  tests/integration/test_diff_end_to_end.py
    v0.2.0_baseline.txt fixture refs               - load-bearing fixture
  tests/unit/test_bundle_writer.py
    _FIXED_VERSION = "0.3.0.test"                  - test sentinel
  tests/unit/test_cli_invocation_model.py
    falsifyai_version="0.3.0" test-data values     - test fixtures, not
                                                     release-state metadata
  .github/workflows/publish.yml GITHUB_REF
    format comment                                 - illustrative

What v0.4.0 ships (since v0.3.0):
  PR-35 - feat(replay): persist cli_invocation on ReplayArtifact
  Post-PR-35 docs audit - update stale 'next' / 'forthcoming' framing

…topics

GitHub repo description and topics were updated post-v0.4.0 to reflect
the artifact-infrastructure positioning. Keyword fields in pyproject.toml
and CITATION.cff were left behind; this commit aligns them so PyPI search
and the CITATION record surface the same terms as the repo.

Removed: 'robustness' (pyproject), 'adversarial-robustness' (CITATION) —
perturbation families are deliberately non-adversarial; the term was
discovery-misleading.

Added: ai-evaluation, llm-evaluation, ai-safety, reliability-testing,
evaluation-validity, replayable-evidence, evidence-infrastructure,
provenance, model-migration, eu-ai-act (pyproject); LLM-evaluation,
model-migration, regression-testing, differential-testing,
evaluation-validity, provenance, EU-AI-Act (CITATION).
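Drift between repository topics and a package keyword list can be spotted with a case-insensitive set comparison. This is only a sketch: `keyword_drift` is a hypothetical helper, and the commit does not claim that the keyword lists must equal the topic list exactly.

```python
from __future__ import annotations

from typing import Iterable


def keyword_drift(topics: Iterable[str], keywords: Iterable[str]) -> dict[str, list[str]]:
    """Compare two term lists case-insensitively.

    'missing' holds topics absent from the keywords;
    'extra' holds keywords that are not topics.
    """
    topic_set = {t.strip().lower() for t in topics}
    keyword_set = {k.strip().lower() for k in keywords}
    return {
        "missing": sorted(topic_set - keyword_set),
        "extra": sorted(keyword_set - topic_set),
    }
```

Lowercasing before comparison means spellings such as `eu-ai-act` and `EU-AI-Act` count as the same term.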

GitHub topic edits applied via gh repo edit (already live).

Three things ship together:

1. Bundled ReplayStore at docs/case-studies/data/case-study-02.db
   (96 KB, SHA256 eba7d89db5f961951bba712ae2ba473e1d74452e515959fdcc90b7964c2f3f7b).
   Two sessions produced by running the v1 and v2 specs against Claude
   Sonnet 4.6 via the Anthropic provider on 2026-05-24. Both sessions
   verify cleanly (16/16 integrity checks); diff reports 1 unchanged,
   exit 0. The bundled run confirms the central CS-02 claim
   (boundary-allocation effect) at the verdict layer.

2. Extended data/README.md with a CS-02 bundle provenance section
   following the same shape as the existing CS-01 entry. Documents
   the honest divergence from the spec README's pre-run prediction:
   predicted STABLE, actual FRAGILE — semantic_equivalence cosine
   scores 0.69-0.76 on typo-noised variants fall below the 0.80
   default threshold. The 0.80 threshold is well-tuned for short
   factual responses but too strict for long structured design
   responses. The divergence is recorded per CASE-STUDY-METHODOLOGY §3
   ("a boring result from a mundane real revision is more
   philosophically valuable than a dramatic result from a constructed
   one"; a wrong-but-honest prediction is stronger evidence than a
   quietly-re-tuned one).

3. Narrative restructure of 02-resolver-arbitration-boundary-shift.md.
   The intellectually-strong-but-too-careful prior version buried the
   finding under chronology, methodology justification, and
   reconstruction disclaimers. Restructured per user critique:

     - Opens with the finding (boundary-allocation effect), not
       chronology. 4-line lead.
     - Critical delta TABLE added before any responses, surfacing
       the V1-vs-V2 architectural-layer movement explicitly so a
       reader sees the asymmetry before reading the full evidence.
     - "Pass/fail evaluators miss this; preserved inspectable
       evidence does not" moved near the top as the emotional center.
     - "Key observation" section moved ABOVE the Evidence section,
       telling the reader what to look for before the long quoted
       outputs.
     - Section renamed Responses -> Evidence (aligns with project
       identity: preserved comparative material, not chat logs).
     - Bold "Boundary-allocation marker" callouts inline after each
       full response to surface the in-resolver vs out-of-resolver
       distinction without making the reader hunt.
     - "What this does NOT support" rewritten as "Scope" — more
       confident, less defensive, same restraint.
     - Phenomenon named ONCE as "boundary-allocation effect" and
       reused consistently (was drifting across "boundary shift",
       "architectural pressure", "interpretive structure",
       "consumer surface").
     - Ending tightened to land on the boundary-allocation takeaway,
       reproduction details compressed.
     - Chronology + archaeological-retrieval justification + restraint
       discipline moved to a Methodology appendix at the end.

The substance is preserved end-to-end; only the structure, ordering,
and naming consistency changed. The bundled run's empirical findings
are integrated honestly throughout.
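To make the threshold arithmetic in item 2 concrete, here is a small sketch of a cosine score compared against a fixed cutoff. It is not the project's semantic_equivalence implementation: the function names are hypothetical, and only the 0.80 default and the 0.69-0.76 score range come from this commit message.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def below_threshold(scores: Sequence[float], threshold: float = 0.80) -> list[float]:
    """Return the scores that fall short of the equivalence threshold."""
    return [score for score in scores if score < threshold]
```

With the reported range, `below_threshold([0.69, 0.76])` returns both scores, which is why a cutoff tuned for short factual answers flags long structured responses even when the meaning is preserved.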

Suite: 541 passing. Ruff clean.

Rewrites the README around the soul sentence "AI evals are anecdotes
unless the evidence survives the run." Reorganizes so a reader sees the
tool in action before any philosophy.

Structural changes:
- Hero radically compressed; tagline + short code block + emotional callout
- 5-minute proof moved up; workflow sequence Run → Run → Diff → Inspect
  → THEN show the YAML spec
- New lifecycle ASCII diagram between proof and core idea
- "The core idea" compressed to 3-bullet pattern
- New "Typical uses" section with 5 concrete use cases
- "What kind of tool is this?" (SBOM / SARIF / OpenTelemetry table) moved
  from near-top to after case studies — peer comparison is for readers
  who already understand the utility
- Architecture compressed to plain ASCII (Mermaid removed) + short prose
- "What FalsifyAI is not" preserved but tighter
- Status banner, compliance callout, "Status and roadmap", and license
  preserved verbatim

Rationale: prior README opened with worldview and ontology; readers had
to read three sections before seeing what the tool actually does. New
order: utility → mechanism → applications → comparison → philosophy.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Restructures around four sections that progressively reveal the
project: operational → mechanism → architecture → institutional. The
reader gradually discovers (1) this solves a real problem, (2) the
workflow is compelling, (3) the artifact is the key abstraction, (4)
this is actually evidence infrastructure.

Section 1 — Operational wedge
- Hero (radically shorter; tagline + 3-line code block + one callout)
- "Why this matters" — emotional problem statement (NEW); a missed
  customer-flagged regression and how stochastic drift discards the
  evidence pass/fail evaluators need a week later
- "Typical uses" — 5 concrete applications, no philosophy
- "The 5-minute proof" — OUTPUT FIRST. Every snippet captured
  verbatim from the bundled case-study replay store (real session
  ids, real verdicts, real perturbation strings). Sequence: baseline
  run output → candidate run output → diff (exit 5) → inspect (with
  the actual U+202F-bearing model outputs preserved) → THEN the spec.
  The "preserved" list lives at the tail of the proof rather than as
  a separate section (compressed per pushback).

Section 2 — Working mental model
- "Core concepts" — 3-bullet pattern compressed
- "Case studies" — moved UP; now the strongest evidence section
- "CLI reference" with CI integration sub-section
- "What FalsifyAI is not" — moved up to scope-stabilize before the
  worldview section (reader has just seen artifacts; risk of
  projecting "observability platform" / "governance suite")

Section 3 — Architectural worldview (delayed until reader wants it)
- "What kind of tool is this?" (SBOM/SARIF/OpenTelemetry table) —
  category crystallization, only after operational flow lands
- "Architecture" compressed ~50%; ASCII diagram + 2 paragraphs;
  depth delegated to docs/ARCHITECTURE.md (already exists)
- "Resolver predictability" — preserved as differentiator, tighter

Section 4 — Institutional / long-term
- Compliance callout
- Status and roadmap (preserved verbatim)
- Further reading (NEW)
- Local development + License

Removed sections (depth lives in dedicated docs):
- "The lifecycle" ASCII (superseded by Section 1 flow + ARCHITECTURE.md
  data flow)
- "What's in the evidence" H2 (compressed into proof tail; full
  protocol semantics in EVIDENCE.md)
- "Examples" table (case studies serve this role better; examples/
  directory remains discoverable)
- "Writing your own spec" minimal example (5-min proof already shows
  a YAML spec; minimal spec remains in tests/fixtures/)

Methodological discipline preserved: NO synthesized terminal output.
Every code block in the proof is captured verbatim from running
falsifyai against docs/case-studies/data/case-study-replays.db. The
U+202F invisible-character evidence is the actual model output, not a
representative example. The README itself now embodies the project's
preservation discipline.
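Because U+202F (NARROW NO-BREAK SPACE) renders like an ordinary space, it helps to be able to locate such characters in a preserved output. A minimal sketch using the standard `unicodedata` module; the helper name `find_invisible` and the choice of categories (space separators other than U+0020, plus format characters) are assumptions.

```python
from __future__ import annotations

import unicodedata


def find_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for look-alike or invisible characters.

    Flags space separators other than the ASCII space (category Zs)
    and format characters (category Cf).
    """
    hits = []
    for index, char in enumerate(text):
        if char != " " and unicodedata.category(char) in {"Zs", "Cf"}:
            hits.append((index, f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN")))
    return hits
```

For example, `find_invisible("10:30\u202fAM")` reports a single hit at index 5, while the same string with a regular space reports none.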

Preserved verbatim: status banner, "Coming next" roadmap framing,
compliance callout, version metadata, link references, license.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@ericckzhou
Owner Author

Superseded by clean release branch PR (avoids ghost-history conflict from v0.4.0 squash merge). Dev work fully preserved on PRs #38, #39, #40.
