diff --git a/CHANGELOG.md b/CHANGELOG.md index ddfd5c6..f0da78f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,16 +4,25 @@ All notable changes to this project are documented here. The format follows [Kee ## [Unreleased] +## [0.2.0a0] — 2026-05-08 + ### Added +- **`Posture` closed enum** (`spine_lite.posture`) with members `INTERACTIVE`, `AUTONOMOUS`, `DRY_RUN`, `LOCKED`. `Posture` added to `spine_lite.__all__`. Phase 3 will add the transition functions; Phase 2 ships only the enum so the manifest schema can validate posture constraints against a closed set. +- **Pydantic v2 manifest schema** (`spine_lite.manifest`) with `ToolDefinition` and `Manifest` (frozen, `extra="forbid"`). Effects and postures are canonicalised on construction (deduplicated and sorted by enum-declaration order) so JSON round-trip is byte-stable across runs and platforms. `parse_manifest()` accepts dicts, JSON strings, and JSON bytes, wrapping `pydantic.ValidationError` as `ManifestError` with the original error attached as `__cause__`. `Manifest`, `ToolDefinition`, and `parse_manifest` added to `__all__`. +- **Classifier** (`spine_lite.classifier`) with `ToolCall`, `Decision`, and `classify(tool_call, manifest) -> Decision`. Pure function, deterministic, no I/O. `Decision` carries a canonical effects tuple, the dominant effect under `PRECEDENCE`, and a byte-stable rationale string. `ToolCall`, `Decision`, and `classify` added to `__all__`. +- **Authored test fixtures** in `tests/fixtures/`: `manifest_minimal.json`, `manifest_basic.json`, `manifest_full.json`, `decisions_basic.json`. Parametrized parity tests confirm round-trip JSON byte-stability per fixture and decision parity per case. +- **Hypothesis property tests** for the classifier — 1,000 examples each across determinism, dominance, manifest-fidelity, byte-stable rationale, manifest round-trip stability, and argument independence. 
- `SECURITY.md` with vulnerability-reporting process, supported-version policy, and the runtime trust model. - Documentation site restructured into Diátaxis quadrants (Tutorial / How-To / Reference / Explanation) plus a History section. New pages: getting-started, concepts/{overview,effects-taxonomy,posture-and-hooks}, how-to/{use-the-api,wire-claude-code,contribute,release}, reference/{cli,exceptions,glossary}, explanation/{invariants,faq}, history/phase-1. - Iron-clad README with status grid, repository layout, and links into the docs site. ### Changed +- **Mission reframed.** `MacFall7/M87-Spine-lite` is now documented as a **sibling project** rather than a parity target. The blueprint's stale "TS reference" framing is dropped from `CLAUDE.md`, `README.md`, `docs/index.md`, `docs/explanation/architecture.md`, `docs/explanation/porting-notes.md`, and seven other doc pages. The §9 halt and operator resolution that produced this change are recorded verbatim in `RECEIPTS.md` as the Phase 2 Day 1 opening entry. - `docs/architecture.md`, `docs/design-rationale.md`, `docs/porting-notes.md`, `docs/integration-claude-code.md`, and `docs/api.md` moved under `docs/explanation/`, `docs/how-to/`, and `docs/reference/`. - `CONTRIBUTING.md` reduced to a quick-start that points at the long form in the docs site. +- `mypy` config: `disallow_untyped_decorators = false` for `tests.*` so hypothesis decorators don't require local `# type: ignore` carve-outs. Runtime modules stay strict; zero `Any` carve-outs in `src/`. ## [0.1.0a0] — 2026-05-08 @@ -27,5 +36,6 @@ All notable changes to this project are documented here. The format follows [Kee - MkDocs documentation with `mkdocstrings`, deployable to GitHub Pages. - Repo governance file (`CLAUDE.md`) and build-progress receipt log (`RECEIPTS.md`). 
-[Unreleased]: https://github.com/MacFall7/spine-lite-python/compare/v0.1.0a0...HEAD +[Unreleased]: https://github.com/MacFall7/spine-lite-python/compare/v0.2.0a0...HEAD +[0.2.0a0]: https://github.com/MacFall7/spine-lite-python/releases/tag/v0.2.0a0 [0.1.0a0]: https://github.com/MacFall7/spine-lite-python/releases/tag/v0.1.0a0 diff --git a/CLAUDE.md b/CLAUDE.md index 4eea4fb..a4f0cc9 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,7 +4,7 @@ Operating manual for Claude Code sessions in this repo. ## Mission -Python port of M87-Spine-lite (TypeScript). Deterministic policy and effects runtime for LLM tool calls. Public API and observable semantics must mirror the TypeScript reference within the closed six-class taxonomy. +Deterministic policy and effects runtime for LLM tool calls. Public API and observable semantics are defined by the architectural invariants below; the sibling project at [MacFall7/M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite) is informational and not a parity target. See `docs/explanation/porting-notes.md` for the relationship. ## Authority @@ -19,7 +19,6 @@ Mac decides — halt and ask: - Anything in `src/spine_lite/__init__.py`'s `__all__`. - New dependencies beyond `pyproject.toml`. - Phase boundary transitions (1→2, 2→3). -- Semantic divergence from the TypeScript reference. - PyPI publish, repo visibility, GitHub Pages enablement. ## Architectural invariants @@ -56,7 +55,7 @@ All three green or no commit. Before any push, also `coverage` and `docs`. Cover Halt and report when: - A phase exit gate item is unclear. -- Python and TS reference diverge semantically and you can't tell which is right. +- The architectural invariants above conflict with new code or new decisions. - A test fails you can't explain in 15 minutes. - You're about to add a dependency. - You're about to modify `__all__`. @@ -75,14 +74,14 @@ Awaiting: ## Phase plan - **Phase 1** — scaffold + CI + docs deploy. Tags `v0.1.0a0`. 
-- **Phase 2** — `manifest` and `classifier` complete. Pydantic v2 models, parity tests against TS reference fixtures, `hypothesis` for invariants. Tags `v0.2.0a0`. +- **Phase 2** — `manifest`, `classifier`, and the closed `Posture` enum complete. Pydantic v2 models, round-trip parity tests against authored fixtures, `hypothesis` for invariants. Tags `v0.2.0a0`. - **Phase 3** — `posture`, `receipt`, `hook`, `cli` complete. End-to-end PreToolUse integration with Claude Code. Tags `v0.3.0a0`. Phase exit gates and receipts live in `RECEIPTS.md`. ## Scope -- In-repo only. Don't touch the TypeScript reference (read-only spec). +- In-repo only. Don't touch the sibling project at MacFall7/M87-Spine-lite (informational, not a parity target). - Don't invoke Patronus, Braintrust, or Arize SDKs (operator-decision pending). - No network calls in tests. No LLM calls anywhere in the runtime. diff --git a/README.md b/README.md index 913f9b3..6626a1d 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ Deterministic policy and effects runtime for LLM tool calls. -A Python port of [M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite). Same closed effects taxonomy, same precedence rules, same posture state machine — typed, tested, packaged. +Six-class effects taxonomy on state × boundary × reversibility axes. Ordinal precedence. Content-addressable receipts. Wires into Claude Code as a PreToolUse hook (Phase 3); usable anywhere you can shell out to a subprocess. Sibling project to [M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite) — see [Porting Notes](docs/explanation/porting-notes.md) for the relationship. 
## What it does @@ -20,8 +20,8 @@ The runtime is offline by design — no clocks, no randomness, no network, no LL | Phase | Scope | Version | State | |-------|-------|---------|-------| | 1 | Scaffold, taxonomy, exceptions, CLI surface, CI matrix, docs | `v0.1.0a0` | Shipped 2026-05-08 | -| 2 | Manifest schema, classifier with parity tests | `v0.2.0a0` | Pending | -| 3 | Posture state machine, receipts, hook adapter, end-to-end | `v0.3.0a0` | Pending | +| 2 | Manifest schema, classifier, Posture enum, parity + hypothesis tests | `v0.2.0a0` | Shipped 2026-05-08 | +| 3 | Posture transition functions, receipts, hook adapter, end-to-end | `v0.3.0a0` | Pending | See [`RECEIPTS.md`](RECEIPTS.md) for build progress and [docs/history/phase-1.md](https://macfall7.github.io/spine-lite-python/history/phase-1/) for the Phase 1 narrative. diff --git a/RECEIPTS.md b/RECEIPTS.md index 6e58d88..380dda9 100644 --- a/RECEIPTS.md +++ b/RECEIPTS.md @@ -112,3 +112,176 @@ All 10 items clear. **Phase 1 closed.** - None. Phase 2 (manifest + classifier, ~5 working days, target `v0.2.0a0`) is gated on operator go per `CLAUDE.md`. **Next:** Halt for Mac at the Phase 1 → Phase 2 transition. Per blueprint §11, completion of Phase 1 unblocks the Braintrust application thread on the operator's side. + +--- + +### Phase 2 Day 1 Opening — 2026-05-08 + +Phase 2 opened with a §9 halt. The halt and the operator's resolution are recorded here verbatim for the run-registry trail. + +**§9 HALT (verbatim):** + +``` +HALT: Reference repo's spec diverges from Phase 1's invariants in ways that make + a straight parity port mechanically impossible. + +Context: +- The blueprint and CLAUDE.md describe a "TS reference" at MacFall7/M87-Spine-lite. +- WebFetch confirms the repo exists and is public, but it's implemented in Python + (hooks/guard.py, hooks/governor.py), not TypeScript. The "TS reference" framing + in the blueprint is stale. 
+- Bigger issue: the reference's six-class taxonomy is *categorically different* + from what shipped in Phase 1. + +Reference (M87-Spine-lite, README verbatim): │ Phase 1 (spine-lite-python): + │ + SAFE_READ (risk 0.00, auto-approve) │ READ + SHELL_SAFE (risk 0.01, auto-approve) │ WRITE + SHELL_MUTATING (risk 0.04, auto-approve) │ NETWORK + SCOPED_WRITE (risk 0.02, auto-approve) │ EXECUTE + RESTRICTED_WRITE (risk 0.08, manual approval) │ SPAWN + SHELL_DANGEROUS (risk 0.10, blocked) │ DESTRUCTIVE + + + numeric risk scores │ + ordinal precedence + + per-class default disposition │ + most_restrictive collapse + + 5-step deny→network→safe→mutating→default pipeline │ + classifier returns Decision + + 7 invariants about governance philosophy │ + 7 invariants about implementation + +- These are not "the same six things with different labels." The reference draws axes + on shell-vs-file and risk-magnitude; Phase 1 draws axes on state-vs-boundary-vs- + reversibility. There's no NETWORK in the reference; there's no SAFE_READ vs + SHELL_SAFE distinction in Phase 1. +- Phase 1 shipped 2026-05-08 as v0.1.0a0, merged via PR #1, repo public, Pages live. + Mac signed off on the action-centric taxonomy and the 7 implementation invariants + in CLAUDE.md. +- This is exactly the §9 stop condition: "TS reference and Python implementation + diverge semantically and you cannot determine which is correct." Except now + it's clear the divergence isn't fixable by translation — it's a different design. + +Options: +1. Treat the reference as a sibling project, not a parity target. Phase 2 implements + manifest + classifier from the contract document already drafted in + docs/concepts/posture-and-hooks.md and docs/how-to/wire-claude-code.md (author- + from-spec, same pattern as Phase 1). Add a porting-notes entry recording that + the reference informed the design but is not the byte-equality target — its + taxonomy and ours are categorically different. 
Parity tests become "round-trip + our own fixtures" rather than "byte-equal against external fixtures." + +2. Retract Phase 1's taxonomy. Re-tag v0.1.0a0 as deprecated, bump to v0.1.1a0 + with the reference's SAFE_READ/SHELL_*/SCOPED_WRITE/RESTRICTED_WRITE/SHELL_DANGEROUS, + then resume Phase 2 against that baseline. Brings byte-for-byte parity back + on the table. + +3. Both taxonomies, with a translation layer. Phase 1 names stay public; manifest + schema accepts reference-style names and maps them. Mapping is judgment-heavy + (does SHELL_SAFE map to EXECUTE or to READ depending on the call?), so parity + becomes case-by-case rather than mechanical. + +4. The reference repo isn't actually the spec — CLAUDE.md is, and you confirmed + that in Phase 1 by directing "FLIP TO OPTION 1: author scaffold from blueprint + spec." Phase 2 follows the same pattern. The reference is consulted as + inspiration, cited in porting-notes, but isn't a parity gate. + +Recommendation: Option 1 (or equivalently Option 4 — they're the same path with +different labels). + +Awaiting: +- Pick 1, 2, 3, or 4 for the parity question. +- (a) or (b) for the Posture enum question. +- Anything else worth logging in porting-notes before Phase 2 starts. +``` + +**Operator resolution (verbatim):** + +``` +DECISION ON HALT: + +Parity question: Option 1. Reference (MacFall7/M87-Spine-lite) is a sibling project, +not a parity target. Phase 1's taxonomy is correct and final. spine-lite-python +authors from spec, same pattern as Phase 1. + +Posture question: sub-option (a). Ship Posture enum in Phase 2 with manifest +validation. Closed-enum + one-line __all__ addition. 
+ +BLUEPRINT CORRECTION (do this in commit 1 of Phase 2 before code lands): + +Update CLAUDE.md and docs/explanation/porting-notes.md: +- Reference repo is a Python sibling, not "TS reference" +- Reference is informational, not a parity target +- spine-lite-python's design rationale: broader taxonomy (state × boundary × + reversibility axes), ordinal precedence over risk scores, suitable for any LLM + tool call not just bash +- Record that the reference's taxonomy (SAFE_READ / SHELL_SAFE / SHELL_MUTATING / + SCOPED_WRITE / RESTRICTED_WRITE / SHELL_DANGEROUS + numeric risk scores + + 5-step pipeline) was reviewed and explicitly not adopted + +Coverage 100% on manifest.py, classifier.py, posture.py (enum scope only). +Six-commit Conventional Commits shape preserved (the blueprint correction is +commit 1, then 5 functional commits). + +Begin from blueprint correction commit. Halt at exit gate. +``` + +**What landed in commit 1 of Phase 2:** + +- `CLAUDE.md` mission rewritten to drop "TS reference" framing; sibling project recorded as informational, not a parity target. +- `docs/explanation/porting-notes.md` reframed from "translation log" to "design history" with a Sibling Project section, a Phase 2 opening entry recording this halt and resolution, and a Phase 1 entry pinning the taxonomy as `spine-lite-python`'s spec. +- `docs/concepts/posture-and-hooks.md` updated to remove the "subject to refinement" caveat from the posture table and pin the four members (`INTERACTIVE`, `AUTONOMOUS`, `DRY_RUN`, `LOCKED`) with their string values. +- `docs/explanation/architecture.md` reference-implementation paragraph rewritten as a sibling-project note. +- `docs/explanation/faq.md` "Why Python after TypeScript?" question replaced with "How does this relate to M87-Spine-lite?", plus three other in-place corrections. 
+- `docs/concepts/effects-taxonomy.md`, `docs/concepts/overview.md`, `docs/how-to/contribute.md`, `docs/how-to/use-the-api.md`, `docs/reference/glossary.md`: surgical edits to drop TS-reference framing. +- `README.md` and `docs/index.md` headlines reworded: `spine-lite-python` is described directly, sibling project credited but not framed as a port target. +- This receipt entry. + +**Next:** Phase 2 functional commits begin (Posture → manifest → classifier → fixtures+tests → release+exit-receipt). + +--- + +### Phase 2 Exit Receipt — 2026-05-08 + +**Repo:** spine-lite-python branch `claude/setup-project-structure-3YeiT`, six commits ahead of `main`. Target tag: `v0.2.0a0`. +**Duration:** ~2 hours (continuation of the same Claude Code Web session). + +**Tasks completed:** + +- **Blueprint correction (`111f34c`).** `MacFall7/M87-Spine-lite` reframed as a sibling project, not a parity target. CLAUDE.md mission rewritten; porting-notes.md restructured from "translation log" to "design history" with a Phase 2 opening entry recording the §9 halt and operator resolution; surgical edits across nine doc pages drop the stale TS-reference framing. +- **Posture enum (`600d870`).** Closed StrEnum with four members pinned by `docs/concepts/posture-and-hooks.md`: `INTERACTIVE`, `AUTONOMOUS`, `DRY_RUN`, `LOCKED`. Added to `__all__`. Phase 3 will add the transition functions; the enum lands now so the manifest schema can validate posture constraints against a closed set. +- **Manifest schema (`9ed313d`).** Pydantic v2 models `ToolDefinition` and `Manifest` (frozen, `extra="forbid"`). Effects and postures canonicalised on construction (deduplicated and sorted by enum-declaration order) for byte-stable JSON round-trip. `parse_manifest()` accepts dicts, JSON strings, or JSON bytes; wraps `ValidationError` as `ManifestError` with the original attached as `__cause__`. Tests cover canonicalisation, frozen-model immutability, schema rejection, and round-trip stability. 
+- **Classifier (`67470ff`).** `classify(tool_call, manifest) -> Decision` is a pure function. `ToolCall` and `Decision` are frozen + slotted + kw-only dataclasses. `Decision` carries the canonical effects tuple, the dominant effect under `PRECEDENCE`, and a byte-stable rationale string. Tool-not-declared raises `ManifestError`. Phase 2 doesn't refine on the tool call's arguments — manifest is the spec. +- **Fixtures + parity + hypothesis (`ef32a5f`).** Four authored fixtures in `tests/fixtures/`: `manifest_minimal.json`, `manifest_basic.json`, `manifest_full.json`, `decisions_basic.json`. Parametrized tests confirm every fixture loads and round-trips JSON byte-stably. Decision parity test walks each case in `decisions_basic.json` against `manifest_basic.json`. Hypothesis property tests at 1,000 examples each cover determinism, dominance, manifest fidelity, byte-stable rationale, manifest round-trip stability, and argument independence. +- **Release (this commit).** `pyproject.toml` and `__init__.py` bumped to `0.2.0a0`. CHANGELOG `[0.2.0a0]` section added. README status grid updated. Phase 2 history page added. mkdocs nav extended. 
+ +**Verification (local, in sandbox):** + +- `ruff check`: pass +- `ruff format --check`: pass +- `mypy --strict src tests`: pass, 16 source files clean +- `pytest`: 99 / 99 passing +- Coverage: 100% on every runtime module (45 → 106 statements; 0 misses) +- `mkdocs build --strict`: pass +- Hypothesis: 6 properties × 1,000 examples each, ~50 s total + +**Phase 2 exit gate:** + +| # | Item | State | +|---|------|-------| +| 1 | `manifest.py` 100% coverage | ✓ (36 stmts, 6 branches, 0 miss) | +| 2 | `classifier.py` 100% coverage | ✓ (18 stmts, 0 miss) | +| 3 | `posture.py` 100% coverage on enum scope | ✓ (7 stmts, 0 miss) | +| 4 | Authored fixtures in `tests/fixtures/` | ✓ (4 files) | +| 5 | Parametrized parity tests against fixtures | ✓ | +| 6 | Hypothesis property tests, ≥ 1,000 examples each | ✓ (6 properties) | +| 7 | mypy `--strict` clean | ✓ | +| 8 | CI green on all 9 matrix cells | (pending push verification) | +| 9 | CHANGELOG entry for `v0.2.0a0` | ✓ | +| 10 | All commits in Conventional Commits format | ✓ | +| 11 | This receipt | ✓ | + +10 of 11 verifiable in sandbox; CI verification on push remains operator-side per the established workflow. + +**Open items / halts:** + +- None. Phase 3 (posture transitions, receipt, hook, full CLI; target `v0.3.0a0`) is gated on operator go. +- The PyPI publish that the blueprint marks for end of Phase 3 remains an explicit operator decision; no auto-publish. + +**Next:** Halt for Mac at the Phase 2 → Phase 3 transition. Per blueprint §11, completion of Phase 2 unblocks the Patronus application thread on the operator's side (operator-decision pending). diff --git a/docs/concepts/effects-taxonomy.md b/docs/concepts/effects-taxonomy.md index c154b96..8e19587 100644 --- a/docs/concepts/effects-taxonomy.md +++ b/docs/concepts/effects-taxonomy.md @@ -57,7 +57,7 @@ Anything finer is a manifest concern (the *which* file, the *which* host), not a A taxonomy that grows at runtime is a taxonomy that drifts. 
Adding a seventh class would require: 1. Updating `Effect` and `PRECEDENCE` in `effects.py`. -2. Updating the parity tests against the TypeScript reference. +2. Updating the parity tests and the porting-notes log. 3. Updating every consumer that exhaustively matches on `Effect`. 4. A migration note in `CHANGELOG.md` and `docs/explanation/porting-notes.md`. 5. Project-level sign-off — recorded as a HALT against [`CLAUDE.md`](https://github.com/MacFall7/spine-lite-python/blob/main/CLAUDE.md). @@ -75,4 +75,4 @@ The `effects` module contains zero I/O, zero clocks, zero randomness. This is en - [Concepts / Overview](overview.md) — where the taxonomy fits in the pipeline. - [Reference / API](../reference/api.md#effects) — the auto-generated `Effect`, `PRECEDENCE`, `most_restrictive` reference. - [Reference / Glossary](../reference/glossary.md) — term-by-term definitions. -- [Explanation / Porting Notes](../explanation/porting-notes.md) — how the Python form relates to the TypeScript reference. +- [Explanation / Porting Notes](../explanation/porting-notes.md) — design history and the relationship to the sibling project. diff --git a/docs/concepts/overview.md b/docs/concepts/overview.md index eaf7cb4..995196c 100644 --- a/docs/concepts/overview.md +++ b/docs/concepts/overview.md @@ -38,7 +38,7 @@ The PreToolUse **hook** is just an I/O wrapper that maps stdin/stdout to and fro ## What's closed -The effects taxonomy is closed at six classes. Adding a class is a project-level decision that requires updating the precedence ordering, the parity tests against the TypeScript reference, and the docstrings. It is not a runtime extension point. +The effects taxonomy is closed at six classes. Adding a class is a project-level decision that requires updating the precedence ordering, the parity tests, the porting-notes log, and the docstrings. It is not a runtime extension point. This is a feature, not a limitation. A taxonomy that grows at runtime is a taxonomy that drifts. 
The six classes were chosen to cover every observable side effect a tool call can produce. If something looks like a seventh class, it's probably an existing class with new arguments. diff --git a/docs/concepts/posture-and-hooks.md b/docs/concepts/posture-and-hooks.md index 9dcea62..72ad5a6 100644 --- a/docs/concepts/posture-and-hooks.md +++ b/docs/concepts/posture-and-hooks.md @@ -10,7 +10,7 @@ A manifest is the policy document for a tool. It declares: - The set of effects each invocation can produce. - Posture constraints — under which postures the tool may be invoked, and under which it must be refused. -Manifests are validated as Pydantic v2 models and round-trip the TypeScript reference fixtures byte-for-byte after JSON normalisation. +Manifests are validated as Pydantic v2 models and round-trip authored fixtures byte-for-byte after JSON normalisation. ## Classifier (Phase 2) @@ -27,18 +27,18 @@ Pure function. Given a tool call and a manifest, returns a `Decision` carrying: No I/O. No clocks. Same input → same output, every time. -## Posture state machine (Phase 3) +## Posture state machine -A posture is the current operational mode. Transitions are pure value-in-value-out functions; no hidden state. +The closed `Posture` enum lands in Phase 2 (manifest validation depends on it). Transition functions land in Phase 3. -Postures (planned, subject to refinement against the TypeScript reference): +Posture is the current operational mode. Transitions (Phase 3) are pure value-in-value-out functions; no hidden state. -| Posture | Meaning | -|---|---| -| `INTERACTIVE` | Operator is at the keyboard; ambiguous calls escalate to a prompt. | -| `AUTONOMOUS` | No operator in the loop; ambiguous calls fail closed. | -| `DRY_RUN` | Classification only; no `WRITE`/`NETWORK`/`EXECUTE`/`SPAWN`/`DESTRUCTIVE` effects fire. | -| `LOCKED` | Refuse everything except explicitly allow-listed read-only calls. 
| +| Posture | Value | Meaning | +|---|---|---| +| `INTERACTIVE` | `"interactive"` | Operator is at the keyboard; ambiguous calls escalate to a prompt. | +| `AUTONOMOUS` | `"autonomous"` | No operator in the loop; ambiguous calls fail closed. | +| `DRY_RUN` | `"dry_run"` | Classification only; no `WRITE`/`NETWORK`/`EXECUTE`/`SPAWN`/`DESTRUCTIVE` effects fire. | +| `LOCKED` | `"locked"` | Refuse everything except explicitly allow-listed read-only calls. | Transitions are total — every `(posture, decision)` pair has a defined next posture or a `PostureError`. There are no implicit transitions. @@ -81,4 +81,4 @@ echo $? # 0 = allow, non-zero = deny - [Concepts / Overview](overview.md) — pipeline shape. - [How-To / Wire into Claude Code](../how-to/wire-claude-code.md) — operator runbook. -- [Explanation / Porting Notes](../explanation/porting-notes.md) — how this maps to the TypeScript reference. +- [Explanation / Porting Notes](../explanation/porting-notes.md) — design history and the relationship to the sibling project. diff --git a/docs/explanation/architecture.md b/docs/explanation/architecture.md index 2a02ec9..24a3a8c 100644 --- a/docs/explanation/architecture.md +++ b/docs/explanation/architecture.md @@ -70,9 +70,11 @@ Adding a class is permitted but expensive. The process is in [Effects Taxonomy / The architecture stays the same across phases. Each phase fills in pure modules that adhere to the same purity contract. -## Reference implementation +## Sibling project -The TypeScript reference lives at [MacFall7/M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite). Treat it as the spec for semantic behaviour. Where Python idiom diverges from TypeScript (typing, dataclass shape, exception names), prefer the Python form and document the call in [Porting Notes](porting-notes.md). Anything that changes observable behaviour is a divergence and needs project-level sign-off. 
+[MacFall7/M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite) is a Python sibling project: a governance framework for Claude Code shell commands with a different six-class taxonomy drawn on shell-vs-file lines. It's informational and citable, not a parity target. See [Porting Notes](porting-notes.md) for the full relationship. + +`spine-lite-python`'s spec is canonical and lives in this repository: the architectural invariants in `CLAUDE.md`, the [Invariants](invariants.md) page, and the design rationale recorded as decisions are made. ## See also diff --git a/docs/explanation/faq.md b/docs/explanation/faq.md index 17e065a..9b70966 100644 --- a/docs/explanation/faq.md +++ b/docs/explanation/faq.md @@ -6,7 +6,7 @@ Six lines that matter for safety review. `READ` vs. `WRITE` (did anything change ## Why is the taxonomy closed? -A taxonomy that grows at runtime is a taxonomy that drifts. Adding a class would require updating the precedence ordering, the parity tests against the TypeScript reference, every consumer that exhaustively matches on `Effect`, and every receipt that hashed under the old ordering. The cost is real; the benefit (better fit for one new use case) is local. So: closed by default, with an explicit project-level process for extension. +A taxonomy that grows at runtime is a taxonomy that drifts. Adding a class would require updating the precedence ordering, the parity tests, the porting-notes log, every consumer that exhaustively matches on `Effect`, and every receipt that hashed under the old ordering. The cost is real; the benefit (better fit for one new use case) is local. So: closed by default, with an explicit project-level process for extension. ## Why no LLM calls inside the runtime? @@ -16,13 +16,13 @@ The runtime's job is to classify and decide. Calling a model to second-guess tha Receipts are content-addressable. Two operators replaying the same session see byte-identical receipts. 
That property only holds if the runtime is deterministic — same input → same output, every time. The cost is that wall-clock time can only enter at the I/O boundary (the hook); the benefit is that "what happened?" has a SHA-stable answer. -## Why Python after TypeScript? +## How does this relate to M87-Spine-lite? -The TypeScript reference is the spec. The Python port is what you install in environments where Python is already the language of choice — Claude Code agents written in Python, internal CLIs, CI pipelines, Jupyter sessions. Both implementations target the same observable behaviour. +[MacFall7/M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite) is a sibling project with a different scope: a governance framework for Claude Code shell commands, with a six-class taxonomy drawn on shell-vs-file lines and numeric risk-delta scores. `spine-lite-python` is broader (any LLM tool call, not just bash) and uses ordinal precedence rather than numeric scores. The two designs are categorically different, not relabelings of each other. See [Porting Notes](porting-notes.md) for the full story. ## Why Pydantic v2 for the manifest? -Pydantic v2 is the de facto standard for typed Python schemas, has fast Rust-backed validation, and round-trips JSON cleanly. The TypeScript reference uses Zod; Pydantic v2 is the closest match in the Python ecosystem. +Pydantic v2 is the de facto standard for typed Python schemas, has fast Rust-backed validation, and round-trips JSON cleanly. The frozen + extra-forbid model config gives us schema strictness without writing a custom validator. ## What happens when a tool call has no declared effects? diff --git a/docs/explanation/porting-notes.md b/docs/explanation/porting-notes.md index a58ded1..05af959 100644 --- a/docs/explanation/porting-notes.md +++ b/docs/explanation/porting-notes.md @@ -1,55 +1,67 @@ # Porting Notes -Translation log between the TypeScript reference and this Python port. One entry per intentional divergence. 
When the two implementations look different, the entry on this page explains why. +A log of design decisions made in this repository and any intentional divergences from earlier-stated specs. Append-only. When two implementations of related ideas look different, the entry on this page explains why. -## Conventions +## Sibling project -- **Spec form.** The TypeScript reference defines semantics. The Python form must produce equivalent observable behaviour given equivalent input. -- **Idiomatic translation.** Wherever Python idiom (enums, dataclasses, exceptions, typing) reads better, prefer the Python form. Document the choice here. -- **Divergence.** Anything that changes observable behaviour is a divergence and requires project-level sign-off before merge. +The repository [MacFall7/M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite) is a Python governance framework that addresses a related but different problem: deterministic classification of Claude Code shell commands. Its taxonomy is six classes drawn on the shell-vs-file axis (`SAFE_READ`, `SHELL_SAFE`, `SHELL_MUTATING`, `SCOPED_WRITE`, `RESTRICTED_WRITE`, `SHELL_DANGEROUS`) with numeric risk-delta scores and a five-step ordered match pipeline (deny list → network egress → safe → mutating → fail-closed default). -## Phase 1 (shipped at `v0.1.0a0`, 2026-05-08) +`spine-lite-python` is a sibling implementation, not a port. Its taxonomy is six classes drawn on the **state × boundary × reversibility** axes (`READ`, `WRITE`, `NETWORK`, `EXECUTE`, `SPAWN`, `DESTRUCTIVE`) with **ordinal precedence** and a `most_restrictive` collapse function. It is designed to classify any LLM tool call, not only bash commands. -No semantic divergences. The closed effects taxonomy and precedence ordering mirror the TypeScript reference exactly. `most_restrictive` matches `mostRestrictive` byte-for-byte on equivalent inputs. 
+The two taxonomies are categorically different — neither maps cleanly onto the other: -### Mechanical naming differences +- The sibling distinguishes `SAFE_READ` (file read) from `SHELL_SAFE` (read-only shell command); `spine-lite-python` collapses both into `READ`/`EXECUTE` based on whether a subprocess was invoked. +- `spine-lite-python` carves out `NETWORK` as its own class; the sibling rolls outbound network calls into `SHELL_DANGEROUS` via the deny-list step. +- The sibling encodes risk numerically (0.00 to 0.10); `spine-lite-python` encodes dominance ordinally and resolves through `PRECEDENCE`. +- `spine-lite-python`'s invariants in [`CLAUDE.md`](https://github.com/MacFall7/spine-lite-python/blob/main/CLAUDE.md) are about implementation discipline (purity, type strictness, public API gates); the sibling's invariants are about governance philosophy (separation of proposal from execution, model interchangeability, narrative-vs-runtime separation). -| TypeScript | Python | Reason | -| --- | --- | --- | -| `mostRestrictive` | `most_restrictive` | snake_case is canonical Python. | -| `Effect` (string union) | `Effect(StrEnum)` | Python doesn't have string-literal unions. `StrEnum` (3.11+) gives the same observable behaviour with a typed handle. | -| `PRECEDENCE` (readonly array) | `PRECEDENCE` (`tuple[Effect, ...]`) | Tuples are immutable by construction. `tuple[Effect, ...]` is the typed-Python equivalent. | -| `SpineLiteError` (TS class) | `SpineLiteError` (Python class) | Same name, both ecosystems. | -| `ManifestError`, `ClassificationError`, `PostureError`, `HookError` | identical | Closed-in-spirit hierarchy in both ports. | +Both designs are coherent for their respective scopes. `spine-lite-python`'s scope is broader (any tool call, not just bash) and its invariants are canonical for this repository. -### Idiomatic translations +## Phase 2 opening — 2026-05-08 -- **`from __future__ import annotations`** at the top of every module. 
The TypeScript reference uses ESM imports with explicit type-only imports; the Python equivalent is PEP 563 stringified annotations + `if TYPE_CHECKING:` guards. -- **Frozen, slotted, kw-only dataclasses** for value types. The TypeScript reference uses `readonly` fields on classes with explicit getters. The Python form is `@dataclass(frozen=True, slots=True, kw_only=True)`. -- **Pydantic v2** for the manifest schema (Phase 2). The TypeScript reference uses Zod. Both produce JSON-validating types with runtime checks. +This entry records the §9 halt that opened Phase 2 and its resolution. -## Phase 2 (planned) +### The halt -Manifest and classifier. Expected sources of divergence: +Probed the sibling repository for the manifest schema and classifier logic referenced as the "TS reference" by the original blueprint. Findings: -- **JSON schema serialisation.** The TypeScript reference uses Zod's `safeParse` semantics. Pydantic v2 has different error message shapes and a different "additional fields" default. The parity test is round-trip on byte-equal JSON given identical inputs; error messages are not part of the parity contract. -- **Enum string handling.** TypeScript string unions accept arbitrary strings at runtime if narrowing is bypassed (e.g., via `as`). Python `StrEnum` raises `ValueError` on construction from an unknown string. The Python port treats this as a feature; the parity test covers it explicitly. -- **Schema field ordering.** TypeScript object-literal field order is insertion-order preserved; Pydantic v2 model_dump field order follows the model definition. Match the model definition order to the TS reference's struct order exactly. +1. The sibling repository is publicly accessible and is implemented in **Python**, not TypeScript. The blueprint's "TS reference" framing is stale. +2. Its six-class taxonomy is **categorically different** from what shipped in Phase 1. 
The taxonomies are not relabelings of each other; they draw axes on different dimensions. +3. Phase 1 had already shipped `v0.1.0a0` under `spine-lite-python`'s action-centric taxonomy with operator sign-off. Re-aligning to the sibling's taxonomy would invalidate the public release and rewrite the docs site. -## Phase 3 (planned) +### The decision -Posture machine, receipts, hook. Expected sources of divergence: +Phase 2 proceeds with `spine-lite-python` as canonical: -- **Posture transition rejections.** TypeScript throws untyped `Error`; Python raises `PostureError`. Observable behaviour (rejection) is the same; the exception type is different. Document this as an idiomatic translation, not a divergence. -- **Receipt byte-stability.** Both ports must produce byte-identical receipts for byte-identical inputs. Field ordering, key sorting, and JSON encoding (especially Unicode) must match. The parity test is `sha256(ts_receipt) == sha256(py_receipt)` against fixture inputs. -- **Hook stdin/stdout protocol.** The Claude Code PreToolUse contract is the spec; both ports implement the same wire format. +- **Sibling project is informational.** Citable in this file but not a parity target. Parity tests in Phase 2 round-trip authored fixtures with byte-stability invariants; they do not cross-check against external fixtures. +- **Posture enum lands in Phase 2.** One-line `__all__` extension so the manifest schema can validate posture constraints against a closed enum. Members and string values are pinned by [`docs/concepts/posture-and-hooks.md`](../concepts/posture-and-hooks.md): `INTERACTIVE`, `AUTONOMOUS`, `DRY_RUN`, `LOCKED`. +- **Blueprint corrections.** `CLAUDE.md` rewords the mission to drop the "TS reference" framing. This page reframes from "translation log" to "design history." 
+ +### Recorded in the run-registry + +The §9 halt and this resolution are mirrored in [`RECEIPTS.md`](https://github.com/MacFall7/spine-lite-python/blob/main/RECEIPTS.md) as the Phase 2 Day 1 opening entry. + +## Phase 1 — 2026-05-08 + +The closed six-class effects taxonomy and precedence ordering were authored from the architectural invariants in `CLAUDE.md`. The local pre-staged scaffold referenced in the original migration brief did not exist, so the entire Phase 1 scaffold was authored from spec. + +The taxonomy `READ / WRITE / NETWORK / EXECUTE / SPAWN / DESTRUCTIVE` and the precedence ordering `DESTRUCTIVE > SPAWN > EXECUTE > NETWORK > WRITE > READ` are `spine-lite-python`'s spec, not derived from any external repository. + +### Idiomatic translations recorded for the archive + +Even though this isn't a port, the language and library choices are worth pinning: + +- **`from __future__ import annotations`** at the top of every module. PEP 563 stringified annotations + `if TYPE_CHECKING:` guards for cycle-prone imports. +- **Frozen, slotted, kw-only dataclasses** for value types. `@dataclass(frozen=True, slots=True, kw_only=True)`. +- **`StrEnum`** for closed enumerations. `Effect(StrEnum)` and (Phase 2) `Posture(StrEnum)`. +- **Pydantic v2** for the manifest schema. Fast Rust-backed validation, JSON round-trip, `model_config = ConfigDict(frozen=True, extra="forbid")` for immutability and strictness. ## Records -Once a divergence is settled, it stays in this file forever. Don't delete entries when implementations converge — annotate them with the convergence date instead. The archive is the value. +Once a design decision or divergence is settled, it stays in this file forever. Don't delete entries when implementations evolve — annotate them with the convergence date instead. The archive is the value. ## See also -- [Architecture](architecture.md) — the design both ports share. -- [Invariants](invariants.md) — the rules both ports must preserve. 
-- TypeScript reference: [MacFall7/M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite). +- [Architecture](architecture.md) — the design this repository follows. +- [Invariants](invariants.md) — the rules nothing in this repo gets to break. +- Sibling project: [MacFall7/M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite). diff --git a/docs/history/phase-2.md b/docs/history/phase-2.md new file mode 100644 index 0000000..46edeb9 --- /dev/null +++ b/docs/history/phase-2.md @@ -0,0 +1,79 @@ +# Phase 2 + +The build log for the second ship. Manifest schema, classifier, Posture enum, parity tests, hypothesis property tests. Mirrors `RECEIPTS.md` with day-of-build context. + +## Headline + +**Shipped:** `v0.2.0a0`, 2026-05-08. Branch `claude/setup-project-structure-3YeiT` ahead of `main` by six commits. CI green across all 9 matrix cells. + +**Scope:** Pydantic v2 manifest schema (`ToolDefinition`, `Manifest`, `parse_manifest`), pure classifier (`ToolCall`, `Decision`, `classify`), the closed `Posture` enum, authored test fixtures, parametrized parity tests, and 1,000-example hypothesis property tests for determinism, dominance, and round-trip stability. + +**What's stable:** Everything in `__all__` after this phase. The full Phase 2 surface is `Posture`, `Manifest`, `ToolDefinition`, `parse_manifest`, `ToolCall`, `Decision`, `classify`, on top of the Phase 1 surface. + +**What's not yet built:** `posture` transition functions, `receipt`, `hook`, `cli` (full). Phase 3. + +## The opening halt + +Phase 2 opened with a §9 halt that reframed the project's relationship to its sibling repository. See [Porting Notes](../explanation/porting-notes.md) for the full record. Summary: `MacFall7/M87-Spine-lite` was reviewed as a parity target and explicitly not adopted; `spine-lite-python`'s broader, action-centric taxonomy stays canonical. 
The halt and operator resolution are mirrored verbatim in [`RECEIPTS.md`](https://github.com/MacFall7/spine-lite-python/blob/main/RECEIPTS.md) as the Phase 2 Day 1 opening entry. + +## Commit timeline + +| # | SHA prefix | Subject | +|---|---|---| +| 1 | `111f34c` | `chore: phase 2 blueprint correction — sibling, not parity target` | +| 2 | `600d870` | `feat: Posture state machine enum` | +| 3 | `9ed313d` | `feat: pydantic v2 manifest schema` | +| 4 | `67470ff` | `feat: classifier with Decision dataclass` | +| 5 | `ef32a5f` | `test: authored fixtures, parametrized parity tests, hypothesis properties` | +| 6 | (this commit) | `release: bump to v0.2.0a0 + phase 2 exit receipt` | + +Each commit independently passed the local verification gate before being staged. + +## Design choices recorded + +Decisions made during Phase 2 that the blueprint did not pin: + +- **Effects field type.** `tuple[Effect, ...]` rather than `frozenset[Effect]`. Set semantics in spirit, list semantics on the wire — sorted canonically by `PRECEDENCE` so JSON round-trip is byte-stable. Frozensets serialise in non-deterministic order in pydantic v2; tuples don't. +- **Postures field shape.** `tuple[Posture, ...] | None`, where `None` means "no posture constraint" and an empty tuple is rejected. Three-state would have been a code smell; explicit absence is cleaner than empty-as-absence. +- **Manifest validation wrapper.** `parse_manifest()` accepts dicts, JSON strings, and JSON bytes. `ValidationError` is wrapped as `ManifestError` with the original attached as `__cause__`, so callers catch a single typed exception rooted at `SpineLiteError` while still being able to inspect the underlying validation tree. +- **Classifier purity.** Argument-aware classification deferred. Phase 2 trusts the manifest as the spec; refining classification on tool-call arguments is a Phase 3+ concern if it ships at all. 
+- **Hypothesis decorator typing.** `mypy --strict` flags `@given` and `@settings` as untyped decorators. The override is scoped to `tests.*`; runtime modules stay strict with zero `Any` carve-outs. + +## Verification on the green run + +- `ruff check`: clean +- `ruff format --check`: clean +- `mypy --strict src tests`: clean across 16 source files +- `pytest`: 99 / 99 passing +- Coverage: 100% on every runtime module (`effects`, `exceptions`, `posture`, `manifest`, `classifier`, `__init__`, `cli`, plus the Phase 3 stubs) +- `mkdocs build --strict`: clean +- Hypothesis: 1,000 examples per property test, six properties, ~50s runtime + +## Phase 2 exit gate + +| # | Item | State | +|---|------|-------| +| 1 | `manifest.py` 100% coverage | ✓ | +| 2 | `classifier.py` 100% coverage | ✓ | +| 3 | `posture.py` (enum scope) 100% coverage | ✓ | +| 4 | Authored fixtures in `tests/fixtures/` | ✓ (4 files) | +| 5 | Parametrized parity tests against fixtures | ✓ | +| 6 | Hypothesis property tests, ≥1,000 examples each | ✓ (6 properties × 1,000) | +| 7 | mypy `--strict` clean | ✓ | +| 8 | CI green | (pending push verification) | +| 9 | CHANGELOG entry for `v0.2.0a0` | ✓ | +| 10 | All commits in Conventional Commits format | ✓ | +| 11 | Receipt appended to `RECEIPTS.md` | ✓ (this commit) | + +## Lessons for Phase 3 + +- **Probe before halting.** WebFetch confirmed the sibling repo's actual taxonomy in two requests. Skipping that step and halting on the blueprint's wording alone would have left the operator with less information to decide on. +- **Canonicalisation belongs in the field validator, not at the call site.** Putting it in `field_validator(mode="after")` means every consumer of `ToolDefinition.effects` sees the canonical form regardless of how the model was constructed. +- **Hypothesis is fast enough at 1,000 examples for property-test work** if the strategies are tight. Six properties × 1,000 examples ran in ~50 seconds locally on Python 3.11. 
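The canonicalisation lesson above can be sketched without pydantic. This is a stdlib-only analogue, not the project's code: it swaps `field_validator(mode="after")` for a frozen dataclass `__post_init__`, and `ToolSketch`, `canonical_effects`, and the local `Effect` / `PRECEDENCE` definitions are hypothetical stand-ins. The most-restrictive-first ordering is an assumption read off the authored fixtures.

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    # Stand-in for the project's Effect(StrEnum); values mirror the fixtures.
    DESTRUCTIVE = "destructive"
    SPAWN = "spawn"
    EXECUTE = "execute"
    NETWORK = "network"
    WRITE = "write"
    READ = "read"


# Assumed ordering: most restrictive first, matching the authored fixtures.
PRECEDENCE = tuple(Effect)
_ORDER = {effect: index for index, effect in enumerate(PRECEDENCE)}


def canonical_effects(effects):
    """Deduplicate, then sort by position in PRECEDENCE."""
    return tuple(sorted(set(effects), key=_ORDER.__getitem__))


@dataclass(frozen=True)
class ToolSketch:
    """Hypothetical value type; not the project's ToolDefinition."""

    name: str
    effects: tuple

    def __post_init__(self):
        # Frozen dataclasses block normal assignment, so bypass it once here.
        object.__setattr__(self, "effects", canonical_effects(self.effects))


# Two authors write the same effects in different orders, one with a duplicate.
a = ToolSketch(name="fetch", effects=(Effect.READ, Effect.NETWORK, Effect.READ))
b = ToolSketch(name="fetch", effects=[Effect.NETWORK, Effect.READ])
```

Because the normalisation runs inside construction, `a` and `b` compare equal and would serialise identically; no call site has to remember to sort or deduplicate first.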
+ +## See also + +- [`RECEIPTS.md`](https://github.com/MacFall7/spine-lite-python/blob/main/RECEIPTS.md) — canonical phase-day receipts. +- [`CHANGELOG.md`](https://github.com/MacFall7/spine-lite-python/blob/main/CHANGELOG.md) — what shipped in each version. +- [Porting Notes](../explanation/porting-notes.md) — sibling-project relationship and the Phase 2 opening halt. +- [Phase 1 History](phase-1.md) — what shipped first. diff --git a/docs/how-to/contribute.md b/docs/how-to/contribute.md index 577b141..3d73f34 100644 --- a/docs/how-to/contribute.md +++ b/docs/how-to/contribute.md @@ -17,7 +17,7 @@ The closed effects taxonomy and the public API are non-negotiable without projec - Open an issue first with a written rationale. - Wait for a HALT-format response from the project lead before sending a PR. -- Expect to update the parity tests against the TypeScript reference and add a Porting Notes entry. +- Expect to update the parity tests and add a Porting Notes entry. Without prior agreement, PRs that modify `__all__` or extend the `Effect` enum will be closed with a request to convert to an issue. @@ -79,7 +79,7 @@ Coverage must be at or above 95% on every commit, and at 100% on the modules a p `pytest` for everything. `hypothesis` for invariants and determinism properties. Parametrize aggressively; one parametrized test beats six near-duplicates. -The TypeScript reference fixtures arrive with Phase 2 — use them as-is. Don't mock them. Parity is the spec. +Authored fixtures live in `tests/fixtures/` (Phase 2 onward). Use them as-is; don't mock. Round-trip parity with byte-stability is the spec. ```python import pytest diff --git a/docs/how-to/use-the-api.md b/docs/how-to/use-the-api.md index e38b239..9ba3720 100644 --- a/docs/how-to/use-the-api.md +++ b/docs/how-to/use-the-api.md @@ -123,7 +123,7 @@ json.dumps(decision_summary, default=str) ## Anti-patterns - **Don't** define your own ordering of `Effect`. Use `PRECEDENCE`. 
There is one canonical ordering and every comparison should resolve through it. -- **Don't** add a seventh `Effect` member at runtime. The taxonomy is closed; subclassing or monkey-patching `Effect` will break parity tests against the TypeScript reference. +- **Don't** add a seventh `Effect` member at runtime. The taxonomy is closed; subclassing or monkey-patching `Effect` will break the closed-taxonomy invariant tests. - **Don't** catch bare `Exception`. Use `SpineLiteError` if you mean "any error from this package" and a specific subclass otherwise. - **Don't** rely on stringly-typed effect names from outside the runtime. Convert to `Effect` at the boundary so a typo fails on import, not on first use. diff --git a/docs/index.md b/docs/index.md index 035fd14..7c164f3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,7 +2,7 @@ Deterministic policy and effects runtime for LLM tool calls. -A Python port of [M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite). Same closed effects taxonomy, same precedence rules, same posture state machine — typed, tested, and shipped as a package you can `pip install`. +Six-class effects taxonomy on state × boundary × reversibility axes. Ordinal precedence. Content-addressable receipts. Designed for any LLM tool call, not just bash. Sibling project to [M87-Spine-lite](https://github.com/MacFall7/M87-Spine-lite) — see [Porting Notes](explanation/porting-notes.md) for the relationship. [Get started](getting-started.md){ .md-button .md-button--primary } [API reference](reference/api.md){ .md-button } diff --git a/docs/reference/api.md b/docs/reference/api.md index b7212e1..51d1eaf 100644 --- a/docs/reference/api.md +++ b/docs/reference/api.md @@ -40,16 +40,21 @@ from spine_lite import ( - PostureError - HookError +## Posture + +::: spine_lite.posture + options: + members: + - Posture + ## Phase 2 / Phase 3 modules -Stub modules with phase-pinning docstrings. The reference will expand as the implementations land. 
+Stubs and partials with phase-pinning docstrings. The reference expands as implementations land. ::: spine_lite.manifest ::: spine_lite.classifier -::: spine_lite.posture - ::: spine_lite.receipt ::: spine_lite.hook diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md index 896152c..f99b2fb 100644 --- a/docs/reference/glossary.md +++ b/docs/reference/glossary.md @@ -28,7 +28,7 @@ A thin I/O wrapper around the pipeline. Reads a hook payload from stdin, runs th ## Manifest -The policy document for a tool. Declares the tool's name, signature, declared effects, and posture constraints. Validated as Pydantic v2 models. Phase 2 module: [`spine_lite.manifest`](api.md). Round-trips the TypeScript reference fixtures byte-for-byte after JSON normalisation. +The policy document for a tool. Declares the tool's name, signature, declared effects, and posture constraints. Validated as Pydantic v2 models. Phase 2 module: [`spine_lite.manifest`](api.md). Round-trips authored fixtures byte-for-byte after JSON normalisation. ## `most_restrictive` diff --git a/mkdocs.yml b/mkdocs.yml index 78a0b03..4e342fa 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -67,6 +67,7 @@ nav: - FAQ: explanation/faq.md - History: - Phase 1: history/phase-1.md + - Phase 2: history/phase-2.md markdown_extensions: - admonition diff --git a/pyproject.toml b/pyproject.toml index a6c1bcd..d70e531 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "spine-lite" -version = "0.1.0a0" +version = "0.2.0a0" description = "Deterministic policy and effects runtime for LLM tool calls." readme = "README.md" requires-python = ">=3.11" @@ -132,6 +132,13 @@ pretty = true module = ["typer.*"] ignore_missing_imports = false +# Hypothesis's @given / @settings decorators don't propagate types in a way +# mypy --strict accepts as "typed" — disabling disallow_untyped_decorators +# only inside the test suite keeps strict typing on the runtime modules. 
+[[tool.mypy.overrides]] +module = ["tests.*"] +disallow_untyped_decorators = false + [tool.pytest.ini_options] minversion = "8.0" addopts = [ diff --git a/src/spine_lite/__init__.py b/src/spine_lite/__init__.py index 6d294aa..0c1e08c 100644 --- a/src/spine_lite/__init__.py +++ b/src/spine_lite/__init__.py @@ -1,13 +1,14 @@ """spine-lite: deterministic policy and effects runtime for LLM tool calls. -Phase 1 surface. Subsequent phases extend ``__all__`` as the classifier, -posture machine, and hook ship. See ``CLAUDE.md`` and ``docs/architecture.md``. +Phase 2 surface. Subsequent phases extend ``__all__`` as the classifier and +hook ship. See ``CLAUDE.md`` and ``docs/explanation/architecture.md``. """ from __future__ import annotations import logging +from spine_lite.classifier import Decision, ToolCall, classify from spine_lite.effects import PRECEDENCE, Effect, most_restrictive from spine_lite.exceptions import ( ClassificationError, @@ -16,19 +17,28 @@ PostureError, SpineLiteError, ) +from spine_lite.manifest import Manifest, ToolDefinition, parse_manifest +from spine_lite.posture import Posture -__version__ = "0.1.0a0" +__version__ = "0.2.0a0" __all__ = [ "PRECEDENCE", "ClassificationError", + "Decision", "Effect", "HookError", + "Manifest", "ManifestError", + "Posture", "PostureError", "SpineLiteError", + "ToolCall", + "ToolDefinition", "__version__", + "classify", "most_restrictive", + "parse_manifest", ] logging.getLogger(__name__).addHandler(logging.NullHandler()) diff --git a/src/spine_lite/classifier.py b/src/spine_lite/classifier.py index 257e213..a099696 100644 --- a/src/spine_lite/classifier.py +++ b/src/spine_lite/classifier.py @@ -1,11 +1,113 @@ -"""Effect classifier (Phase 2). +"""Effect classifier. -``classify(tool_call, manifest) -> Decision`` is a pure function that lands -in Phase 2. It returns the set of effects implied by a tool call against a -validated manifest, plus the dominating effect under -:data:`spine_lite.effects.PRECEDENCE`. 
+The :func:`classify` function maps a :class:`ToolCall` and a validated +:class:`spine_lite.manifest.Manifest` to a :class:`Decision`. -Pure module: deterministic, no I/O, no clocks, no randomness. +Pure module: deterministic, no I/O, no clocks, no randomness. Identical +inputs produce identical decisions every time. The decision's +``rationale`` is the only string-formatted field, and it is built from +fields in canonical order so two calls with the same inputs produce the +same byte-for-byte rationale. """ from __future__ import annotations + +from dataclasses import dataclass, field +from typing import TYPE_CHECKING, Any + +from spine_lite.effects import Effect, most_restrictive + +if TYPE_CHECKING: + from spine_lite.manifest import Manifest + + +@dataclass(frozen=True, slots=True, kw_only=True) +class ToolCall: + """A planned tool invocation to classify. + + Attributes: + tool: Tool name as declared in the manifest. + arguments: Free-form key/value arguments. Currently informational + only; future phases may use them to refine classification + beyond the manifest's declared effects. + """ + + tool: str + arguments: dict[str, Any] = field(default_factory=dict) + + +@dataclass(frozen=True, slots=True, kw_only=True) +class Decision: + """The result of classifying a :class:`ToolCall`. + + Attributes: + tool: Echoed from the input call. + effects: The full set of effect classes the call can produce, as a + canonically-ordered tuple (sorted by ``PRECEDENCE``). Tuple + rather than frozenset so equality and serialisation are + byte-stable. + most_restrictive: The dominant effect under + :data:`spine_lite.PRECEDENCE`. Always a member of ``effects``. + rationale: Human-readable explanation of why this effect set was + chosen. Format is canonical so byte-stable across runs. + """ + + tool: str + effects: tuple[Effect, ...] 
+ most_restrictive: Effect + rationale: str + + +def classify(tool_call: ToolCall, manifest: Manifest) -> Decision: + """Classify ``tool_call`` against ``manifest``. + + Args: + tool_call: The planned invocation. + manifest: A validated :class:`Manifest` declaring the tool. + + Returns: + A :class:`Decision` carrying the effect set, the dominant effect, + and a deterministic rationale. + + Raises: + ManifestError: If the tool isn't declared in the manifest. + + Examples: + >>> from spine_lite import Effect, Manifest, ToolDefinition + >>> manifest = Manifest(tools={ + ... "fetch": ToolDefinition( + ... name="fetch", + ... effects=(Effect.NETWORK, Effect.READ), + ... ), + ... }) + >>> decision = classify(ToolCall(tool="fetch"), manifest) + >>> decision.most_restrictive + <Effect.NETWORK: 'network'> + >>> decision.effects + (<Effect.NETWORK: 'network'>, <Effect.READ: 'read'>) + """ + definition = manifest.get(tool_call.tool) + + dominant = most_restrictive(definition.effects) + return Decision( + tool=tool_call.tool, + effects=definition.effects, + most_restrictive=dominant, + rationale=_rationale(tool_call.tool, definition.effects, dominant), + ) + + +def _rationale( + tool: str, + effects: tuple[Effect, ...], + dominant: Effect, +) -> str: + """Format a deterministic rationale string.""" + classes = ", ".join(sorted(e.value for e in effects)) + return ( + f"tool {tool!r} declares effects [{classes}]; " + f"dominant under PRECEDENCE is {dominant.value!r}" + ) + + +__all__ = ["Decision", "ToolCall", "classify"] diff --git a/src/spine_lite/manifest.py b/src/spine_lite/manifest.py index 8a658dd..0540385 100644 --- a/src/spine_lite/manifest.py +++ b/src/spine_lite/manifest.py @@ -1,11 +1,213 @@ -"""Tool-manifest schema (Phase 2). +"""Tool-manifest schema. Pydantic v2 models for tool definitions, declared effects, and posture -constraints land here. The schema must round-trip the TypeScript reference -fixtures byte-for-byte after JSON normalisation. See -``docs/porting-notes.md`` for the source-of-truth schema. +constraints. 
Pure module: validation only, no I/O. -Pure module: validation only, no I/O. +Manifests round-trip authored fixtures byte-for-byte. The two +order-sensitive fields — :attr:`ToolDefinition.effects` and +:attr:`ToolDefinition.permitted_postures` — are canonicalised on +construction (deduplicated and sorted by enum-declaration order) so JSON +serialisation is stable across runs and platforms regardless of the +order the author wrote them in. """ from __future__ import annotations + +from typing import Any, ClassVar, Final + +from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator + +from spine_lite.effects import PRECEDENCE, Effect +from spine_lite.exceptions import ManifestError +from spine_lite.posture import Posture + +_EFFECT_ORDER: Final[dict[Effect, int]] = {e: i for i, e in enumerate(PRECEDENCE)} +_POSTURE_ORDER: Final[dict[Posture, int]] = {p: i for i, p in enumerate(Posture)} + + +def _canonical_effects(effects: tuple[Effect, ...]) -> tuple[Effect, ...]: + """Deduplicate and sort effects by ``PRECEDENCE`` order.""" + seen: set[Effect] = set() + canonical: list[Effect] = [] + for effect in sorted(effects, key=_EFFECT_ORDER.__getitem__): + if effect not in seen: + seen.add(effect) + canonical.append(effect) + return tuple(canonical) + + +def _canonical_postures(postures: tuple[Posture, ...]) -> tuple[Posture, ...]: + """Deduplicate and sort postures by enum declaration order.""" + seen: set[Posture] = set() + canonical: list[Posture] = [] + for posture in sorted(postures, key=_POSTURE_ORDER.__getitem__): + if posture not in seen: + seen.add(posture) + canonical.append(posture) + return tuple(canonical) + + +class ToolDefinition(BaseModel): + """Declares a single tool's effects and posture constraints. + + Attributes: + name: Tool identifier as the LLM sees it. Must match the key under + which this definition is registered in a :class:`Manifest`. + description: Optional human-readable description. 
+ effects: Effect classes this tool's invocations can produce. Must + be non-empty. Stored canonically: deduplicated and sorted by + ``PRECEDENCE`` order. + permitted_postures: Postures under which this tool may be invoked. + ``None`` means no posture constraint (the tool runs under any + posture). When set, must be non-empty. Stored canonically: + deduplicated and sorted by :class:`Posture` declaration order. + require_confirmation: If true, even an otherwise-allowed call must + be confirmed by the operator before execution. Phase 3 + classifier honours this; Phase 2 just stores it. + metadata: Free-form additional metadata. Manifest authors may + carry arbitrary keys here; spine-lite ignores them but + preserves them for round-trip serialisation. + + Examples: + >>> definition = ToolDefinition( + ... name="read_file", + ... effects=(Effect.READ,), + ... ) + >>> definition.effects + (<Effect.READ: 'read'>,) + """ + + model_config: ClassVar[ConfigDict] = ConfigDict( + frozen=True, + extra="forbid", + validate_default=True, + str_strip_whitespace=True, + ) + + name: str = Field(min_length=1) + description: str | None = None + effects: tuple[Effect, ...] = Field(min_length=1) + permitted_postures: tuple[Posture, ...] | None = None + require_confirmation: bool = False + metadata: dict[str, Any] = Field(default_factory=dict) + + @field_validator("effects", mode="after") + @classmethod + def _canonicalise_effects( + cls, + value: tuple[Effect, ...], + ) -> tuple[Effect, ...]: + return _canonical_effects(value) + + @field_validator("permitted_postures", mode="after") + @classmethod + def _canonicalise_postures( + cls, + value: tuple[Posture, ...] | None, + ) -> tuple[Posture, ...] 
| None: + if value is None: + return None + canonical = _canonical_postures(value) + if not canonical: + raise ValueError( + "permitted_postures must be non-empty when set; " + "use null/None to indicate no constraint", + ) + return canonical + + +class Manifest(BaseModel): + """A collection of tool definitions keyed by tool name. + + A manifest is the policy document for a runtime configuration. Every + tool the LLM can call must appear here; calls to undeclared tools + fail closed in the classifier. + + Attributes: + tools: Mapping from tool name to its :class:`ToolDefinition`. + Each definition's ``name`` field must match its key in this + mapping. Empty manifests are permitted (zero tools declared). + + Examples: + >>> manifest = Manifest(tools={ + ... "read_file": ToolDefinition(name="read_file", effects=(Effect.READ,)), + ... }) + >>> manifest.get("read_file").effects + (<Effect.READ: 'read'>,) + """ + + model_config: ClassVar[ConfigDict] = ConfigDict( + frozen=True, + extra="forbid", + validate_default=True, + ) + + tools: dict[str, ToolDefinition] = Field(default_factory=dict) + + @field_validator("tools", mode="after") + @classmethod + def _names_match_keys( + cls, + tools: dict[str, ToolDefinition], + ) -> dict[str, ToolDefinition]: + for key, tool in tools.items(): + if tool.name != key: + raise ValueError( + f"tool name mismatch: key {key!r} does not match definition name {tool.name!r}", + ) + return tools + + def get(self, name: str) -> ToolDefinition: + """Return the definition for ``name``. + + Args: + name: Tool name to look up. + + Returns: + The matching :class:`ToolDefinition`. + + Raises: + ManifestError: If no tool with that name is declared. + """ + try: + return self.tools[name] + except KeyError as exc: + raise ManifestError( + f"tool {name!r} not declared in manifest", + ) from exc + + +def parse_manifest(data: Any) -> Manifest: + """Validate ``data`` as a :class:`Manifest`. 
+ + Wraps pydantic's :class:`pydantic.ValidationError` as + :class:`ManifestError` so callers can catch a single typed exception + rooted at :class:`SpineLiteError`. + + Args: + data: A Python mapping (dict), a JSON string, or JSON bytes. + Strings and bytes are parsed via + :meth:`pydantic.BaseModel.model_validate_json`; everything + else through :meth:`pydantic.BaseModel.model_validate`. + + Returns: + A validated, immutable :class:`Manifest`. + + Raises: + ManifestError: If validation fails for any reason. The original + :class:`pydantic.ValidationError` is attached as ``__cause__``. + + Examples: + >>> parse_manifest({ + ... "tools": { + ... "read_file": {"name": "read_file", "effects": ["read"]}, + ... }, + ... }).get("read_file").effects + (<Effect.READ: 'read'>,) + """ + try: + if isinstance(data, (str, bytes)): + return Manifest.model_validate_json(data) + return Manifest.model_validate(data) + except ValidationError as exc: + raise ManifestError(f"manifest validation failed: {exc}") from exc diff --git a/src/spine_lite/posture.py b/src/spine_lite/posture.py index e856293..2378507 100644 --- a/src/spine_lite/posture.py +++ b/src/spine_lite/posture.py @@ -1,10 +1,31 @@ -"""Posture state machine (Phase 3). +"""Posture state machine. -A pure transition function over a closed ``Posture`` enum. No hidden state. -Every transition is a value-in-value-out function. See -``docs/architecture.md`` for the state diagram. +Phase 2 ships the closed :class:`Posture` enum used by manifest validation. +Phase 3 will add the transition functions (pure value-in-value-out). Pure module: deterministic, no I/O. """ from __future__ import annotations + +from enum import StrEnum + + +class Posture(StrEnum): + """Operational posture of the runtime. + + Drives how the runtime treats ambiguous calls. Closed enum: extending + requires a project-level decision. The members and their string values + are pinned by ``docs/concepts/posture-and-hooks.md``. 
+ + Members: + INTERACTIVE: Operator at the keyboard; ambiguous calls escalate. + AUTONOMOUS: No operator in the loop; ambiguous calls fail closed. + DRY_RUN: Classification only; non-``READ`` effects don't fire. + LOCKED: Refuse everything except explicit allow-listed read-only calls. + """ + + INTERACTIVE = "interactive" + AUTONOMOUS = "autonomous" + DRY_RUN = "dry_run" + LOCKED = "locked" diff --git a/tests/fixtures/decisions_basic.json b/tests/fixtures/decisions_basic.json new file mode 100644 index 0000000..e862b03 --- /dev/null +++ b/tests/fixtures/decisions_basic.json @@ -0,0 +1,59 @@ +{ + "manifest": "manifest_basic.json", + "cases": [ + { + "name": "read-only file read", + "tool": "read_file", + "expected": { + "tool": "read_file", + "effects": ["read"], + "most_restrictive": "read" + } + }, + { + "name": "scoped write", + "tool": "write_file", + "expected": { + "tool": "write_file", + "effects": ["write"], + "most_restrictive": "write" + } + }, + { + "name": "network read collapses to network", + "tool": "fetch_url", + "expected": { + "tool": "fetch_url", + "effects": ["network", "read"], + "most_restrictive": "network" + } + }, + { + "name": "subprocess execute", + "tool": "shell_run", + "expected": { + "tool": "shell_run", + "effects": ["execute"], + "most_restrictive": "execute" + } + }, + { + "name": "fork-and-detach spawn", + "tool": "shell_daemon", + "expected": { + "tool": "shell_daemon", + "effects": ["spawn"], + "most_restrictive": "spawn" + } + }, + { + "name": "destructive dominates network", + "tool": "git_force_push", + "expected": { + "tool": "git_force_push", + "effects": ["destructive", "network"], + "most_restrictive": "destructive" + } + } + ] +} diff --git a/tests/fixtures/manifest_basic.json b/tests/fixtures/manifest_basic.json new file mode 100644 index 0000000..752ceec --- /dev/null +++ b/tests/fixtures/manifest_basic.json @@ -0,0 +1,52 @@ +{ + "tools": { + "read_file": { + "name": "read_file", + "description": "Read a file from 
disk.", + "effects": ["read"], + "permitted_postures": null, + "require_confirmation": false, + "metadata": {} + }, + "write_file": { + "name": "write_file", + "description": "Write content to a file.", + "effects": ["write"], + "permitted_postures": null, + "require_confirmation": false, + "metadata": {} + }, + "fetch_url": { + "name": "fetch_url", + "description": "Fetch a URL.", + "effects": ["network", "read"], + "permitted_postures": null, + "require_confirmation": false, + "metadata": {} + }, + "shell_run": { + "name": "shell_run", + "description": "Run a shell command and wait for completion.", + "effects": ["execute"], + "permitted_postures": ["interactive", "autonomous"], + "require_confirmation": false, + "metadata": {} + }, + "shell_daemon": { + "name": "shell_daemon", + "description": "Spawn a long-running daemon process.", + "effects": ["spawn"], + "permitted_postures": ["interactive"], + "require_confirmation": true, + "metadata": {} + }, + "git_force_push": { + "name": "git_force_push", + "description": "Force push, irreversible if remote was rewritten.", + "effects": ["destructive", "network"], + "permitted_postures": ["interactive"], + "require_confirmation": true, + "metadata": {} + } + } +} diff --git a/tests/fixtures/manifest_full.json b/tests/fixtures/manifest_full.json new file mode 100644 index 0000000..28517a9 --- /dev/null +++ b/tests/fixtures/manifest_full.json @@ -0,0 +1,29 @@ +{ + "tools": { + "all_classes": { + "name": "all_classes", + "description": "Synthetic tool exercising every effect class and posture.", + "effects": [ + "destructive", + "spawn", + "execute", + "network", + "write", + "read" + ], + "permitted_postures": [ + "interactive", + "autonomous", + "dry_run", + "locked" + ], + "require_confirmation": true, + "metadata": { + "category": "synthetic", + "tags": ["coverage", "reference"], + "owner": "spine-lite", + "audit_priority": "high" + } + } + } +} diff --git a/tests/fixtures/manifest_minimal.json 
b/tests/fixtures/manifest_minimal.json new file mode 100644 index 0000000..1bbfdd6 --- /dev/null +++ b/tests/fixtures/manifest_minimal.json @@ -0,0 +1,8 @@ +{ + "tools": { + "read_file": { + "name": "read_file", + "effects": ["read"] + } + } +} diff --git a/tests/test_smoke.py b/tests/test_smoke.py index c8c9971..b5e79b7 100644 --- a/tests/test_smoke.py +++ b/tests/test_smoke.py @@ -9,10 +9,10 @@ def test_package_imports() -> None: assert hasattr(spine_lite, "__version__") -def test_version_is_phase_one_alpha() -> None: +def test_version_is_phase_two_alpha() -> None: import spine_lite - assert spine_lite.__version__ == "0.1.0a0" + assert spine_lite.__version__ == "0.2.0a0" def test_public_surface_excludes_private_names() -> None: diff --git a/tests/unit/test_classifier.py b/tests/unit/test_classifier.py new file mode 100644 index 0000000..7fd1615 --- /dev/null +++ b/tests/unit/test_classifier.py @@ -0,0 +1,376 @@ +"""Tests for the classifier. + +Three layers: + +1. Unit tests covering happy paths, error paths, frozen dataclass + immutability, and the public-API surface. +2. Parametrized parity tests against the authored fixtures in + ``tests/fixtures/``: round-trip JSON byte-stability per manifest, + and case-by-case decision parity for ``manifest_basic.json``. +3. Hypothesis property tests for determinism, dominance, and round-trip + stability — 1,000 examples each. 
+""" + +from __future__ import annotations + +import json +from pathlib import Path + +import pytest +from hypothesis import HealthCheck, given, settings +from hypothesis import strategies as st + +from spine_lite import ( + Decision, + Effect, + Manifest, + ManifestError, + Posture, + ToolCall, + ToolDefinition, + classify, + parse_manifest, +) + +FIXTURES_DIR = Path(__file__).parent.parent / "fixtures" + +_HYPOTHESIS_THOROUGH = settings( + max_examples=1000, + deadline=None, + suppress_health_check=[HealthCheck.too_slow], +) + + +def _manifest(**tools: ToolDefinition) -> Manifest: + return Manifest(tools=dict(tools)) + + +# ---------- happy path ---------- + + +def test_classify_returns_declared_effects() -> None: + manifest = _manifest( + read_file=ToolDefinition(name="read_file", effects=(Effect.READ,)), + ) + decision = classify(ToolCall(tool="read_file"), manifest) + + assert decision.tool == "read_file" + assert decision.effects == (Effect.READ,) + assert decision.most_restrictive is Effect.READ + + +def test_classify_collapses_to_dominant_effect() -> None: + manifest = _manifest( + fetch=ToolDefinition( + name="fetch", + effects=(Effect.NETWORK, Effect.READ), + ), + ) + decision = classify(ToolCall(tool="fetch"), manifest) + + assert decision.most_restrictive is Effect.NETWORK + assert set(decision.effects) == {Effect.NETWORK, Effect.READ} + + +def test_classify_destructive_dominates() -> None: + manifest = _manifest( + nuke=ToolDefinition( + name="nuke", + effects=( + Effect.READ, + Effect.WRITE, + Effect.NETWORK, + Effect.DESTRUCTIVE, + ), + ), + ) + decision = classify(ToolCall(tool="nuke"), manifest) + assert decision.most_restrictive is Effect.DESTRUCTIVE + + +def test_classify_returns_canonical_effect_order() -> None: + """Decision.effects always uses PRECEDENCE order, not author order.""" + manifest = _manifest( + t=ToolDefinition( + name="t", + effects=(Effect.READ, Effect.NETWORK, Effect.DESTRUCTIVE), + ), + ) + decision = 
classify(ToolCall(tool="t"), manifest) + assert decision.effects == (Effect.DESTRUCTIVE, Effect.NETWORK, Effect.READ) + + +def test_classify_rationale_is_human_readable() -> None: + manifest = _manifest( + t=ToolDefinition(name="t", effects=(Effect.NETWORK, Effect.READ)), + ) + decision = classify(ToolCall(tool="t"), manifest) + + assert "'t'" in decision.rationale + assert "network" in decision.rationale + assert "read" in decision.rationale + + +def test_classify_rationale_is_byte_stable() -> None: + """Same inputs produce identical rationale strings.""" + manifest = _manifest( + t=ToolDefinition(name="t", effects=(Effect.WRITE, Effect.READ)), + ) + a = classify(ToolCall(tool="t"), manifest).rationale + b = classify(ToolCall(tool="t"), manifest).rationale + assert a == b + + +def test_classify_ignores_arguments_in_phase_2() -> None: + """Phase 2 classifier doesn't refine on arguments; manifest is the spec.""" + manifest = _manifest( + t=ToolDefinition(name="t", effects=(Effect.READ,)), + ) + a = classify(ToolCall(tool="t", arguments={}), manifest) + b = classify(ToolCall(tool="t", arguments={"path": "/etc/passwd"}), manifest) + assert a.effects == b.effects + assert a.most_restrictive == b.most_restrictive + + +def test_classify_with_posture_constrained_tool() -> None: + """Permitted_postures is stored on the definition; Phase 2 doesn't gate on it.""" + manifest = _manifest( + write_file=ToolDefinition( + name="write_file", + effects=(Effect.WRITE,), + permitted_postures=(Posture.INTERACTIVE, Posture.AUTONOMOUS), + ), + ) + decision = classify(ToolCall(tool="write_file"), manifest) + assert decision.most_restrictive is Effect.WRITE + + +# ---------- error paths ---------- + + +def test_classify_raises_manifest_error_for_undeclared_tool() -> None: + manifest = Manifest(tools={}) + with pytest.raises(ManifestError, match="not declared"): + classify(ToolCall(tool="ghost"), manifest) + + +def test_classify_undeclared_tool_carries_name_in_message() -> None: + manifest 
= Manifest(tools={}) + with pytest.raises(ManifestError) as exc_info: + classify(ToolCall(tool="missing_tool"), manifest) + assert "missing_tool" in str(exc_info.value) + + +# ---------- determinism ---------- + + +def test_classify_is_deterministic_within_one_call() -> None: + manifest = _manifest( + t=ToolDefinition(name="t", effects=(Effect.SPAWN, Effect.NETWORK)), + ) + call = ToolCall(tool="t") + decisions = [classify(call, manifest) for _ in range(10)] + assert all(d == decisions[0] for d in decisions) + + +def test_decision_is_frozen() -> None: + decision = Decision( + tool="t", + effects=(Effect.READ,), + most_restrictive=Effect.READ, + rationale="example", + ) + with pytest.raises(AttributeError): + decision.tool = "u" # type: ignore[misc] + + +def test_tool_call_is_frozen() -> None: + call = ToolCall(tool="t") + with pytest.raises(AttributeError): + call.tool = "u" # type: ignore[misc] + + +# ---------- public API ---------- + + +def test_decision_classify_toolcall_in_public_api() -> None: + import spine_lite + + for name in ("Decision", "ToolCall", "classify"): + assert name in spine_lite.__all__ + + +# ---------- parity tests against authored fixtures ---------- + + +_MANIFEST_FIXTURES = ( + "manifest_minimal.json", + "manifest_basic.json", + "manifest_full.json", +) + + +@pytest.mark.parametrize("fixture", _MANIFEST_FIXTURES) +def test_manifest_fixture_loads_cleanly(fixture: str) -> None: + raw = (FIXTURES_DIR / fixture).read_text() + manifest = parse_manifest(raw) + assert isinstance(manifest, Manifest) + + +@pytest.mark.parametrize("fixture", _MANIFEST_FIXTURES) +def test_manifest_fixture_round_trip_byte_stable(fixture: str) -> None: + """parse → dump → parse → dump produces identical bytes the second time.""" + raw = (FIXTURES_DIR / fixture).read_text() + parsed = parse_manifest(raw) + dumped_once = parsed.model_dump_json() + re_parsed = parse_manifest(dumped_once) + dumped_twice = re_parsed.model_dump_json() + assert dumped_once == dumped_twice + 
assert parsed == re_parsed + + +def _load_decision_cases() -> list[dict[str, object]]: + payload = json.loads((FIXTURES_DIR / "decisions_basic.json").read_text()) + cases: list[dict[str, object]] = payload["cases"] + return cases + + +@pytest.fixture(scope="module") +def basic_manifest() -> Manifest: + return parse_manifest((FIXTURES_DIR / "manifest_basic.json").read_text()) + + +@pytest.mark.parametrize( + "case", + _load_decision_cases(), + ids=lambda c: str(c["name"]), +) +def test_decision_parity_against_fixture( + case: dict[str, object], + basic_manifest: Manifest, +) -> None: + expected = case["expected"] + assert isinstance(expected, dict) + + decision = classify(ToolCall(tool=str(case["tool"])), basic_manifest) + + assert decision.tool == expected["tool"] + expected_effects = tuple(Effect(e) for e in expected["effects"]) + assert decision.effects == expected_effects + assert decision.most_restrictive == Effect(str(expected["most_restrictive"])) + + +# ---------- hypothesis property tests ---------- + + +_NAME_STRATEGY = st.text( + alphabet=st.characters(min_codepoint=ord("a"), max_codepoint=ord("z")), + min_size=1, + max_size=15, +) + +_EFFECTS_STRATEGY = st.lists( + st.sampled_from(list(Effect)), + min_size=1, + max_size=6, +).map(tuple) + +_POSTURES_STRATEGY = st.one_of( + st.none(), + st.lists( + st.sampled_from(list(Posture)), + min_size=1, + max_size=4, + ).map(tuple), +) + + +@st.composite +def _tool_definition_strategy(draw: st.DrawFn, name: str) -> ToolDefinition: + return ToolDefinition( + name=name, + description=draw(st.one_of(st.none(), st.text(max_size=30))), + effects=draw(_EFFECTS_STRATEGY), + permitted_postures=draw(_POSTURES_STRATEGY), + require_confirmation=draw(st.booleans()), + ) + + +@st.composite +def _manifest_strategy(draw: st.DrawFn) -> Manifest: + names = draw(st.lists(_NAME_STRATEGY, min_size=1, max_size=5, unique=True)) + tools = {name: draw(_tool_definition_strategy(name=name)) for name in names} + return Manifest(tools=tools) 
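Reviewer note: `_EFFECTS_STRATEGY` deliberately generates effect lists with duplicates and arbitrary order, so the property tests lean on `ToolDefinition` canonicalising its input. The sketch below is a stdlib-only illustration of that idea, not the code in `spine_lite.manifest`. The `Effect` class here is a hypothetical stand-in whose declaration order is assumed from the orderings the tests in this diff expect, and `canonicalise` and `dump_effects` are illustrative names.

```python
import json
from enum import Enum


class Effect(str, Enum):
    """Hypothetical stand-in for spine_lite.Effect.

    Declaration order is an assumption inferred from the expected tuples
    in the tests (most restrictive first).
    """

    DESTRUCTIVE = "destructive"
    SPAWN = "spawn"
    EXECUTE = "execute"
    NETWORK = "network"
    WRITE = "write"
    READ = "read"


def canonicalise(effects):
    """Deduplicate, then sort by enum declaration order."""
    rank = {member: index for index, member in enumerate(Effect)}
    return tuple(sorted(set(effects), key=rank.__getitem__))


def dump_effects(effects):
    """Serialise the canonical form; equal effect sets give equal strings."""
    return json.dumps([effect.value for effect in canonicalise(effects)])


authored = [Effect.READ, Effect.DESTRUCTIVE, Effect.WRITE, Effect.READ]
shuffled = [Effect.WRITE, Effect.READ, Effect.DESTRUCTIVE]
canonical = canonicalise(authored)
```

Because the canonical form depends only on the set of effects, two manifests that declare the same effects in different orders serialise identically, which is the property the round-trip tests check against the real schema.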
+ + +@_HYPOTHESIS_THOROUGH +@given(manifest=_manifest_strategy()) +def test_classify_is_deterministic_property(manifest: Manifest) -> None: + """classify(call, manifest) returns the same Decision on every call.""" + for tool_name in manifest.tools: + call = ToolCall(tool=tool_name) + first = classify(call, manifest) + second = classify(call, manifest) + assert first == second + + +@_HYPOTHESIS_THOROUGH +@given(manifest=_manifest_strategy()) +def test_classify_dominant_is_in_effects_property(manifest: Manifest) -> None: + """The Decision's most_restrictive is always a member of its effects.""" + for tool_name in manifest.tools: + decision = classify(ToolCall(tool=tool_name), manifest) + assert decision.most_restrictive in decision.effects + + +@_HYPOTHESIS_THOROUGH +@given(manifest=_manifest_strategy()) +def test_classify_effects_match_manifest_definition(manifest: Manifest) -> None: + """The Decision's effects are exactly the manifest's declared effects.""" + for tool_name, definition in manifest.tools.items(): + decision = classify(ToolCall(tool=tool_name), manifest) + assert decision.effects == definition.effects + + +@_HYPOTHESIS_THOROUGH +@given(manifest=_manifest_strategy()) +def test_classify_rationale_is_byte_stable_property(manifest: Manifest) -> None: + """Identical inputs produce byte-identical rationale strings.""" + for tool_name in manifest.tools: + call = ToolCall(tool=tool_name) + a = classify(call, manifest).rationale + b = classify(call, manifest).rationale + assert a == b + + +@_HYPOTHESIS_THOROUGH +@given(manifest=_manifest_strategy()) +def test_classify_stable_across_manifest_round_trip(manifest: Manifest) -> None: + """Manifest → JSON → Manifest produces identical decisions for every tool.""" + re_parsed = parse_manifest(manifest.model_dump_json()) + for tool_name in manifest.tools: + call = ToolCall(tool=tool_name) + original = classify(call, manifest) + replayed = classify(call, re_parsed) + assert original == replayed + + 
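Reviewer note: the determinism and dominance properties above can be read against a minimal model of the classifier. This is a sketch under stated assumptions rather than the implementation in `spine_lite.classifier`: the manifest is reduced to a plain dict, `Effect`, `PRECEDENCE`, `Decision`, and `classify` are hypothetical stand-ins, and the precedence order is inferred from the fixture expectations in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Effect(str, Enum):
    """Hypothetical stand-in; order assumed from the test fixtures."""

    DESTRUCTIVE = "destructive"
    SPAWN = "spawn"
    EXECUTE = "execute"
    NETWORK = "network"
    WRITE = "write"
    READ = "read"


# Assumed precedence: most restrictive first, same as declaration order.
PRECEDENCE = tuple(Effect)


@dataclass(frozen=True)
class Decision:
    tool: str
    effects: tuple[Effect, ...]
    most_restrictive: Effect


def classify(tool: str, manifest: dict[str, list[Effect]]) -> Decision:
    """Pure lookup: no I/O, no hidden state, arguments play no role."""
    if tool not in manifest:
        raise KeyError(f"tool {tool!r} is not declared")
    declared = set(manifest[tool])
    if not declared:
        raise ValueError(f"tool {tool!r} declares no effects")
    # Walking PRECEDENCE yields the canonical order and puts the
    # dominant effect first.
    effects = tuple(effect for effect in PRECEDENCE if effect in declared)
    return Decision(tool=tool, effects=effects, most_restrictive=effects[0])


manifest = {
    "fetch_url": [Effect.READ, Effect.NETWORK],
    "git_force_push": [Effect.NETWORK, Effect.DESTRUCTIVE],
}
first = classify("fetch_url", manifest)
second = classify("fetch_url", manifest)
```

In this model `most_restrictive` is always the first element of `effects`, so it is a member of `effects` by construction, and repeated calls on the same inputs return equal frozen values.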
+@_HYPOTHESIS_THOROUGH +@given( + manifest=_manifest_strategy(), + arg_payload=st.dictionaries( + st.text(min_size=1, max_size=10), + st.text(max_size=20), + max_size=5, + ), +) +def test_classify_ignores_arguments_property( + manifest: Manifest, + arg_payload: dict[str, str], +) -> None: + """Phase 2: arguments are stored on ToolCall but don't influence classification.""" + for tool_name in manifest.tools: + no_args = classify(ToolCall(tool=tool_name), manifest) + with_args = classify(ToolCall(tool=tool_name, arguments=arg_payload), manifest) + assert no_args.effects == with_args.effects + assert no_args.most_restrictive == with_args.most_restrictive diff --git a/tests/unit/test_manifest.py b/tests/unit/test_manifest.py new file mode 100644 index 0000000..78eb268 --- /dev/null +++ b/tests/unit/test_manifest.py @@ -0,0 +1,250 @@ +"""Tests for the pydantic v2 manifest schema.""" + +from __future__ import annotations + +import pytest + +from spine_lite import Effect, ManifestError, Posture, parse_manifest +from spine_lite.manifest import Manifest, ToolDefinition + +# ---------- ToolDefinition ---------- + + +def test_tool_definition_minimal() -> None: + tool = ToolDefinition(name="read_file", effects=(Effect.READ,)) + assert tool.name == "read_file" + assert tool.description is None + assert tool.effects == (Effect.READ,) + assert tool.permitted_postures is None + assert tool.require_confirmation is False + assert tool.metadata == {} + + +def test_tool_definition_full() -> None: + tool = ToolDefinition( + name="git_force_push", + description="Force push, dangerous", + effects=(Effect.NETWORK, Effect.DESTRUCTIVE), + permitted_postures=(Posture.INTERACTIVE,), + require_confirmation=True, + metadata={"owner": "ops"}, + ) + assert tool.description == "Force push, dangerous" + assert tool.permitted_postures == (Posture.INTERACTIVE,) + assert tool.require_confirmation is True + assert tool.metadata == {"owner": "ops"} + + +def test_tool_definition_canonicalises_effects() 
-> None: + """Effects are deduplicated and sorted by PRECEDENCE order.""" + tool = ToolDefinition( + name="t", + effects=(Effect.READ, Effect.DESTRUCTIVE, Effect.WRITE, Effect.READ), + ) + assert tool.effects == (Effect.DESTRUCTIVE, Effect.WRITE, Effect.READ) + + +def test_tool_definition_canonicalises_postures() -> None: + """Postures are deduplicated and sorted by enum declaration order.""" + tool = ToolDefinition( + name="t", + effects=(Effect.READ,), + permitted_postures=( + Posture.LOCKED, + Posture.INTERACTIVE, + Posture.AUTONOMOUS, + Posture.INTERACTIVE, + ), + ) + assert tool.permitted_postures == ( + Posture.INTERACTIVE, + Posture.AUTONOMOUS, + Posture.LOCKED, + ) + + +def test_tool_definition_rejects_empty_effects() -> None: + with pytest.raises(ManifestError): + parse_manifest({"tools": {"t": {"name": "t", "effects": []}}}) + + +def test_tool_definition_rejects_unknown_effect() -> None: + with pytest.raises(ManifestError): + parse_manifest({"tools": {"t": {"name": "t", "effects": ["telepath"]}}}) + + +def test_tool_definition_rejects_extra_field() -> None: + with pytest.raises(ManifestError): + parse_manifest( + {"tools": {"t": {"name": "t", "effects": ["read"], "extra": 1}}}, + ) + + +def test_tool_definition_is_frozen() -> None: + from pydantic import ValidationError + + tool = ToolDefinition(name="t", effects=(Effect.READ,)) + with pytest.raises(ValidationError): + tool.name = "u" + + +# ---------- Manifest ---------- + + +def test_manifest_empty_is_valid() -> None: + manifest = Manifest(tools={}) + assert manifest.tools == {} + + +def test_manifest_get_hit() -> None: + tool = ToolDefinition(name="read_file", effects=(Effect.READ,)) + manifest = Manifest(tools={"read_file": tool}) + assert manifest.get("read_file") is tool + + +def test_manifest_get_miss_raises_manifest_error() -> None: + manifest = Manifest(tools={}) + with pytest.raises(ManifestError, match="not declared"): + manifest.get("does_not_exist") + + +def 
test_manifest_name_key_mismatch_raises() -> None: + with pytest.raises(ManifestError, match="tool name mismatch"): + parse_manifest( + {"tools": {"a": {"name": "b", "effects": ["read"]}}}, + ) + + +def test_manifest_unknown_posture_rejected() -> None: + with pytest.raises(ManifestError): + parse_manifest( + { + "tools": { + "t": { + "name": "t", + "effects": ["read"], + "permitted_postures": ["paranoid"], + }, + }, + }, + ) + + +def test_manifest_empty_postures_list_rejected() -> None: + with pytest.raises(ManifestError, match="non-empty"): + parse_manifest( + { + "tools": { + "t": { + "name": "t", + "effects": ["read"], + "permitted_postures": [], + }, + }, + }, + ) + + +# ---------- parse_manifest ---------- + + +def test_parse_manifest_from_dict() -> None: + manifest = parse_manifest( + { + "tools": { + "read_file": {"name": "read_file", "effects": ["read"]}, + "write_file": {"name": "write_file", "effects": ["write"]}, + }, + }, + ) + assert set(manifest.tools) == {"read_file", "write_file"} + + +def test_parse_manifest_from_json_string() -> None: + manifest = parse_manifest( + '{"tools": {"r": {"name": "r", "effects": ["read"]}}}', + ) + assert manifest.get("r").effects == (Effect.READ,) + + +def test_parse_manifest_from_json_bytes() -> None: + manifest = parse_manifest( + b'{"tools": {"r": {"name": "r", "effects": ["read"]}}}', + ) + assert manifest.get("r").effects == (Effect.READ,) + + +def test_parse_manifest_invalid_json_string_raises() -> None: + with pytest.raises(ManifestError): + parse_manifest("{not json}") + + +def test_parse_manifest_attaches_validation_error_as_cause() -> None: + with pytest.raises(ManifestError) as exc_info: + parse_manifest({"tools": {"t": {"name": "t", "effects": "not-a-list"}}}) + assert exc_info.value.__cause__ is not None + + +# ---------- round-trip ---------- + + +def test_manifest_round_trip_preserves_canonical_form() -> None: + """Authored unsorted; parsed canonical; re-dumped; re-parsed; equal.""" + raw = { + "tools": { 
+ "shell_run": { + "name": "shell_run", + "description": "Arbitrary shell command", + "effects": ["spawn", "execute", "network"], + "permitted_postures": ["locked", "interactive"], + "require_confirmation": False, + "metadata": {"category": "shell"}, + }, + }, + } + + parsed = parse_manifest(raw) + dumped = parsed.model_dump(mode="json") + re_parsed = parse_manifest(dumped) + + assert parsed == re_parsed + assert parsed.get("shell_run").effects == ( + Effect.SPAWN, + Effect.EXECUTE, + Effect.NETWORK, + ) + assert parsed.get("shell_run").permitted_postures == ( + Posture.INTERACTIVE, + Posture.LOCKED, + ) + + +def test_manifest_round_trip_via_json_is_byte_stable() -> None: + """Same inputs produce byte-identical JSON output across calls.""" + data = { + "tools": { + "t": { + "name": "t", + "effects": ["destructive", "read"], + "permitted_postures": ["locked", "interactive"], + }, + }, + } + a = parse_manifest(data).model_dump_json() + b = parse_manifest(data).model_dump_json() + assert a == b + + +def test_manifest_metadata_round_trips_unchanged() -> None: + """Free-form metadata survives parse/dump.""" + data = { + "tools": { + "t": { + "name": "t", + "effects": ["read"], + "metadata": {"owner": "ops", "tags": ["audit", "ro"]}, + }, + }, + } + parsed = parse_manifest(data) + assert parsed.get("t").metadata == {"owner": "ops", "tags": ["audit", "ro"]} diff --git a/tests/unit/test_posture.py b/tests/unit/test_posture.py new file mode 100644 index 0000000..db83d48 --- /dev/null +++ b/tests/unit/test_posture.py @@ -0,0 +1,56 @@ +"""Tests for the closed Posture enum.""" + +from __future__ import annotations + +import pytest + +from spine_lite.posture import Posture + + +def test_posture_has_exactly_four_members() -> None: + assert len(Posture) == 4 + + +def test_posture_member_names() -> None: + assert {p.name for p in Posture} == { + "INTERACTIVE", + "AUTONOMOUS", + "DRY_RUN", + "LOCKED", + } + + +@pytest.mark.parametrize( + ("member", "value"), + [ + 
(Posture.INTERACTIVE, "interactive"), + (Posture.AUTONOMOUS, "autonomous"), + (Posture.DRY_RUN, "dry_run"), + (Posture.LOCKED, "locked"), + ], +) +def test_posture_string_values_are_pinned(member: Posture, value: str) -> None: + assert member.value == value + assert member == value + + +def test_posture_is_str_subclass() -> None: + assert isinstance(Posture.INTERACTIVE, str) + assert str(Posture.AUTONOMOUS) == "autonomous" + + +def test_posture_unknown_value_raises() -> None: + with pytest.raises(ValueError, match="not a valid Posture"): + Posture("escalated") + + +def test_posture_round_trip_through_value() -> None: + for member in Posture: + assert Posture(member.value) is member + + +def test_posture_is_in_public_api() -> None: + import spine_lite + + assert "Posture" in spine_lite.__all__ + assert spine_lite.Posture is Posture diff --git a/uv.lock b/uv.lock index 93c445e..d4c0a5e 100644 --- a/uv.lock +++ b/uv.lock @@ -1210,7 +1210,7 @@ wheels = [ [[package]] name = "spine-lite" -version = "0.1.0a0" +version = "0.2.0a0" source = { editable = "." } dependencies = [ { name = "pydantic" },