feat(common,server,cli): canonical v1 filter hash #353
Merged
Stacked on #351. Implements ADR-0002 (`docs/adr/0002-canonical-v1-hash.md`): a schema-independent canonical TOML hash that is stable across future `FilterConfig` schema additions. v1 is the long-term identity for every filter going forward; #351's `content_hash` (recomputed canonical_hash) is retained as the transition fallback.

## What v1 does

`canonical_v1::hash(toml_str)` returns `v1:<sha256-hex>` over a canonicalised byte stream:

1. Parse the TOML into a `toml::Value` tree.
2. Sort arrays whose paths are in the unordered policy table (`skip`, `keep`, `[on_success].skip`, `[on_failure].skip`).
3. Collapse `command = ["x"]` to `command = "x"` (single-entry only).
4. Recursively prune entries equal to `false`, `[]`, or `{}`.
5. Re-emit via `toml::to_string`.
6. SHA-256 the bytes.

The canonicaliser is schema-decoupled: it operates on `toml::Value`, not `FilterConfig`. New fields added to `FilterConfig` later don't appear in the canonical bytes unless the user writes them in their TOML. Filters that don't reference new fields keep the same v1 hash forever.

## Server side

`DownloadPayload` gains a `v1_hash: String` field, computed alongside `content_hash` on every download.

## Client side

`DownloadedFilter` gains an optional `v1_hash` field. Install flow's `verify_and_resolve_hash` now has a three-tier preference: v1 first, content_hash second, URL hash as last-resort fallback. Each tier recomputes locally and verifies against the server's claim, so wire-tamper detection is preserved.

## Test corpus

Snapshot test: every `.toml` under `crates/tokf-cli/filters/` (51 real-world stdlib filters) has its v1 hash recorded in `tests/canonical_v1_stdlib.txt`. CI fails if any drifts. Authoring helper (`dump_stdlib_hashes`, `#[ignore]`d) regenerates the file when new stdlib filters are added.

Property tests on a representative subset:

- `invariant_toml_roundtrip_idempotent`: toml::from_str → to_string → hash is stable.
- `invariant_leading_comments_and_blanks`: file-level comments and blank-line padding don't change the hash.
- `invariant_skip_keep_reversed_via_ast`: reversing unordered array values via AST mutation doesn't change the hash.
- `invariant_default_false_added`: adding `dedup = false` to a filter that doesn't mention it doesn't change the hash.
- `distinguishing_value_change`: sanity check that a real value change DOES change the hash.

24 unit tests in canonical_v1 itself cover every spec rule (output format, BTreeMap label invariance, unknown-fields-dropped, default-omission edge cases such as 0-integer and `""` preserved, `true` bool preserved, and empty-string distinguished from absent, plus all error paths).

## Dependency change

`toml` is promoted from optional (under `validation` feature) to a regular dep on `tokf-common`, pinned to exactly `=1.0.3`. The pin is load-bearing: v1's canonical bytes depend on the toml crate's emission; any version bump must be gated by the corpus, with a v2 trigger prepared for cases where emission genuinely changes.

## Test plan

- [x] `cargo fmt --all -- --check` clean
- [x] `cargo clippy --workspace --all-targets -- -D warnings` clean
- [x] `cargo test --workspace`: 2216 passed (+30 over baseline)
- [x] All 51 stdlib filter v1 hashes recorded; corpus round-trip green
- [x] Property tests green on representative stdlib subset

## Closes / refs

Refs #350 (the strategic fix; #351 stays as the immediate stopgap).

Follow-up work (separate PRs):

- Server backfill endpoint to populate v1 aliases for existing filters.
- Filter table dedup migration: collapse rows with the same v1 hash, unifying split statistics. Behind a 0.x-track deprecation announcement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
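The normalisation passes listed under "What v1 does" can be illustrated with a small std-only sketch. It is not the code in `canonical_v1.rs`: the `Value` enum here is a hypothetical stand-in for `toml::Value` (floats and datetimes omitted), the names `canonicalise`, `is_prunable`, and `UNORDERED_PATHS` are made up for illustration, and the sketch assumes the `command` collapse applies to the top-level key only. It stops at the canonical tree; the real function then re-emits via `toml::to_string` and applies SHA-256.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for `toml::Value` (floats and datetimes omitted).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Bool(bool),
    Integer(i64),
    String(String),
    Array(Vec<Value>),
    Table(BTreeMap<String, Value>),
}

/// Dotted paths whose arrays are order-insensitive, per the policy table.
const UNORDERED_PATHS: [&str; 4] = ["skip", "keep", "on_success.skip", "on_failure.skip"];

/// Values the prune pass drops: `false`, `[]`, `{}`. `0`, `""` and `true` survive.
fn is_prunable(value: &Value) -> bool {
    match value {
        Value::Bool(flag) => !*flag,
        Value::Array(items) => items.is_empty(),
        Value::Table(entries) => entries.is_empty(),
        Value::Integer(_) | Value::String(_) => false,
    }
}

/// Applies sort, collapse, and prune to one node; `path` is its dotted key path.
fn canonicalise(value: Value, path: &str) -> Value {
    match value {
        Value::Table(entries) => {
            let mut out = BTreeMap::new();
            for (key, child) in entries {
                let child_path = if path.is_empty() {
                    key.clone()
                } else {
                    format!("{path}.{key}")
                };
                // Children are canonicalised first, so a table that becomes
                // empty after pruning is itself pruned.
                let child = canonicalise(child, &child_path);
                if !is_prunable(&child) {
                    out.insert(key, child);
                }
            }
            Value::Table(out)
        }
        Value::Array(items) => {
            let mut items: Vec<Value> = items
                .into_iter()
                .map(|item| canonicalise(item, path))
                .collect();
            if UNORDERED_PATHS.contains(&path) {
                items.sort();
            }
            // Assumption: the single-entry collapse applies to top-level `command` only.
            if path == "command" && items.len() == 1 {
                return items.remove(0);
            }
            Value::Array(items)
        }
        scalar => scalar,
    }
}
```

Calling `canonicalise(tree, "")` on two trees that differ only in `skip` order, an explicit `dedup = false`, or `command = ["x"]` versus `command = "x"` should yield equal values, which is the property the invariant tests above check on the real hash.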
## Summary

Persists the schema-independent canonical TOML hash (`v1_hash`, ADR-0002) on the `filters` table and ships an operator-only backfill endpoint.

## What changes

- **Migration** `20260428000000_add_v1_hash.sql`: adds nullable `v1_hash TEXT` + non-partial index. Nullable on purpose: existing rows backfill operationally; `NOT NULL` and `UNIQUE` are deferred to the dedup PR.
- **Publish** computes v1 alongside `content_hash` in `validate_and_prepare`. Both publish paths (regular + stdlib) now store v1 on insert.
- **v1-collision rejection**: when a new submission has the same `v1_hash` as an existing row but a different `content_hash`, the request returns 200 OK with the *existing* row's author and content_hash, without inserting. Stops new publishes from re-splitting canonically equivalent filters across `content_hash` variants. This is the going-forward fix for #350.
- **Backfill endpoint** `POST /api/filters/backfill-v1-hashes`: service-token-protected, scan-and-compute. Iterates `WHERE v1_hash IS NULL` rows, fetches the TOML from R2, computes v1, writes it back. Per-row failures (missing R2 object, malformed TOML) surface in `failed[]` without aborting the batch.

## Out of scope (explicit follow-ups)

- Dedup migration for existing duplicate rows
- `UNIQUE(v1_hash)` constraint
- `v1_hash NOT NULL` tightening

These all belong to the next PR after backfill has been run in production. Per-row decisions for that work were captured during planning: canonical row = oldest, stdlib breaks ties; non-canonical rows kept and pointed at the canonical row via the existing `successor_hash` column (no URLs broken).

## Operator runbook (for after deploy)

```sh
# Repeat batches of 100 until a batch reports zero processed rows.
while true; do
  out=$(curl -sS -X POST -H "Authorization: Bearer $TOKF_SERVICE_TOKEN" \
    "$SERVER/api/filters/backfill-v1-hashes" -d '{"limit":100}')
  echo "$out"
  [ "$(echo "$out" | jq .processed)" -eq 0 ] && break
done
```

Inspect any persistent entries in `failed[]` for manual triage (corrupt TOML in R2, missing object).

## Test plan

- [x] Migration applies cleanly to a fresh CockroachDB (`just db-reset && cargo sqlx migrate run --source crates/tokf-server/migrations`)
- [x] Publish v1 storage + collision rejection (DB integration tests, fixture genuinely triggers v1-collision branch via `command = "x"` vs `["x"]`)
- [x] Backfill: populates NULL rows, idempotent, respects limit, caps at MAX, requires service token, rejects invalid bearer, reports failure for missing R2 object, reports failure for unparseable TOML
- [x] Stdlib publish stores v1 (separate INSERT path)
- [x] `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo dupes` all clean
- [x] Full server suite: 268 passed (was 266 before, +2 new tests)

## Stack

- #353 ← (parent: `feat/350-canonical-v1`)
- #351 (foundation: `feat/350-install-hash-back-compat`)
- main

This PR targets `feat/350-canonical-v1`. CI will run once #353 merges to main and this PR's base auto-rebases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
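As an illustration of the v1-collision rejection rule described under "What changes", here is a minimal in-memory sketch. The real check runs against the `filters` table; `FilterRow` and `find_v1_collision` are hypothetical names, and the row carries only the columns the rule mentions.

```rust
/// Hypothetical row shape; the real `filters` table has more columns.
#[derive(Debug, Clone, PartialEq)]
struct FilterRow {
    content_hash: String,
    /// Nullable in the migration, so modelled as `Option`.
    v1_hash: Option<String>,
    author: String,
}

/// Returns the existing row that should be reported back instead of inserting:
/// same `v1_hash` as the submission, but a different `content_hash`.
fn find_v1_collision<'a>(
    rows: &'a [FilterRow],
    content_hash: &str,
    v1_hash: &str,
) -> Option<&'a FilterRow> {
    rows.iter().find(|row| {
        // Rows whose v1_hash is still NULL can never match.
        row.v1_hash.as_deref() == Some(v1_hash) && row.content_hash != content_hash
    })
}
```

In this sketch a row with a NULL `v1_hash` is invisible to the check, which suggests why running the backfill matters before the rule can catch equivalents of older filters.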
Stacked on #351. This PR can only merge after #351 lands; if reviewing, it might be easier to view just this PR's commits via the diff against `feat/350-install-hash-back-compat`.

Implements ADR-0002, a schema-independent canonical TOML hash that's stable across future `FilterConfig` schema additions. v1 is the long-term identity for every filter going forward; #351's `content_hash` (schema-tied recompute) is retained as a transition fallback.

## Summary
- `crates/tokf-common/src/canonical_v1.rs`: `canonical_v1::hash(toml_str) -> Result<String, _>`. Returns `v1:<sha256-hex>`. Three normalisation passes on a `toml::Value` tree (sort unordered arrays, collapse `command` single-form, prune `false`/`[]`/`{}` defaults), then re-emit via `toml::to_string`, then SHA-256.
- Schema-decoupled: operates on `toml::Value`, not `FilterConfig`. Adding fields to `FilterConfig` doesn't change v1 hashes for filters that don't write those fields in their TOML.
- Server: `DownloadPayload` gains `v1_hash: String`, computed on every download alongside `content_hash`.
- Client: `DownloadedFilter` gains optional `v1_hash`. `verify_and_resolve_hash` now prefers v1 → content_hash → URL hash, recomputing locally at each tier so wire-tamper detection is preserved.
- Test corpus: all 51 stdlib filters (every `.toml` under `crates/tokf-cli/filters/`) have their v1 hashes recorded in `tests/canonical_v1_stdlib.txt`. CI fails if any drift. An `#[ignore]`d authoring helper rebuilds the file when new stdlib filters are added.
- Dependency: `toml` dep promoted to regular and pinned exactly to `=1.0.3`. v1's bytes depend on the toml crate's emission; the pin is load-bearing per the ADR's stability clause.

## ADR
The full specification is in `docs/adr/0002-canonical-v1-hash.md` (status: Accepted). It locks down the algorithm, the unordered-paths policy table, the `command` special-form collapse, the default-omission rules, and the v2 conditions (canonicaliser bug, toml-crate emission change we can't defer, or an existing policy entry needing a different policy; adding new entries for new fields is part of v1's compat clause).

## Test plan
- [x] `cargo fmt --all -- --check` clean
- [x] `cargo clippy --workspace --all-targets -- -D warnings` clean
- [x] `cargo test --workspace`: 2216 passed (+30 over baseline)
- [x] 24 unit tests in `canonical_v1` covering every spec rule
- [x] `download_returns_toml_content` test asserts the new `v1_hash` field is `v1:<64-hex>`
- [x] `verify_and_resolve_hash` tests cover the v1-preferred path, content_hash fallback, URL fallback, and every error case

## Follow-ups (separate PRs, after this lands)
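- Server backfill to populate v1 aliases in `filter_hashes` (or equivalent) so old URLs route to their v1 hash.
- A v2 hash only if one of the ADR's conditions is met: canonicaliser bug, `toml` crate emission change we can't defer, or existing policy entry needing a different policy.

## Closes / refs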
Refs #350. Does not close — the migration / dedup work is separate.
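For readers following the client-side change, the v1 → content_hash → URL hash preference from the Summary can be sketched as below. This is not the real `verify_and_resolve_hash`: the name `resolve_hash_sketch`, the `toml` field, the optional `content_hash`, and the injected hash closures are hypothetical. It also assumes that a mismatch at the selected tier is an error rather than a fall-through, and that the URL tier compares the URL hash against the locally recomputed content hash.

```rust
/// Hypothetical client-side shape; only `v1_hash` being optional comes from the PR text.
struct DownloadedFilter {
    toml: String,
    v1_hash: Option<String>,
    content_hash: Option<String>,
}

/// Reported when the locally recomputed hash disagrees with the claimed one.
#[derive(Debug, PartialEq)]
struct Mismatch {
    tier: &'static str,
    claimed: String,
    computed: String,
}

/// Accepts the computed hash only if it equals the claim for that tier.
fn check(tier: &'static str, claimed: &str, computed: String) -> Result<String, Mismatch> {
    if claimed == computed {
        Ok(computed)
    } else {
        Err(Mismatch { tier, claimed: claimed.to_string(), computed })
    }
}

/// Picks the highest-preference hash the server supplied and verifies it by
/// recomputing locally. The hash functions are injected to keep the sketch std-only.
fn resolve_hash_sketch(
    filter: &DownloadedFilter,
    url_hash: &str,
    compute_v1: impl Fn(&str) -> String,
    compute_content: impl Fn(&str) -> String,
) -> Result<String, Mismatch> {
    // Tier 1: v1 hash, when the server sent one.
    if let Some(claimed) = &filter.v1_hash {
        return check("v1", claimed, compute_v1(&filter.toml));
    }
    // Tier 2: schema-tied content_hash from #351.
    if let Some(claimed) = &filter.content_hash {
        return check("content_hash", claimed, compute_content(&filter.toml));
    }
    // Tier 3 (assumption): fall back to the hash embedded in the URL.
    check("url", url_hash, compute_content(&filter.toml))
}
```

Recomputing at every tier means a payload altered in transit fails the comparison whichever tier is selected, which is the wire-tamper property the Summary refers to.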
🤖 Generated with Claude Code