feat(common,server,cli): canonical v1 filter hash #353

Merged: mpecan merged 2 commits into feat/350-install-hash-back-compat from feat/350-canonical-v1 on May 5, 2026

@mpecan (Owner) commented Apr 28, 2026

Stacked on #351. This PR can only merge after #351 lands; if reviewing, it might be easier to view just this PR's commits via the diff against feat/350-install-hash-back-compat.

Implements ADR-0002 — a schema-independent canonical TOML hash that's stable across future FilterConfig schema additions. v1 is the long-term identity for every filter going forward; #351's content_hash (schema-tied recompute) is retained as a transition fallback.

Summary

  • crates/tokf-common/src/canonical_v1.rs: canonical_v1::hash(toml_str) -> Result<String, _> returns v1:<sha256-hex>. Three normalisation passes run on a toml::Value tree (sort unordered arrays, collapse command single-form, prune false/[]/{} defaults); the tree is then re-emitted via toml::to_string and hashed with SHA-256.
  • Schema-decoupled. Walks the TOML AST, not FilterConfig. Adding fields to FilterConfig doesn't change v1 hashes for filters that don't write those fields in their TOML.
  • Server: DownloadPayload gains v1_hash: String, computed on every download alongside content_hash.
  • Client: DownloadedFilter gains optional v1_hash. verify_and_resolve_hash now prefers v1 → content_hash → URL hash, recomputing locally at each tier so wire-tamper detection is preserved.
  • Frozen corpus: 51 real stdlib filters (every .toml under crates/tokf-cli/filters/) have their v1 hashes recorded in tests/canonical_v1_stdlib.txt. CI fails if any drift. An #[ignore]d authoring helper rebuilds the file when new stdlib filters are added.
  • Property tests: TOML round-trip idempotence, leading comments/blanks invariance, AST-level skip/keep reordering invariance, default-omission invariance, and a sanity-check distinguishing test.
  • toml dep promoted to regular and pinned exactly to =1.0.3. v1's bytes depend on the toml crate's emission; the pin is load-bearing per the ADR's stability clause.

ADR

The full specification is in docs/adr/0002-canonical-v1-hash.md (status: Accepted). It locks down the algorithm, the unordered-paths policy table, the command special-form collapse, the default-omission rules, and the v2 conditions (canonicaliser bug, toml-crate emission change we can't defer, or an existing policy entry needing a different policy — adding new entries for new fields is part of v1's compat clause).

Test plan

  • cargo fmt --all -- --check clean
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo test --workspace — 2216 passed (+30 over baseline)
  • 24 unit tests in canonical_v1 covering every spec rule
  • 6 corpus + property tests across 51 real-world stdlib filters
  • Server download_returns_toml_content test asserts the new v1_hash field is v1:<64-hex>
  • Client verify_and_resolve_hash tests cover the v1-preferred path, content_hash fallback, URL fallback, and every error case
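
The `v1:<64-hex>` shape asserted by the server test can be checked with a few lines of std-only Rust. This is a minimal sketch, not the project's test code: `is_v1_hash` is a hypothetical helper name, and accepting both upper- and lower-case hex digits is an assumption (the PR only says "sha256-hex").

```rust
/// Hypothetical helper: true when `s` looks like `v1:` followed by
/// exactly 64 hexadecimal characters (the length of a SHA-256 digest in hex).
fn is_v1_hash(s: &str) -> bool {
    match s.strip_prefix("v1:") {
        // Assumption: either hex case is acceptable.
        Some(hex) => hex.len() == 64 && hex.bytes().all(|b| b.is_ascii_hexdigit()),
        None => false,
    }
}
```

A check like this only validates the format; it says nothing about whether the digest matches the filter's canonical bytes.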

Follow-ups (separate PRs, after this lands)

  • Server backfill endpoint — populate filter_hashes (or equivalent) so old URLs route to their v1 hash.
  • Filter table dedup migration — collapse rows with the same v1 hash, unifying their split statistics. Cut behind a 0.x-track deprecation announcement (old client URLs will need an upgrade).
  • Long-term: ship v2 only in response to a canonicaliser bug, a toml-crate emission change we can't defer, or an existing policy entry that needs a different policy.

Closes / refs

Refs #350. Does not close — the migration / dedup work is separate.

🤖 Generated with Claude Code

Stacked on #351. Implements ADR-0002 (`docs/adr/0002-canonical-v1-hash.md`):
a schema-independent canonical TOML hash that is stable across future
`FilterConfig` schema additions. v1 is the long-term identity for every
filter going forward; #351's `content_hash` (recomputed canonical_hash) is
retained as the transition fallback.

## What v1 does

`canonical_v1::hash(toml_str)` returns `v1:<sha256-hex>` over a
canonicalised byte stream:

  parse TOML → toml::Value tree
  → sort arrays whose paths are in the unordered policy table
    (`skip`, `keep`, `[on_success].skip`, `[on_failure].skip`)
  → collapse `command = ["x"]` to `command = "x"` (single-entry only)
  → recursively prune entries equal to `false`, `[]`, or `{}`
  → re-emit via `toml::to_string`
  → SHA-256 the bytes

The canonicaliser is schema-decoupled — it operates on `toml::Value`,
not `FilterConfig`. New fields added to `FilterConfig` later don't
appear in the canonical bytes unless the user writes them in their
TOML. Filters that don't reference new fields keep the same v1 hash
forever.
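
The three normalisation passes can be sketched on a hand-rolled value tree. This is an illustration under stated assumptions, not the code in `canonical_v1.rs`: `Value` is a stand-in for `toml::Value`, the `canonicalise` and `is_default` names are hypothetical, pruning is applied to table entries only, the `command` collapse is applied at the top level only, and the re-emit and SHA-256 steps are left out because they need the `toml` crate and a hashing crate.

```rust
use std::collections::BTreeMap;

/// Stand-in for `toml::Value` (assumption: the real code walks the toml crate's tree).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Integer(i64),
    String(String),
    Array(Vec<Value>),
    Table(BTreeMap<String, Value>),
}

/// Dotted paths whose arrays are order-insensitive, taken from the policy list above.
const UNORDERED: [&str; 4] = ["skip", "keep", "on_success.skip", "on_failure.skip"];

/// Entries equal to `false`, `[]`, or `{}` are treated as defaults.
fn is_default(v: &Value) -> bool {
    match v {
        Value::Bool(false) => true,
        Value::Array(items) => items.is_empty(),
        Value::Table(table) => table.is_empty(),
        _ => false,
    }
}

/// Normalise a tree; `path` is the dotted key path of `value` ("" for the root).
fn canonicalise(value: Value, path: &str) -> Value {
    match value {
        Value::Table(table) => {
            let mut out = BTreeMap::new();
            for (key, child) in table {
                let child_path = if path.is_empty() { key.clone() } else { format!("{path}.{key}") };
                // Children are normalised first, so a table emptied by pruning is pruned too.
                let child = canonicalise(child, &child_path);
                if !is_default(&child) {
                    out.insert(key, child);
                }
            }
            Value::Table(out)
        }
        Value::Array(items) => {
            // Collapse `command = ["x"]` to `command = "x"` (single-entry only).
            if path == "command" && items.len() == 1 && matches!(items[0], Value::String(_)) {
                return items.into_iter().next().unwrap();
            }
            let mut items: Vec<Value> = items.into_iter().map(|v| canonicalise(v, path)).collect();
            if UNORDERED.contains(&path) {
                // Any deterministic total order works; the debug rendering is used here.
                items.sort_by_key(|v| format!("{v:?}"));
            }
            Value::Array(items)
        }
        other => other,
    }
}
```

Two trees that differ only in `skip` order, in the `command` single form, or in explicitly written defaults normalise to the same tree, which is what makes the subsequent emit-and-hash step stable.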

## Server side

`DownloadPayload` gains a `v1_hash: String` field, computed alongside
`content_hash` on every download.

## Client side

`DownloadedFilter` gains an optional `v1_hash` field. Install flow's
`verify_and_resolve_hash` now has a three-tier preference: v1 first,
content_hash second, URL hash as last-resort fallback. Each tier
recomputes locally and verifies against the server's claim — wire-tamper
detection is preserved.
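
The three-tier preference might look roughly like the sketch below. Everything about the shape is an assumption made for illustration: the field names on `DownloadedFilter`, the `HashError` type, the injected `compute_v1` / `compute_content` closures, and the choice to fail (rather than fall through) when a claimed hash does not match the local recomputation. The real `verify_and_resolve_hash` signature is not shown in this PR description.

```rust
/// Hypothetical subset of the download response.
struct DownloadedFilter {
    toml: String,
    v1_hash: Option<String>,
    content_hash: Option<String>,
}

#[derive(Debug, PartialEq)]
enum HashError {
    /// The locally recomputed hash differs from the claimed one.
    Mismatch { tier: &'static str },
}

/// Compare a claimed hash with the locally recomputed one.
fn check(tier: &'static str, claimed: &str, local: &str) -> Result<String, HashError> {
    if claimed == local {
        Ok(local.to_string())
    } else {
        Err(HashError::Mismatch { tier })
    }
}

/// Prefer v1, then content_hash, then the hash embedded in the URL.
/// Each tier recomputes from the downloaded TOML before trusting the claim.
fn verify_and_resolve_hash(
    filter: &DownloadedFilter,
    url_hash: &str,
    compute_v1: impl Fn(&str) -> String,
    compute_content: impl Fn(&str) -> String,
) -> Result<String, HashError> {
    if let Some(claimed) = &filter.v1_hash {
        return check("v1", claimed, &compute_v1(&filter.toml));
    }
    if let Some(claimed) = &filter.content_hash {
        return check("content_hash", claimed, &compute_content(&filter.toml));
    }
    // Assumption: the URL hash is verified with the content-hash function.
    check("url", url_hash, &compute_content(&filter.toml))
}
```

Injecting the hash functions keeps the sketch free of crate dependencies; in the real client they would presumably be `canonical_v1::hash` and the #351 content-hash recompute.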

## Test corpus

Snapshot test: every `.toml` under `crates/tokf-cli/filters/` (51
real-world stdlib filters) has its v1 hash recorded in
`tests/canonical_v1_stdlib.txt`. CI fails if any drifts. Authoring
helper (`dump_stdlib_hashes`, `#[ignore]`d) regenerates the file when
new stdlib filters are added.
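
A drift check of this kind can be sketched as a comparison between the recorded snapshot and freshly computed hashes. The on-disk format of `tests/canonical_v1_stdlib.txt` is not described here, so the `<relative path> <v1 hash>` line format below is an assumption, as are the `parse_snapshot` and `find_drift` names.

```rust
use std::collections::BTreeMap;

/// Parse snapshot lines of the assumed form `<relative path> <v1 hash>`.
/// Blank lines and lines without a space are skipped.
fn parse_snapshot(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| line.trim().split_once(' '))
        .map(|(path, hash)| (path.to_string(), hash.to_string()))
        .collect()
}

/// List every filter whose current hash is missing from, or differs from,
/// the snapshot, plus snapshot entries that no longer have a filter.
fn find_drift(
    snapshot: &BTreeMap<String, String>,
    current: &BTreeMap<String, String>,
) -> Vec<String> {
    let mut drifted = Vec::new();
    for (path, hash) in current {
        match snapshot.get(path) {
            Some(recorded) if recorded == hash => {}
            Some(recorded) => drifted.push(format!("{path}: recorded {recorded}, got {hash}")),
            None => drifted.push(format!("{path}: not in snapshot")),
        }
    }
    for path in snapshot.keys() {
        if !current.contains_key(path) {
            drifted.push(format!("{path}: in snapshot but no filter found"));
        }
    }
    drifted
}
```

A CI test built on this would fail whenever `find_drift` returns a non-empty list, which covers both changed hashes and filters added without regenerating the snapshot.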

Property tests on a representative subset:
- `invariant_toml_roundtrip_idempotent` — toml::from_str → to_string →
  hash is stable.
- `invariant_leading_comments_and_blanks` — file-level comments and
  blank-line padding don't change the hash.
- `invariant_skip_keep_reversed_via_ast` — reversing unordered array
  values via AST mutation doesn't change the hash.
- `invariant_default_false_added` — adding `dedup = false` to a filter
  that doesn't mention it doesn't change the hash.
- `distinguishing_value_change` — sanity: a real value change DOES
  change the hash.

24 unit tests in canonical_v1 itself cover every spec rule (output
format, BTreeMap label invariance, unknown-fields-dropped, default
omission edge cases — including 0-integer and "" preserved, true bool
preserved, empty-string distinguished from absent — plus all error
paths).

## Dependency change

`toml` is promoted from optional (under `validation` feature) to a
regular dep on `tokf-common`, pinned to exactly `=1.0.3`. The pin is
load-bearing: v1's canonical bytes depend on the toml crate's emission;
any version bump must be gated by the corpus, with a v2 trigger
prepared for cases where emission genuinely changes.

## Test plan

- [x] `cargo fmt --all -- --check` clean
- [x] `cargo clippy --workspace --all-targets -- -D warnings` clean
- [x] `cargo test --workspace` — 2216 passed (+30 over baseline)
- [x] All 51 stdlib filter v1 hashes recorded; corpus round-trip green
- [x] Property tests green on representative stdlib subset

## Closes / refs

Refs #350 (the strategic fix; #351 stays as the immediate stopgap).

Follow-up work (separate PRs):
- Server backfill endpoint to populate v1 aliases for existing filters.
- Filter table dedup migration: collapse rows with the same v1 hash,
  unifying split statistics. Behind a 0.x-track deprecation
  announcement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## Summary

Persists the schema-independent canonical TOML hash (`v1_hash`,
ADR-0002) on the `filters` table and ships an operator-only backfill
endpoint.

## What changes

- **Migration** `20260428000000_add_v1_hash.sql`: adds nullable `v1_hash
TEXT` + non-partial index. Nullable on purpose — existing rows backfill
operationally; `NOT NULL` and `UNIQUE` are deferred to the dedup PR.
- **Publish** computes v1 alongside `content_hash` in
`validate_and_prepare`. Both publish paths (regular + stdlib) now store
v1 on insert.
- **v1-collision rejection**: when a new submission has the same
`v1_hash` as an existing row but a different `content_hash`, the request
returns 200 OK with the *existing* row's author and content_hash,
without inserting. Stops new publishes from re-splitting canonically
equivalent filters across `content_hash` variants — the going-forward
fix for #350.
- **Backfill endpoint** `POST /api/filters/backfill-v1-hashes`:
service-token-protected, scan-and-compute. Iterates `WHERE v1_hash IS
NULL` rows, fetches the TOML from R2, computes v1, writes it back.
Per-row failures (missing R2 object, malformed TOML) surface in
`failed[]` without aborting the batch.
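
The v1-collision branch described above can be sketched as a pure decision function. This is a simplified illustration, not the server's publish handler: `ExistingRow`, `PublishOutcome`, and `decide_publish` are hypothetical names, the rows are passed in as a slice instead of being queried from the database, and exact `content_hash` duplicates are assumed to be handled by an earlier check that is not shown.

```rust
/// Hypothetical view of a row in the `filters` table.
struct ExistingRow {
    author: String,
    content_hash: String,
    /// Nullable until the backfill has run.
    v1_hash: Option<String>,
}

#[derive(Debug, PartialEq)]
enum PublishOutcome {
    /// No canonically equivalent row exists: insert the submission.
    Insert,
    /// A row with the same v1 hash but a different content_hash exists:
    /// report that row instead of inserting a new one.
    ExistingEquivalent { author: String, content_hash: String },
}

fn decide_publish(rows: &[ExistingRow], new_content_hash: &str, new_v1_hash: &str) -> PublishOutcome {
    for row in rows {
        let same_v1 = row.v1_hash.as_deref() == Some(new_v1_hash);
        if same_v1 && row.content_hash != new_content_hash {
            return PublishOutcome::ExistingEquivalent {
                author: row.author.clone(),
                content_hash: row.content_hash.clone(),
            };
        }
    }
    PublishOutcome::Insert
}
```

Rows whose `v1_hash` is still NULL never match, so the collision check only becomes complete once the backfill has populated existing rows.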

## Out of scope (explicit follow-ups)

- Dedup migration for existing duplicate rows
- `UNIQUE(v1_hash)` constraint
- `v1_hash NOT NULL` tightening

These all belong to the next PR after backfill has been run in
production. Per-row decisions for that work were captured during
planning: canonical row = oldest, stdlib breaks ties; non-canonical rows
kept and pointed at the canonical row via the existing `successor_hash`
column (no URLs broken).
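
The planned per-row decisions could be expressed as below. This is only a sketch of the stated rule for the future dedup PR, under assumptions: `Row`, `pick_canonical`, and `successor_links` are hypothetical names, `created_at` is modelled as a plain integer timestamp, "stdlib breaks ties" is read as "among equally old rows, prefer the stdlib one", and all rows passed in are assumed to share one v1 hash.

```rust
/// Hypothetical view of one row in a group sharing the same v1 hash.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    content_hash: String,
    created_at: u64,
    is_stdlib: bool,
}

/// Canonical row = oldest; on equal age a stdlib row wins
/// (`false < true`, so `!is_stdlib` sorts stdlib rows first).
fn pick_canonical(rows: &[Row]) -> Option<&Row> {
    rows.iter().min_by_key(|r| (r.created_at, !r.is_stdlib))
}

/// Pairs of (non-canonical content_hash, canonical content_hash), i.e. the
/// values that would be written to `successor_hash` on the rows that are kept.
fn successor_links(rows: &[Row]) -> Vec<(String, String)> {
    let Some(canonical) = pick_canonical(rows) else {
        return Vec::new();
    };
    rows.iter()
        .filter(|r| r.content_hash != canonical.content_hash)
        .map(|r| (r.content_hash.clone(), canonical.content_hash.clone()))
        .collect()
}
```

Because non-canonical rows are kept and only linked, URLs that embed their `content_hash` keep resolving.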

## Operator runbook (for after deploy)

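```sh
while true; do
  # -sS hides the progress meter but still prints errors; the JSON
  # Content-Type assumes the endpoint expects a JSON request body.
  out=$(curl -sS -X POST -H "Authorization: Bearer $TOKF_SERVICE_TOKEN" \
    -H "Content-Type: application/json" \
    "$SERVER/api/filters/backfill-v1-hashes" -d '{"limit":100}')
  echo "$out"
  [ "$(echo "$out" | jq .processed)" -eq 0 ] && break
done
```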

Inspect any persistent entries in `failed[]` for manual triage
(corrupt TOML in R2, missing object).
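
To make the `processed` / `failed[]` contract concrete, here is a sketch of one backfill batch that records per-row failures without aborting. It is an illustration only: `BackfillReport`, `backfill_batch`, and the injected closures are hypothetical, database and R2 access are replaced by closures, and counting only successfully written rows in `processed` is an assumption about the endpoint's response.

```rust
#[derive(Debug, Default, PartialEq)]
struct BackfillReport {
    /// Rows whose v1 hash was computed and stored (assumed meaning).
    processed: usize,
    /// (row id, reason) for rows that could not be backfilled.
    failed: Vec<(String, String)>,
}

/// Process up to `limit` rows that still lack a v1 hash.
fn backfill_batch(
    pending_ids: &[String],
    limit: usize,
    fetch_toml: impl Fn(&str) -> Result<String, String>,
    compute_v1: impl Fn(&str) -> Result<String, String>,
    mut store: impl FnMut(&str, String),
) -> BackfillReport {
    let mut report = BackfillReport::default();
    for id in pending_ids.iter().take(limit) {
        // A missing object or malformed TOML becomes a `failed` entry;
        // the loop carries on with the next row.
        match fetch_toml(id).and_then(|toml| compute_v1(toml.as_str())) {
            Ok(hash) => {
                store(id, hash);
                report.processed += 1;
            }
            Err(reason) => report.failed.push((id.clone(), reason)),
        }
    }
    report
}
```

Under this reading, rows that fail on every pass stay NULL and keep reappearing in `failed[]`, which is why they need manual triage.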

## Test plan

- [x] Migration applies cleanly to a fresh CockroachDB
(`just db-reset && cargo sqlx migrate run --source crates/tokf-server/migrations`)
- [x] Publish v1 storage + collision rejection (DB integration tests,
fixture genuinely triggers v1-collision branch via `command = "x"` vs
`["x"]`)
- [x] Backfill: populates NULL rows, idempotent, respects limit, caps at
MAX, requires service token, rejects invalid bearer, reports failure for
missing R2 object, reports failure for unparseable TOML
- [x] Stdlib publish stores v1 (separate INSERT path)
- [x] `cargo fmt --check`, `cargo clippy --workspace --all-targets -- -D warnings`,
`cargo dupes` all clean
- [x] Full server suite: 268 passed (was 266 before, +2 new tests)

## Stack

- #353 ← (parent: `feat/350-canonical-v1`)
- #351 (foundation: `feat/350-install-hash-back-compat`)
- main

This PR targets `feat/350-canonical-v1`. CI will run once #353 merges
to main and this PR's base auto-rebases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mpecan merged commit e47ac7e into feat/350-install-hash-back-compat on May 5, 2026
@mpecan deleted the feat/350-canonical-v1 branch on May 5, 2026 13:40