fix(server,cli): server recomputes content_hash on download to absorb FilterConfig schema drift#351
Open
mpecan wants to merge 2 commits into
Open
fix(server,cli): server recomputes content_hash on download to absorb FilterConfig schema drift#351mpecan wants to merge 2 commits into
mpecan wants to merge 2 commits into
Conversation
… FilterConfig schema drift Closes #350. `tokf install <hash>` was failing with "filter hash mismatch — the server may have returned tampered content" for filters published before recent `FilterConfig` schema additions (e.g. `inject_path`, added 2026-03-07 in 2fa1e50). Each new field with `#[serde(default)]` silently changes the output of `canonical_hash` for every same-TOML filter that doesn't reference the new field, breaking the hash that was stored at publish time. Investigation ruled out client-side reconstruction strategies: stripping type-default fields from the JSON, stripping known-since-initial fields, emitting canonical TOML via `toml::to_string(toml::Value)` — none reproduce the URL hash for the two reported filters. The original shape can't be recovered from the current binary. The server is the trust authority: on `GET /api/filters/<hash>/download`, it now parses the stored TOML and recomputes `canonical_hash` with the current binary, returning it as `content_hash` in the response. The URL hash becomes a stable lookup key; the recomputed `content_hash` is the authoritative identity under the current schema. The client trusts the server's `content_hash`, but still hashes the wire bytes and asserts they match — preserving wire-tampering detection between server and client. When the user-requested URL hash differs from the recomputed `content_hash`, the client emits a one-line stderr note explaining the schema drift; the install proceeds under the recomputed identity. Old servers that don't yet emit `content_hash` fall through to the historical URL-hash check (graceful degradation); behaviour matches today against the upgrade matrix. Long-term: define an explicit, version-tagged canonical-TOML hash so future schema additions don't silently invalidate stored identities. Tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Filter Verification ReportChanged FiltersNo filter files changed in this PR. All Filters Summary✅ 143/143 test cases passed across 51 filters Generated by |
This was referenced Apr 28, 2026
mpecan
added a commit
that referenced
this pull request
May 5, 2026
## Summary Persists the schema-independent canonical TOML hash (`v1_hash`, ADR-0002) on the `filters` table and ships an operator-only backfill endpoint. ## What changes - **Migration** `20260428000000_add_v1_hash.sql`: adds nullable `v1_hash TEXT` + non-partial index. Nullable on purpose — existing rows backfill operationally; `NOT NULL` and `UNIQUE` are deferred to the dedup PR. - **Publish** computes v1 alongside `content_hash` in `validate_and_prepare`. Both publish paths (regular + stdlib) now store v1 on insert. - **v1-collision rejection**: when a new submission has the same `v1_hash` as an existing row but a different `content_hash`, the request returns 200 OK with the *existing* row's author and content_hash, without inserting. Stops new publishes from re-splitting canonically equivalent filters across `content_hash` variants — the going-forward fix for #350. - **Backfill endpoint** `POST /api/filters/backfill-v1-hashes`: service-token-protected, scan-and-compute. Iterates `WHERE v1_hash IS NULL` rows, fetches the TOML from R2, computes v1, writes it back. Per-row failures (missing R2 object, malformed TOML) surface in `failed[]` without aborting the batch. ## Out of scope (explicit follow-ups) - Dedup migration for existing duplicate rows - `UNIQUE(v1_hash)` constraint - `v1_hash NOT NULL` tightening These all belong to the next PR after backfill has been run in production. Per-row decisions for that work were captured during planning: canonical row = oldest, stdlib breaks ties; non-canonical rows kept and pointed at the canonical row via the existing `successor_hash` column (no URLs broken). ## Operator runbook (for after deploy) \`\`\`sh while true; do out=\$(curl -X POST -H \"Authorization: Bearer \$TOKF_SERVICE_TOKEN\" \\ \$SERVER/api/filters/backfill-v1-hashes -d '{\"limit\":100}') echo \"\$out\" [ \"\$(echo \"\$out\" | jq .processed)\" -eq 0 ] && break done \`\`\` Inspect any persistent entries in \`failed[]\` for manual triage (corrupt TOML in R2, missing object). ## Test plan - [x] Migration applies cleanly to a fresh CockroachDB (\`just db-reset && cargo sqlx migrate run --source crates/tokf-server/migrations\`) - [x] Publish v1 storage + collision rejection (DB integration tests, fixture genuinely triggers v1-collision branch via \`command = "x"\` vs \`["x"]\`) - [x] Backfill: populates NULL rows, idempotent, respects limit, caps at MAX, requires service token, rejects invalid bearer, reports failure for missing R2 object, reports failure for unparseable TOML - [x] Stdlib publish stores v1 (separate INSERT path) - [x] \`cargo fmt --check\`, \`cargo clippy --workspace --all-targets -- -D warnings\`, \`cargo dupes\` all clean - [x] Full server suite: 268 passed (was 266 before, +2 new tests) ## Stack - #353 ← (parent — \`feat/350-canonical-v1\`) - #351 (foundation — \`feat/350-install-hash-back-compat\`) - main This PR targets \`feat/350-canonical-v1\`. CI will run once #353 merges to main and this PR's base auto-rebases. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Stacked on #351.** This PR can only merge after #351 lands; if reviewing, it might be easier to view just this PR's commits via the diff against `feat/350-install-hash-back-compat`. Implements [ADR-0002](docs/adr/0002-canonical-v1-hash.md) — a schema-independent canonical TOML hash that's stable across future `FilterConfig` schema additions. v1 is the long-term identity for every filter going forward; #351's `content_hash` (schema-tied recompute) is retained as a transition fallback. ## Summary - **`crates/tokf-common/src/canonical_v1.rs`** — `canonical_v1::hash(toml_str) -> Result<String, _>`. Returns `v1:<sha256-hex>`. Three normalisation passes on a `toml::Value` tree (sort unordered arrays, collapse `command` single-form, prune `false`/`[]`/`{}` defaults), then re-emit via `toml::to_string`, then SHA-256. - **Schema-decoupled.** Walks the TOML AST, not `FilterConfig`. Adding fields to `FilterConfig` doesn't change v1 hashes for filters that don't write those fields in their TOML. - **Server**: `DownloadPayload` gains `v1_hash: String`, computed on every download alongside `content_hash`. - **Client**: `DownloadedFilter` gains optional `v1_hash`. `verify_and_resolve_hash` now prefers v1 → content_hash → URL hash, recomputing locally at each tier so wire-tamper detection is preserved. - **Frozen corpus**: 51 real stdlib filters (every `.toml` under `crates/tokf-cli/filters/`) have their v1 hashes recorded in `tests/canonical_v1_stdlib.txt`. CI fails if any drift. An `#[ignore]`d authoring helper rebuilds the file when new stdlib filters are added. - **Property tests**: TOML round-trip idempotence, leading comments/blanks invariance, AST-level skip/keep reordering invariance, default-omission invariance, and a sanity-check distinguishing test. - **`toml` dep promoted to regular and pinned exactly to `=1.0.3`**. v1's bytes depend on the toml crate's emission; the pin is load-bearing per the ADR's stability clause. ## ADR The full specification is in `docs/adr/0002-canonical-v1-hash.md` (status: Accepted). It locks down the algorithm, the unordered-paths policy table, the `command` special-form collapse, the default-omission rules, and the v2 conditions (canonicaliser bug, toml-crate emission change we can't defer, or an existing policy entry needing a different policy — adding *new* entries for new fields is part of v1's compat clause). ## Test plan - [x] `cargo fmt --all -- --check` clean - [x] `cargo clippy --workspace --all-targets -- -D warnings` clean - [x] `cargo test --workspace` — 2216 passed (+30 over baseline) - [x] 24 unit tests in `canonical_v1` covering every spec rule - [x] 6 corpus + property tests across 51 real-world stdlib filters - [x] Server `download_returns_toml_content` test asserts the new `v1_hash` field is `v1:<64-hex>` - [x] Client `verify_and_resolve_hash` tests cover the v1-preferred path, content_hash fallback, URL fallback, and every error case ## Follow-ups (separate PRs, after this lands) - **Server backfill endpoint** — populate `filter_hashes` (or equivalent) so old URLs route to their v1 hash. - **Filter table dedup migration** — collapse rows with the same v1 hash, unifying their split statistics. Cut behind a 0.x-track deprecation announcement (old client URLs will need an upgrade). - **Long-term**: ship v2 only when canonicaliser bug, `toml` crate emission change we can't defer, or existing policy entry needing a different policy. ## Closes / refs Refs #350. Does not close — the migration / dedup work is separate. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #350.
Summary
GET /api/filters/<hash>/downloadnow parses the stored TOML and recomputescanonical_hashwith the current binary, returning it ascontent_hashin the response.content_hashas the authoritative identity, while still hashing the wire bytes to detect tampering between server and client. When the URL hash differs from the recomputedcontent_hash(i.e. the filter was published under an older schema), a one-line stderr note explains the drift and the install proceeds under the new identity.content_hashtrigger the historical URL-hash check on the client (graceful degradation; no regression).Why this approach
The bug-report filters (
0585b874…,d2a19dc4…) cannot be repaired client-only — the URL hash was produced by an olderFilterConfigschema whose exact field set we can't reconstruct from the current binary. I exhausted client-side reconstruction strategies (strip type-defaults, strip-since-initial, canonical TOML viatoml::to_stringontoml::Value, raw-byte hash) and none reproduce the stored hash.Switching to "server is the trust" works because:
The URL hash effectively becomes a stable lookup key; the recomputed hash is the content identity under the current schema. The architecture stays simple, no DB migration is needed, and old clients keep working unchanged against new servers.
Long-term direction (not in this PR)
canonical_hashis fundamentally fragile because it's tied to the in-memory shape ofFilterConfig. The proper fix is an explicit, version-tagged canonical-TOML hash decoupled from struct evolution (e.g.v1:<sha256>over a deterministic TOML emission with sorted keys / stripped comments). Tracking as a follow-up issue.Test plan
cargo fmt -- --checkcleancargo clippy --workspace --all-targets -- -D warningscleancargo test --workspace— 2184 passing (10 new: 3 server unit, 2 client deserialize, 5 clientverify_and_resolve_hashbranches)download_returns_toml_contentextended to assertcontent_hashround-tripsBranch coverage of
verify_and_resolve_hashDeployment ordering note
This is a coordinated fix:
The fix lands the moment
tokf.netis updated to servecontent_hash. No client republish needed.🤖 Generated with Claude Code