Skip to content

fix(server,cli): server recomputes content_hash on download to absorb FilterConfig schema drift#351

Open
mpecan wants to merge 2 commits into
mainfrom
feat/350-install-hash-back-compat
Open

fix(server,cli): server recomputes content_hash on download to absorb FilterConfig schema drift#351
mpecan wants to merge 2 commits into
mainfrom
feat/350-install-hash-back-compat

Conversation

@mpecan
Copy link
Copy Markdown
Owner

@mpecan mpecan commented Apr 27, 2026

Closes #350.

Summary

  • Server: GET /api/filters/<hash>/download now parses the stored TOML and recomputes canonical_hash with the current binary, returning it as content_hash in the response.
  • Client: trusts the server-provided content_hash as the authoritative identity, while still hashing the wire bytes to detect tampering between server and client. When the URL hash differs from the recomputed content_hash (i.e. the filter was published under an older schema), a one-line stderr note explains the drift and the install proceeds under the new identity.
  • Old servers that don't yet emit content_hash trigger the historical URL-hash check on the client (graceful degradation; no regression).

Why this approach

The bug-report filters (0585b874…, d2a19dc4…) cannot be repaired client-only — the URL hash was produced by an older FilterConfig schema whose exact field set we can't reconstruct from the current binary. I exhausted client-side reconstruction strategies (strip type-defaults, strip-since-initial, canonical TOML via toml::to_string on toml::Value, raw-byte hash) and none reproduce the stored hash.

Switching to "server is the trust" works because:

  • The server already has the canonical bytes in R2.
  • Recomputing on every download is cheap (one TOML parse + one SHA-256, dwarfed by R2 latency).
  • Wire-tampering detection survives: the client hashes the bytes it received and asserts they hash to the value the server claims.
  • Future schema additions stop being silent identity-breakers — the recomputed hash always matches what the current code believes.

The URL hash effectively becomes a stable lookup key; the recomputed hash is the content identity under the current schema. The architecture stays simple, no DB migration is needed, and old clients keep working unchanged against new servers.

Long-term direction (not in this PR)

canonical_hash is fundamentally fragile because it's tied to the in-memory shape of FilterConfig. The proper fix is an explicit, version-tagged canonical-TOML hash decoupled from struct evolution (e.g. v1:<sha256> over a deterministic TOML emission with sorted keys / stripped comments). Tracking as a follow-up issue.

Test plan

  • cargo fmt -- --check clean
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo test --workspace — 2184 passing (10 new: 3 server unit, 2 client deserialize, 5 client verify_and_resolve_hash branches)
  • Existing server integration test download_returns_toml_content extended to assert content_hash round-trips
  • Pre-commit hook (file-size limits, README regeneration) passed

Branch coverage of verify_and_resolve_hash

  • server hash present + matches client + URL matches server hash → silent success
  • server hash present + matches client + URL ≠ server hash → success with stderr note (this is the [BUG] Community filters cannot be installed #350 happy path)
  • server hash present + ≠ client → wire-tamper error (assert both hashes appear in message)
  • server hash absent + URL == client → success (old-server happy path)
  • server hash absent + URL ≠ client → error referencing [BUG] Community filters cannot be installed #350 (old-server, no help possible)

Deployment ordering note

This is a coordinated fix:

  • New client + new server → bug-report filters install successfully.
  • Old client + new server → unchanged (old client ignores the new field).
  • New client + old server → falls back to URL hash check; affected filters still fail (no regression).
  • Old client + old server → status quo.

The fix lands the moment tokf.net is updated to serve content_hash. No client republish needed.

🤖 Generated with Claude Code

… FilterConfig schema drift

Closes #350.

`tokf install <hash>` was failing with "filter hash mismatch — the server
may have returned tampered content" for filters published before recent
`FilterConfig` schema additions (e.g. `inject_path`, added 2026-03-07 in
2fa1e50). Each new field with `#[serde(default)]` silently changes the
output of `canonical_hash` for every same-TOML filter that doesn't
reference the new field, breaking the hash that was stored at publish
time.

Investigation ruled out client-side reconstruction strategies: stripping
type-default fields from the JSON, stripping known-since-initial fields,
emitting canonical TOML via `toml::to_string(toml::Value)` — none
reproduce the URL hash for the two reported filters. The original
shape can't be recovered from the current binary.

The server is the trust authority: on `GET /api/filters/<hash>/download`,
it now parses the stored TOML and recomputes `canonical_hash` with the
current binary, returning it as `content_hash` in the response. The URL
hash becomes a stable lookup key; the recomputed `content_hash` is the
authoritative identity under the current schema.

The client trusts the server's `content_hash`, but still hashes the wire
bytes and asserts they match — preserving wire-tampering detection
between server and client. When the user-requested URL hash differs
from the recomputed `content_hash`, the client emits a one-line stderr
note explaining the schema drift; the install proceeds under the
recomputed identity.

Old servers that don't yet emit `content_hash` fall through to the
historical URL-hash check (graceful degradation); behaviour matches
today against the upgrade matrix.

Long-term: define an explicit, version-tagged canonical-TOML hash so
future schema additions don't silently invalidate stored identities.
Tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@repository-butler
Copy link
Copy Markdown
Contributor

Filter Verification Report

Changed Filters

No filter files changed in this PR.

All Filters Summary

✅ 143/143 test cases passed across 51 filters


Generated by tokf verify

mpecan added a commit that referenced this pull request May 5, 2026
## Summary

Persists the schema-independent canonical TOML hash (`v1_hash`,
ADR-0002) on the `filters` table and ships an operator-only backfill
endpoint.

## What changes

- **Migration** `20260428000000_add_v1_hash.sql`: adds nullable `v1_hash
TEXT` + non-partial index. Nullable on purpose — existing rows backfill
operationally; `NOT NULL` and `UNIQUE` are deferred to the dedup PR.
- **Publish** computes v1 alongside `content_hash` in
`validate_and_prepare`. Both publish paths (regular + stdlib) now store
v1 on insert.
- **v1-collision rejection**: when a new submission has the same
`v1_hash` as an existing row but a different `content_hash`, the request
returns 200 OK with the *existing* row's author and content_hash,
without inserting. Stops new publishes from re-splitting canonically
equivalent filters across `content_hash` variants — the going-forward
fix for #350.
- **Backfill endpoint** `POST /api/filters/backfill-v1-hashes`:
service-token-protected, scan-and-compute. Iterates `WHERE v1_hash IS
NULL` rows, fetches the TOML from R2, computes v1, writes it back.
Per-row failures (missing R2 object, malformed TOML) surface in
`failed[]` without aborting the batch.

## Out of scope (explicit follow-ups)

- Dedup migration for existing duplicate rows
- `UNIQUE(v1_hash)` constraint
- `v1_hash NOT NULL` tightening

These all belong to the next PR after backfill has been run in
production. Per-row decisions for that work were captured during
planning: canonical row = oldest, stdlib breaks ties; non-canonical rows
kept and pointed at the canonical row via the existing `successor_hash`
column (no URLs broken).

## Operator runbook (for after deploy)

\`\`\`sh
while true; do
out=\$(curl -X POST -H \"Authorization: Bearer \$TOKF_SERVICE_TOKEN\" \\
    \$SERVER/api/filters/backfill-v1-hashes -d '{\"limit\":100}')
  echo \"\$out\"
  [ \"\$(echo \"\$out\" | jq .processed)\" -eq 0 ] && break
done
\`\`\`

Inspect any persistent entries in \`failed[]\` for manual triage
(corrupt TOML in R2, missing object).

## Test plan

- [x] Migration applies cleanly to a fresh CockroachDB (\`just db-reset
&& cargo sqlx migrate run --source crates/tokf-server/migrations\`)
- [x] Publish v1 storage + collision rejection (DB integration tests,
fixture genuinely triggers v1-collision branch via \`command = "x"\` vs
\`["x"]\`)
- [x] Backfill: populates NULL rows, idempotent, respects limit, caps at
MAX, requires service token, rejects invalid bearer, reports failure for
missing R2 object, reports failure for unparseable TOML
- [x] Stdlib publish stores v1 (separate INSERT path)
- [x] \`cargo fmt --check\`, \`cargo clippy --workspace --all-targets --
-D warnings\`, \`cargo dupes\` all clean
- [x] Full server suite: 268 passed (was 266 before, +2 new tests)

## Stack

- #353 ← (parent — \`feat/350-canonical-v1\`)
- #351 (foundation — \`feat/350-install-hash-back-compat\`)
- main

This PR targets \`feat/350-canonical-v1\`. CI will run once #353 merges
to main and this PR's base auto-rebases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Stacked on #351.** This PR can only merge after #351 lands; if
reviewing, it might be easier to view just this PR's commits via the
diff against `feat/350-install-hash-back-compat`.

Implements [ADR-0002](docs/adr/0002-canonical-v1-hash.md) — a
schema-independent canonical TOML hash that's stable across future
`FilterConfig` schema additions. v1 is the long-term identity for every
filter going forward; #351's `content_hash` (schema-tied recompute) is
retained as a transition fallback.

## Summary

- **`crates/tokf-common/src/canonical_v1.rs`** —
`canonical_v1::hash(toml_str) -> Result<String, _>`. Returns
`v1:<sha256-hex>`. Three normalisation passes on a `toml::Value` tree
(sort unordered arrays, collapse `command` single-form, prune
`false`/`[]`/`{}` defaults), then re-emit via `toml::to_string`, then
SHA-256.
- **Schema-decoupled.** Walks the TOML AST, not `FilterConfig`. Adding
fields to `FilterConfig` doesn't change v1 hashes for filters that don't
write those fields in their TOML.
- **Server**: `DownloadPayload` gains `v1_hash: String`, computed on
every download alongside `content_hash`.
- **Client**: `DownloadedFilter` gains optional `v1_hash`.
`verify_and_resolve_hash` now prefers v1 → content_hash → URL hash,
recomputing locally at each tier so wire-tamper detection is preserved.
- **Frozen corpus**: 51 real stdlib filters (every `.toml` under
`crates/tokf-cli/filters/`) have their v1 hashes recorded in
`tests/canonical_v1_stdlib.txt`. CI fails if any drift. An `#[ignore]`d
authoring helper rebuilds the file when new stdlib filters are added.
- **Property tests**: TOML round-trip idempotence, leading
comments/blanks invariance, AST-level skip/keep reordering invariance,
default-omission invariance, and a sanity-check distinguishing test.
- **`toml` dep promoted to regular and pinned exactly to `=1.0.3`**.
v1's bytes depend on the toml crate's emission; the pin is load-bearing
per the ADR's stability clause.

## ADR

The full specification is in `docs/adr/0002-canonical-v1-hash.md`
(status: Accepted). It locks down the algorithm, the unordered-paths
policy table, the `command` special-form collapse, the default-omission
rules, and the v2 conditions (canonicaliser bug, toml-crate emission
change we can't defer, or an existing policy entry needing a different
policy — adding *new* entries for new fields is part of v1's compat
clause).

## Test plan

- [x] `cargo fmt --all -- --check` clean
- [x] `cargo clippy --workspace --all-targets -- -D warnings` clean
- [x] `cargo test --workspace` — 2216 passed (+30 over baseline)
- [x] 24 unit tests in `canonical_v1` covering every spec rule
- [x] 6 corpus + property tests across 51 real-world stdlib filters
- [x] Server `download_returns_toml_content` test asserts the new
`v1_hash` field is `v1:<64-hex>`
- [x] Client `verify_and_resolve_hash` tests cover the v1-preferred
path, content_hash fallback, URL fallback, and every error case

## Follow-ups (separate PRs, after this lands)

- **Server backfill endpoint** — populate `filter_hashes` (or
equivalent) so old URLs route to their v1 hash.
- **Filter table dedup migration** — collapse rows with the same v1
hash, unifying their split statistics. Cut behind a 0.x-track
deprecation announcement (old client URLs will need an upgrade).
- **Long-term**: ship v2 only when canonicaliser bug, `toml` crate
emission change we can't defer, or existing policy entry needing a
different policy.

## Closes / refs

Refs #350. Does not close — the migration / dedup work is separate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Community filters cannot be installed

1 participant