Kernel auth gap on Withdraw: reproducible PoC + analysis #4
Closed
saroupille wants to merge 11 commits into trilitech:main
A WOTS-signed `ConfigureVerifier` KernelInboxMessage serializes to
4923 bytes, and `ConfigureBridge` to 4835 bytes. Both exceed the
Tezos smart-rollup protocol constant `sc_rollup_message_size_limit`
(4096 bytes), so they cannot transit through the L1 external-message
path that `octez-client send smart rollup message` uses.
This commit extends the existing DAL delivery path — already routing
Shield / Transfer / Unshield payloads too large for L1 — to cover the
two admin configuration messages.
Kernel-side:
- `core/src/kernel_wire.rs`
* `KernelDalPayloadKind` gains `ConfigureVerifier` (wire tag 3)
and `ConfigureBridge` (wire tag 4).
* `kernel_dal_payload_kind_{to,from}_wire` handle both.
* A comment clarifies that tag numbering here is independent of
the tags used by `WireKernelInboxMessage`.
* `KERNEL_WIRE_VERSION` bumped to 10: older clients that read a
`DalPointer` with `kind=3|4` now see an explicit envelope
version mismatch rather than an opaque tag error.
- `tezos/rollup-kernel/src/lib.rs`
* `fetch_kernel_message_from_dal` accepts the two new kinds.
* The dispatch match is reshaped to be exhaustive on
`KernelDalPayloadKind`: any future variant will be a compile
error until handled here, instead of silently hitting the old
`_ => Err("kind mismatch")` arm.
* Docstring explains that the kind-vs-content check is a
defense-in-depth control (forces honest labeling for auditors)
and that authenticity itself comes from the WOTS signature /
STARK proof inside the payload — DAL is a public bulletin
board with no transport-level authentication.
* `dal_payload_kind_name` labels the two new variants.
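The wire-tag mapping and its compile-time exhaustiveness can be sketched roughly as follows. This is a minimal illustration, not the code in `core/src/kernel_wire.rs`: only tags 3 and 4 are stated above, the tags shown for `Shield` / `Transfer` / `Unshield` are assumed, and the function names `kind_to_wire` / `kind_from_wire` are hypothetical stand-ins for `kernel_dal_payload_kind_{to,from}_wire`.

```rust
/// Payload kinds that may be delivered through DAL (sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelDalPayloadKind {
    Shield,
    Transfer,
    Unshield,
    ConfigureVerifier,
    ConfigureBridge,
}

/// Exhaustive on the enum: a new variant is a compile error until it
/// is given a tag here (there is no `_` arm).
pub fn kind_to_wire(kind: KernelDalPayloadKind) -> u8 {
    match kind {
        // ASSUMPTION: tags 0..=2 are illustrative, not taken from the source.
        KernelDalPayloadKind::Shield => 0,
        KernelDalPayloadKind::Transfer => 1,
        KernelDalPayloadKind::Unshield => 2,
        // Tags 3 and 4 are the ones named in the commit message.
        KernelDalPayloadKind::ConfigureVerifier => 3,
        KernelDalPayloadKind::ConfigureBridge => 4,
    }
}

/// Decoding necessarily matches on an open `u8`, so unknown tags get
/// an explicit error instead of a silent default.
pub fn kind_from_wire(tag: u8) -> Result<KernelDalPayloadKind, String> {
    match tag {
        0 => Ok(KernelDalPayloadKind::Shield),
        1 => Ok(KernelDalPayloadKind::Transfer),
        2 => Ok(KernelDalPayloadKind::Unshield),
        3 => Ok(KernelDalPayloadKind::ConfigureVerifier),
        4 => Ok(KernelDalPayloadKind::ConfigureBridge),
        other => Err(format!("unknown DAL payload kind tag {}", other)),
    }
}
```

The asymmetry is deliberate in this sketch: the encoder gets the compiler's exhaustiveness check, while the decoder has to reject unknown tags at run time.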
Tooling:
- `tezos/rollup-kernel/src/bin/octez_kernel_message.rs`
* New `configure-verifier-payload` and `configure-bridge-payload`
subcommands emit the raw unframed KernelInboxMessage hex — the
input for chunking and DAL publication.
* `dal-pointer` accepts the `configure_verifier` and
`configure_bridge` kind tokens.
* When `TZEL_ROLLUP_CONFIG_ADMIN_ASK_HEX` is unset in debug
builds, the fallback to the public dev ask now emits an
`eprintln!` warning; silent fallback paired with a
release-profile kernel built without admin material would be a
footgun.
- `tezos/rollup-kernel/build.rs` (new)
* Emits `cargo:rerun-if-env-changed=` for the three admin
material env vars consumed by `option_env!()` in the kernel
source (the config-admin public seed plus the verifier and
bridge config-admin WOTS leaves), so a rotation of the admin
material always re-bakes the WASM.
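A build script of this shape might look like the sketch below. The three variable names are hypothetical: the text above only establishes the `TZEL_ROLLUP_CONFIG_ADMIN_*_HEX` pattern, not the exact names. In a real `build.rs`, `main` would simply call `emit_rerun_directives()`.

```rust
/// ASSUMPTION: hypothetical names following the
/// `TZEL_ROLLUP_CONFIG_ADMIN_*_HEX` pattern; the real ones may differ.
pub const ADMIN_ENV_VARS: [&str; 3] = [
    "TZEL_ROLLUP_CONFIG_ADMIN_PUB_SEED_HEX",
    "TZEL_ROLLUP_CONFIG_ADMIN_VERIFIER_LEAF_HEX",
    "TZEL_ROLLUP_CONFIG_ADMIN_BRIDGE_LEAF_HEX",
];

/// Build one `cargo:rerun-if-env-changed=` directive per variable.
pub fn rerun_directives(vars: &[&str]) -> Vec<String> {
    vars.iter()
        .map(|name| format!("cargo:rerun-if-env-changed={}", name))
        .collect()
}

/// Cargo reads build-script directives from stdout, one per line.
pub fn emit_rerun_directives() {
    for line in rerun_directives(&ADMIN_ENV_VARS) {
        println!("{}", line);
    }
}
```

Without such directives, Cargo has no way to know that a value read through `option_env!()` changed, so a rotated key could leave a stale WASM in the build cache.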
Tests:
- `core/src/kernel_wire.rs::tests`
* Size sentinels for `ConfigureVerifier` (4923 bytes) and
`ConfigureBridge` (4835 bytes). The tests pass today; they
fail loudly if a future encoding change shifts either size,
forcing a review of DAL routing assumptions.
Operator and wallet-server are intentionally left unchanged: by
design, admin config messages flow directly from an admin's
`octez_kernel_message` + `octez-client` (with the admin's own L1 key
and WOTS ask), never through the user-facing operator API. This keeps
the operator's interface narrow, preserves admin availability
independent of operator health, and prevents a bearer-token leak from
granting the ability to inject admin configs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an "Admin configuration messages and DAL routing" section to the
rollup-kernel README describing:
- why `ConfigureVerifier` and `ConfigureBridge` must use DAL (WOTS
signature bloat pushes each message above 4096 bytes);
- the delivery flow end-to-end (admin computes unframed payload,
chunks to DAL, injects a `DalPointer` on L1; kernel reassembles
pages, verifies the payload hash, decodes, and dispatches);
- a checklist for adding a new oversized message type in the
future (wire tag, dispatch arm, CLI subcommand, size sentinel).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five fixes bundled to get `octez_rollup_sandbox_dal_smoke.sh` through
configure / deposit on an Octez master built from recent trunk:
1. `attestation_lags` invariant
The Alpha protocol gained a restriction (tezos master 8499ce19ac,
2025-12-04) that the last element of `attestation_lags` must equal
`attestation_lag`. The mockup generates `attestation_lags =
[1,2,3,4,5]` by default, and the script overrides `attestation_lag`
to 2 via `DAL_ATTESTATION_LAG`, so `activate_alpha` fails. Force
`attestation_lags = [attestation_lag]` in `build_alpha_sandbox_params`.
2. Configure messages via DAL
`configure-verifier` (4944 bytes framed) and `configure-bridge`
(4856 bytes framed) exceed `sc_rollup_message_size_limit = 4096`,
so the old direct `octez-client send smart rollup message` path
fails at encoding. Route both via the DAL delivery path instead,
using the new `configure-{verifier,bridge}-payload` CLI subcommands
and a generalized `publish_payload_via_dal_and_inject_pointer`
helper factored out of `publish_shield_via_dal_and_inject_pointer`.
3. Admin material baked into the release kernel WASM
The release kernel's `authenticate_{verifier,bridge}_config` only
accepts admin-signed payloads when the admin leaves are baked in at
compile time via `TZEL_ROLLUP_CONFIG_ADMIN_*_HEX`. Without this,
the kernel silently rejects every configure payload. Call
`scripts/prepare_rollup_config_admin.sh` before the kernel build
and source both the runtime (secret ask) and build (public leaves)
env files.
4. `xxd -ps -c 0` newline workaround
On our xxd version `-c 0` still wraps at ~60 characters, inserting
newlines that silently break string matches and URL / Michelson
arg construction. Pipe to `tr -d '\n'` on the affected call sites:
`await_bridge_ticketer`, `deposit_to_bridge`, and the balance-key
construction in `main`. Without this, `await_bridge_ticketer`
reports "ticketer did not appear" even after the kernel applied
the configuration.
5. Caveat on `set -a` scope
A comment makes explicit that exporting
`TZEL_ROLLUP_CONFIG_ADMIN_ASK_HEX` to every descendant process is
acceptable in sandbox (ephemeral per-workdir ask) but must not be
copied to a production runner.
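The `attestation_lags` invariant from item 1 can be illustrated with a small check. This is a sketch in Rust rather than the shell used by the script; the function names are hypothetical, and treating an empty list as an error is an assumption of the sketch, not something stated about the protocol.

```rust
/// Check that the last element of `attestation_lags` equals
/// `attestation_lag`, as the restriction described above requires.
pub fn check_attestation_lags(
    attestation_lag: u32,
    attestation_lags: &[u32],
) -> Result<(), String> {
    match attestation_lags.last() {
        Some(&last) if last == attestation_lag => Ok(()),
        Some(&last) => Err(format!(
            "last attestation_lags element is {} but attestation_lag is {}",
            last, attestation_lag
        )),
        // ASSUMPTION: an empty list is rejected in this sketch.
        None => Err("attestation_lags is empty".to_string()),
    }
}

/// The fix applied in `build_alpha_sandbox_params`: collapse the list
/// to the single overridden lag so the invariant holds by construction.
pub fn forced_attestation_lags(attestation_lag: u32) -> Vec<u32> {
    vec![attestation_lag]
}
```

With the mockup default `[1,2,3,4,5]` and the override `attestation_lag = 2`, the check fails; with the forced list `[2]`, it passes.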
After these fixes, the smoke reaches and applies configure-verifier +
configure-bridge + the initial bridge deposit; the subsequent fixture
shield step is not within the scope of this patch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bde1347 ("Add burned rollup fees and DAL producer note outputs") made
two changes the sandbox smoke script never absorbed:

1. apply_shield now debits `v + fee + producer_fee` (not just `v`)
   from the sender's public balance:

       let debit = req.v
           .checked_add(req.fee)
           .and_then(|value| value.checked_add(req.producer_fee))?;
       if bal < debit {
           return Err("insufficient balance");
       }

   The fixture metadata still exposed only `shield_amount:
   fixture.shield.v` and the sandbox deposited exactly that.
   Post-bde1347 the balance is short by `fee + producer_fee`, the
   shield fails with "insufficient balance", and the public drain
   never happens. For the checked-in fixture (v=400_000, fee=100_000,
   producer_fee=1 mutez) the required deposit is 500_001 mutez instead
   of 400_000.

   Rename `shield_amount` to `shield_bridge_deposit` in
   `FixtureMetadata` and compute it as `v + fee + producer_fee`. The
   sandbox script picks up the new field and uses the same value for
   both the bridge deposit and the pre-shield balance assertion.

2. apply_shield now appends *two* notes to the Merkle tree per shield:
   the sender's own commitment and the producer's compensation
   commitment. The smoke's post-shield assertion still expected
   `/tzel/v1/state/tree/size == 1` (the pre-fees value), which makes
   the smoke stall at line 698 even though the shield applied cleanly
   and the public balance drained to zero.

   Update the assertion to `2` and leave a comment pointing at
   apply_shield so the next person knows why.

These two regressions were hidden behind an earlier one (configure
messages exceeding sc_rollup_message_size_limit, fixed in 5071e2e on
this branch): the script never got past configure-verifier, so neither
bde1347-induced break ever executed. Once configures route through
DAL, the smoke advances through the shield and both breaks surface in
sequence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
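The deposit arithmetic can be checked with a small standalone sketch. `ShieldReq`, `required_debit`, and `debit_balance` are illustrative names rather than the kernel's actual types, and the "amount overflow" error string is an assumption; only the `checked_add` chain and the "insufficient balance" error come from the commit message.

```rust
/// Illustrative stand-in for the fields `apply_shield` reads.
pub struct ShieldReq {
    pub v: u64,
    pub fee: u64,
    pub producer_fee: u64,
}

/// `v + fee + producer_fee`, or `None` if the sum overflows `u64`.
pub fn required_debit(req: &ShieldReq) -> Option<u64> {
    req.v
        .checked_add(req.fee)
        .and_then(|value| value.checked_add(req.producer_fee))
}

/// Return the remaining balance after the debit, or an error if the
/// public balance does not cover the full debit.
pub fn debit_balance(bal: u64, req: &ShieldReq) -> Result<u64, &'static str> {
    // ASSUMPTION: the overflow error message is hypothetical.
    let debit = required_debit(req).ok_or("amount overflow")?;
    if bal < debit {
        return Err("insufficient balance");
    }
    Ok(bal - debit)
}
```

For the checked-in fixture values (`v = 400_000`, `fee = 100_000`, `producer_fee = 1`), a deposit of 400_000 mutez is rejected while a deposit of 500_001 mutez drains to exactly zero.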
The two size sentinels next to this test lock exact byte counts for
`ConfigureVerifier` and `ConfigureBridge`. They would have caught the
specific regression they target (a WOTS signature change growing those
two messages past `sc_rollup_message_size_limit = 4096`), but only
because someone knew to write them *after* the regression surfaced.

A more general failure mode: a new field lands on any
`KernelInboxMessage` variant, pushes its serialized size past 4096,
and nothing fails until an operator tries `octez-client send smart
rollup message` against a real node and gets rejected at the L1 inbox.
That is how commit 2c45d9c broke admin config: unit tests all passed;
the break surfaced weeks later in the sandbox smoke.

Add a third test that makes the invariant structural:

- A `Routing` enum (`FitsL1` / `RequiresDal`) classifies every
  variant. `required_routing` is an **exhaustive match** on
  `KernelInboxMessage` with no `_` arm, so the compiler forces any
  future variant author to classify the new message before the crate
  builds.
- `framed_len` computes the on-wire size the L1 inbox actually sees,
  i.e. `encode_kernel_inbox_message(...).len()` plus the
  `ExternalMessageFrame::Targetted` overhead (21 bytes: 1 tag + 20
  bytes of `SmartRollupHash`). The existing sentinels measure the
  unframed envelope and under-count by 21 bytes: a message that lands
  just below 4096 unframed can still be rejected on wire.
- The assertion is two-sided:
  * FitsL1 with `framed > 4096` fails: the L1 routing is broken.
  * RequiresDal with `framed <= 4096` fails: the DAL plumbing for that
    variant is dead code and the classification needs revisiting.

Representative instances:

- `ConfigureVerifier` / `ConfigureBridge`: real WOTS-signed configs
  (same construction as the sentinels).
- `Shield` / `Transfer` / `Unshield`: built with a 4096-byte
  `proof_bytes` stub, the cheapest size that keeps the RequiresDal
  classification unambiguous without requiring a full STARK proof in
  the test harness.
- `Withdraw`: small string fields + `u64`, representative of
  production.
- `DalPointer`: single-chunk pointer, representative of what the
  kernel emits.

The frame overhead is replicated as a local constant rather than
pulled from `tezos-smart-rollup-encoding` at dev-dep time: that crate
pins `tezos_data_encoding = 0.5.2` while `tzel-core` already depends
on `tezos_data_encoding = 0.6`, and introducing both majors into the
test build for a single 21-byte constant is not worth the friction.
The constant is documented with the layout it replicates and verified
empirically against `octez_kernel_message dal-pointer` output on a
real sr1 address.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
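The two-sided assertion can be sketched independently of the real encoder. In this illustration `framed_len` takes an already-computed unframed length instead of calling `encode_kernel_inbox_message`, and `check_routing` is a hypothetical helper name; the 4096-byte limit and the 21-byte frame overhead are the values stated in the commit message.

```rust
/// `sc_rollup_message_size_limit` as quoted in the commit message.
pub const L1_MESSAGE_SIZE_LIMIT: usize = 4096;
/// 1 tag byte + 20 bytes of `SmartRollupHash`.
pub const FRAME_OVERHEAD: usize = 21;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Routing {
    FitsL1,
    RequiresDal,
}

/// On-wire size seen by the L1 inbox for a given unframed length.
pub fn framed_len(unframed_len: usize) -> usize {
    unframed_len + FRAME_OVERHEAD
}

/// Two-sided check: the declared routing must agree with the framed size.
pub fn check_routing(declared: Routing, unframed_len: usize) -> Result<(), String> {
    let framed = framed_len(unframed_len);
    match (declared, framed > L1_MESSAGE_SIZE_LIMIT) {
        (Routing::FitsL1, false) | (Routing::RequiresDal, true) => Ok(()),
        (Routing::FitsL1, true) => Err(format!(
            "declared FitsL1 but framed size {} exceeds {}",
            framed, L1_MESSAGE_SIZE_LIMIT
        )),
        (Routing::RequiresDal, false) => Err(format!(
            "declared RequiresDal but framed size {} fits in {}",
            framed, L1_MESSAGE_SIZE_LIMIT
        )),
    }
}
```

The case the unframed sentinels miss is visible here: an unframed length of 4090 bytes is below 4096, yet its framed size is 4111 bytes, so a `FitsL1` classification fails the check.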
After bde1347 ("Add burned rollup fees and DAL producer note
outputs"), `apply_shield` debits `v + fee + producer_fee` from the
sender's public rollup balance, not just `v + fee`.

The tutorial instructed readers to deposit `300000` mutez before
shielding `200000` mutez, which covers `v + burn (100000)` exactly,
leaving the shield short by the configured DAL-producer fee
(`dal_fee = 1` mutez as set by the init-shadownet example). The shield
step then fails with "insufficient balance" and the tutorial cannot be
completed as written.

Bump the deposit to `300001` mutez and update the expected
post-deposit balance line to match. The extra paragraph explains the
math so the next reader understands why the deposit is not a round
number.

This matches the sandbox smoke fix in 44adaa8, one tree up. The
live-shadownet smoke script (`scripts/shadownet_live_e2e_smoke.sh`)
needs a similar bump plus `--dal-fee` / `--dal-fee-address` plumbing
in `init_profile`; that is more invasive (producer-address generation
+ operator fee-policy alignment) and is tracked as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small corrections flagged by the 2nd / 3rd adversarial review that
are cheap enough to land in the same PR rather than trailing as
issues.

1. `apps/wallet/src/lib.rs::kernel_message_kind` used to fold
   `Withdraw` / `ConfigureVerifier` / `ConfigureBridge` into the same
   `RollupSubmissionKind::Withdraw` arm. The wallet never submits
   admin `Configure*` messages (they flow through
   `octez_kernel_message` + `octez-client` directly), so that arm was
   dead code, and silently mislabelling an admin message as a
   `Withdraw` would be a hard-to-spot footgun if some future caller
   ever reached it. Split the arm: `Withdraw` keeps its own mapping,
   admin `Configure*` variants become an `unreachable!()` with a
   message pointing at the admin CLI.

2. `tezos/rollup-kernel/README.md` step 3 of "Adding a new oversized
   message type" still instructed the next contributor to mirror any
   new variant into `RollupSubmissionKind` and the operator's
   submission-matcher, which directly contradicts the design
   established in commit 5071e2e (admin-signed payloads bypass the
   operator on purpose, so a bearer-token leak cannot authorise admin
   injection). The old text also named a function
   (`submission_kind_matches_message`) that no longer exists. Rewrite
   step 3 to split the decision by submission path (user-facing via
   operator, admin-signed via `octez_kernel_message` directly), and
   add a pointer to the variant-exhaustive size test added in 5c308b4
   so the next reader understands it will compile-break on an
   unclassified variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`fetch_kernel_message_from_dal` had five near-identical arms that
checked a `KernelDalPayloadKind` tag against the decoded
`KernelInboxMessage` variant and returned the same "payload kind
mismatch" error when they disagreed.
Collapse the pattern into one `match` that produces a boolean
("does pointer.kind match message?") followed by a single early-
return with the shared error message. The `match pointer.kind` arms
are still exhaustive (no `_ =>`), so any new `KernelDalPayloadKind`
variant added in the future remains a compile error here until it is
classified — the structural guarantee is preserved.
Net: -37 / +18 lines, same behaviour, same error message, same
compile-time exhaustiveness guarantee.
Suggested by the second adversarial review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
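The collapsed kind-vs-content check described in the commit above might look roughly like this. The enums are payload-free stand-ins (the real variants carry request structs), and `check_kind` is a hypothetical name; only the "payload kind mismatch" error and the no-`_`-arm structure come from the commit message.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelDalPayloadKind {
    Shield,
    Transfer,
    Unshield,
    ConfigureVerifier,
    ConfigureBridge,
}

/// ASSUMPTION: payloads elided; the real variants carry data.
#[derive(Debug, PartialEq, Eq)]
pub enum KernelInboxMessage {
    Shield,
    Transfer,
    Unshield,
    Withdraw,
    ConfigureVerifier,
    ConfigureBridge,
    DalPointer,
}

/// One exhaustive match yields a boolean, followed by a single early
/// return with the shared error message.
pub fn check_kind(
    kind: KernelDalPayloadKind,
    message: &KernelInboxMessage,
) -> Result<(), &'static str> {
    let kind_matches = match kind {
        KernelDalPayloadKind::Shield => matches!(message, KernelInboxMessage::Shield),
        KernelDalPayloadKind::Transfer => matches!(message, KernelInboxMessage::Transfer),
        KernelDalPayloadKind::Unshield => matches!(message, KernelInboxMessage::Unshield),
        KernelDalPayloadKind::ConfigureVerifier => {
            matches!(message, KernelInboxMessage::ConfigureVerifier)
        }
        KernelDalPayloadKind::ConfigureBridge => {
            matches!(message, KernelInboxMessage::ConfigureBridge)
        }
    };
    if !kind_matches {
        return Err("payload kind mismatch");
    }
    Ok(())
}
```

Because the outer `match` has no wildcard arm, adding a sixth `KernelDalPayloadKind` variant stops the sketch from compiling until the new kind is paired with a message variant.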
…per)

Emit a framed `KernelInboxMessage::Withdraw` ready to be submitted via
`octez-client send smart rollup message`, with no signature and no
proof: the `KernelWithdrawReq` struct has no such fields, and neither
the kernel nor the operator ask for them on the user withdraw path.

Used by `scripts/sandbox_withdraw_auth_bypass_poc.sh` and referenced
in `docs/analysis/withdraw-auth-gap.md` to make the auth gap
reproducible.

Kept minimal (one subcommand, three string/integer parameters, same
encoding path as the existing `configure-*` paths) so that a later
commit can remove it cleanly once authentication is added to
`KernelWithdrawReq`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a kernel-level integration test (`bridge_flow.rs`) that exercises
the current auth gap on `KernelInboxMessage::Withdraw`:
1. Configure bridge ticketer (admin path, signed with the dev WOTS ask).
2. Deposit 500_001 mutez to `alice` via the legitimate bridge flow.
3. Submit a Withdraw with `sender = "alice"` and a recipient the
attacker controls — as an external inbox message, no signature,
no proof. Nothing in the caller's provenance is checked by the
kernel (the PVM has no access to the L1 tx author anyway, and the
`KernelWithdrawReq` struct has no sig/proof field that could bind
the withdraw to the actual owner of `alice`).
4. Assert the withdraw applied, alice's balance is zero, and the
outbox message credits the attacker's recipient.
The test is **positive-passing today** (the kernel accepts the
attack), which is exactly what documents the gap. When authentication
is ever added — e.g. a Tezos sig verified against an owner stored at
deposit time, or a per-account WOTS leaf registered and checked — this
test MUST be updated to expect a rejection and flip its assertions.
At that point it turns into a regression trap against accidentally
removing the auth.
Reference: `docs/analysis/withdraw-auth-gap.md` for the full analysis
and mitigation sketches. Sandbox-level reproduction at
`scripts/sandbox_withdraw_auth_bypass_poc.sh`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
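The shape of the gap exercised by the test above can be shown with a toy model. This is not the kernel: `ToyKernel`, `OutboxMessage`, the prefix-based recipient check (a stand-in for the real b58check parse), and the "invalid recipient" error string are all assumptions made for illustration. What it mirrors from the description is that the withdraw request carries only `sender`, `recipient`, and `amount`, and that nothing ties the submitter to `sender`.

```rust
use std::collections::HashMap;

/// Three fields, no signature, no proof (as described above).
pub struct KernelWithdrawReq {
    pub sender: String,
    pub recipient: String,
    pub amount: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub struct OutboxMessage {
    pub recipient: String,
    pub amount: u64,
}

/// Toy stand-in for kernel state: public balances plus an outbox.
#[derive(Default)]
pub struct ToyKernel {
    pub balances: HashMap<String, u64>,
    pub outbox: Vec<OutboxMessage>,
}

impl ToyKernel {
    pub fn deposit(&mut self, account: &str, amount: u64) {
        *self.balances.entry(account.to_string()).or_insert(0) += amount;
    }

    pub fn balance(&self, account: &str) -> u64 {
        self.balances.get(account).copied().unwrap_or(0)
    }

    /// Only a balance check and a recipient format check are performed;
    /// there is no input that could prove the caller owns `sender`.
    pub fn apply_withdraw(&mut self, req: &KernelWithdrawReq) -> Result<(), &'static str> {
        let bal = self.balance(&req.sender);
        if bal < req.amount {
            return Err("insufficient balance");
        }
        // ASSUMPTION: prefix test stands in for the real address parse.
        if !(req.recipient.starts_with("tz1") || req.recipient.starts_with("KT1")) {
            return Err("invalid recipient");
        }
        self.balances.insert(req.sender.clone(), bal - req.amount);
        self.outbox.push(OutboxMessage {
            recipient: req.recipient.clone(),
            amount: req.amount,
        });
        Ok(())
    }
}
```

Depositing 500_001 mutez to `alice` and then applying a withdraw with `sender = "alice"` and an attacker-chosen recipient succeeds in this model, which is the behaviour the integration test asserts against the real kernel.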
Two artefacts:
1. `docs/analysis/withdraw-auth-gap.md` — full write-up: what the
gap is, evidence in the code (KernelWithdrawReq struct,
apply_kernel_message match arm, operator's bearer-only auth,
direct-L1-submit bypass), threat model, blast radius, and
four mitigation sketches with tradeoffs (not an endorsement —
the design decision belongs upstream).
2. `scripts/sandbox_withdraw_auth_bypass_poc.sh` — end-to-end
reproduction on an octez sandbox. Forked from the DAL smoke,
keeps setup + originate + configure + deposit unchanged, then
replaces the shield fixture step with:
bootstrap2 (NOT operator) → octez_kernel_message withdraw
→ octez-client send smart rollup message from bootstrap2
→ kernel applies → alice's balance drained to 0
Terminates with "VULNERABILITY CONFIRMED" on success.
The sandbox PoC and the kernel-level test in `bridge_flow.rs`
(previous commit) are complementary: the Rust test exercises the
kernel PVM directly (runs in CI under `cargo test`, no sandbox
needed), while the sandbox PoC demonstrates the full end-to-end
attack path including the "submit from a non-operator tz1 via
`octez-client send smart rollup message`" step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

This branch documents and provides reproducible proofs of an authentication gap in the kernel's `Withdraw` path (and, by the same mechanism, `Shield`). Any entity that can submit an external inbox message to the rollup (i.e., any Tezos L1 account holder) can drain any known `public_account` to a recipient they control. The `KernelWithdrawReq` struct has no signature field, `apply_kernel_message` runs no authentication check, and the operator's single bearer token does not protect against direct `octez-client send smart rollup message` submissions.

This is not a bug report framed as "fix me in this PR". It is an analysis branch that (a) makes the gap trivially reproducible in CI, (b) walks through the evidence in the codebase, and (c) sketches the design space. The design decision (accept single-tenant as the intended model, add a Tezos sig, add a WOTS leaf per account, or something else) belongs upstream and is out of scope here.

Built on top of PR #3 (`fix/configure-messages-via-dal`) to keep the kernel tree consistent with admin-DAL routing.

The gap in one paragraph

`KernelWithdrawReq` is three fields (`sender`, `recipient`, `amount`): no signature, no proof. `apply_kernel_message` on `Withdraw` checks that `balance(sender) >= amount` and that `recipient` is a parseable tz1/KT1, writes an outbox message debiting `sender` and crediting `recipient`, and returns. The kernel has no access to the L1 tx source that carried the inbox message, and the struct carries no information that could bind the withdraw to the true owner of the public account. The tzel-operator's `submit_rollup_message` handler verifies a single bearer token shared across the whole instance, but this is irrelevant because `octez-client send smart rollup message` is callable by any Tezos account holder, bypassing the operator entirely. The Shield path has the same structural absence of sender authentication (the STARK proof binds `hash(sender)` but has no private input tying to ownership).

What this branch ships

- `docs/analysis/withdraw-auth-gap.md`
- `tezos/rollup-kernel/tests/bridge_flow.rs` (+104): `withdraw_poc_drains_unauthorized_sender`, which runs under `cargo test --test bridge_flow`, no sandbox needed. Configure → deposit 500_001 mutez to `alice` → unauthorized third party submits Withdraw with `sender = "alice"` → asserts the drain succeeded. Positive-passing today (documenting the gap); flip to negative-asserting once auth lands.
- `scripts/sandbox_withdraw_auth_bypass_poc.sh`: … `bootstrap2` (explicitly NOT the operator's `source_alias`) via `octez-client send smart rollup message`. Terminates with `VULNERABILITY CONFIRMED` on success.
- `tezos/rollup-kernel/src/bin/octez_kernel_message.rs` (+21): `withdraw` subcommand, a minimal PoC helper that emits a framed `KernelInboxMessage::Withdraw` ready for `octez-client`. Removes cleanly once authentication is added.

Evidence (abridged)

- `core/src/kernel_wire.rs:110-115`: `KernelWithdrawReq { sender, recipient, amount }`. No signature, no proof.
- `tezos/rollup-kernel/src/lib.rs:~1009`: Withdraw match arm with balance check, recipient format check (`TezosContract::from_b58check` at :509), and outbox write. Nothing compares `sender` with anything the caller can prove.
- `services/tzel/src/bin/tzel_operator.rs:304`: `require_bearer_auth` is a single-token check against `config.bearer_token`. No per-user mapping.
- `apps/wallet/src/lib.rs:6501`: the legitimate CLI withdraw constructs a `KernelWithdrawReq` with the user's chosen `sender` string and posts through the operator; the same construction is reachable by any third party.

The attack path used in the sandbox PoC is exactly:

    octez-client send smart rollup message "hex:[ \"<framed withdraw>\" ]" from bootstrap2

From `bootstrap2`, which is not the operator source. Kernel processes. Balance drains.

Verification

Open question for the kernel maintainer

The design intent here needs to be stated explicitly before any further UX / multi-tenant deployment work proceeds. Specifically:

1. Is the current model single-tenant by intent? If yes, documenting this constraint in deployment guides + operator runbooks + wallet UX would prevent misuse. In that case, these PoCs serve as a regression trap rather than a fix target.
2. Or is sender authentication at the kernel level planned? The analysis doc sketches two families (Tezos-sig-bound-at-deposit, WOTS-leaf-per-account); both are post-quantum-compatible with the existing kernel structure. If this is the intended direction, it shapes downstream work: bridge contract changes, wallet submission flow, operator submission API, etc.

Follow-up hygiene

This branch does not propose a fix. The `withdraw` subcommand in `octez_kernel_message.rs` is a PoC helper; it should be removed once authentication lands. The Rust test and sandbox script are kept as regression traps.

🤖 Generated with Claude Code