
Kernel auth gap on Withdraw — reproducible PoC + analysis #4

Closed
saroupille wants to merge 11 commits into trilitech:main from saroupille:analysis/withdraw-auth-gap-poc


@saroupille

Summary

This branch documents and provides reproducible proofs of an authentication gap in the kernel's Withdraw path (and, by the same mechanism, Shield). Any entity that can submit an external inbox message to the rollup — i.e., any Tezos L1 account holder — can drain any known public_account to a recipient they control. The KernelWithdrawReq struct has no signature field, apply_kernel_message runs no authentication check, and the operator's single bearer token does not protect against direct octez-client send smart rollup message submissions.

This is not a bug report framed as "fix me in this PR". It is an analysis branch that (a) makes the gap trivially reproducible in CI, (b) walks through the evidence in the codebase, and (c) sketches the design space. The design decision — accept single-tenant as the intended model, add a Tezos sig, add a WOTS leaf per account, or something else — belongs upstream and is out of scope here.

Built on top of PR #3 (fix/configure-messages-via-dal) to keep the kernel tree consistent with admin-DAL routing.

The gap in one paragraph

KernelWithdrawReq has three fields (sender, recipient, amount) and carries no signature and no proof. On Withdraw, apply_kernel_message checks that balance(sender) >= amount and that recipient is a parseable tz1/KT1, writes an outbox message debiting sender and crediting recipient, and returns. The kernel has no access to the L1 tx source that carried the inbox message, and the struct carries no information that could bind the withdraw to the true owner of the public account. The tzel-operator's submit_rollup_message handler verifies a single bearer token shared across the whole instance, but this is irrelevant because octez-client send smart rollup message is callable by any Tezos account holder, bypassing the operator entirely. The Shield path has the same structural absence of sender authentication: the STARK proof binds hash(sender) but has no private input tying it to ownership of the account.
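
The paragraph above can be modelled in a few lines. This is a toy ledger under stated assumptions, not the kernel's code: `Ledger`, `OutboxCredit`, and the prefix-only recipient check are hypothetical stand-ins, and only the three `KernelWithdrawReq` fields come from the PR. The point it illustrates is that nothing in the handler's inputs identifies who submitted the message.

```rust
use std::collections::HashMap;

/// The three fields described in the PR. There is no signature or proof field.
#[derive(Debug, Clone)]
pub struct KernelWithdrawReq {
    pub sender: String,
    pub recipient: String,
    pub amount: u64,
}

/// Hypothetical stand-in for the outbox entry the kernel writes.
#[derive(Debug, PartialEq, Eq)]
pub struct OutboxCredit {
    pub recipient: String,
    pub amount: u64,
}

/// Toy ledger: public account name -> balance in mutez.
#[derive(Default)]
pub struct Ledger {
    balances: HashMap<String, u64>,
    pub outbox: Vec<OutboxCredit>,
}

impl Ledger {
    pub fn deposit(&mut self, account: &str, amount: u64) {
        *self.balances.entry(account.to_string()).or_insert(0) += amount;
    }

    pub fn balance(&self, account: &str) -> u64 {
        self.balances.get(account).copied().unwrap_or(0)
    }

    /// Simplified Withdraw arm: balance check, recipient format check,
    /// outbox write. No parameter describes the submitter of the message.
    pub fn apply_withdraw(&mut self, req: &KernelWithdrawReq) -> Result<(), &'static str> {
        let bal = self.balance(&req.sender);
        if bal < req.amount {
            return Err("insufficient balance");
        }
        // ASSUMPTION: a prefix check stands in for the real b58check parse.
        if !(req.recipient.starts_with("tz1") || req.recipient.starts_with("KT1")) {
            return Err("invalid recipient");
        }
        self.balances.insert(req.sender.clone(), bal - req.amount);
        self.outbox.push(OutboxCredit {
            recipient: req.recipient.clone(),
            amount: req.amount,
        });
        Ok(())
    }
}
```

In this model, depositing 500_001 mutez to `alice` and then applying a request with `sender = "alice"` and any `tz1...` recipient succeeds regardless of who built the request, which mirrors the behaviour the PoC asserts.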

What this branch ships

| File | Purpose |
| --- | --- |
| docs/analysis/withdraw-auth-gap.md | Full write-up: evidence with line references, threat model (public_accounts are enumerable via durable state RPC + bridge deposits are public on L1), blast radius, and four mitigation sketches with tradeoffs. |
| tezos/rollup-kernel/tests/bridge_flow.rs (+104) | withdraw_poc_drains_unauthorized_sender: runs under cargo test --test bridge_flow, no sandbox needed. Configure → deposit 500_001 mutez to alice → unauthorized third party submits Withdraw with sender = "alice" → asserts the drain succeeded. Positive-passing today (documenting the gap); flip to negative-asserting once auth lands. |
| scripts/sandbox_withdraw_auth_bypass_poc.sh | End-to-end sandbox smoke that forks the DAL smoke, keeps setup + deposit, then attacks: submit a Withdraw from bootstrap2 (explicitly NOT the operator's source_alias) via octez-client send smart rollup message. Terminates with VULNERABILITY CONFIRMED on success. |
| tezos/rollup-kernel/src/bin/octez_kernel_message.rs (+21) | withdraw subcommand: minimal PoC helper that emits a framed KernelInboxMessage::Withdraw ready for octez-client. Removes cleanly once authentication is added. |

Evidence (abridged)

  • core/src/kernel_wire.rs:110-115: KernelWithdrawReq { sender, recipient, amount }. No signature, no proof.
  • tezos/rollup-kernel/src/lib.rs:~1009: Withdraw match arm with a balance check, a recipient format check (TezosContract::from_b58check at :509), and an outbox write. Nothing compares sender with anything the caller can prove.
  • services/tzel/src/bin/tzel_operator.rs:304: require_bearer_auth is a single-token check against config.bearer_token. No per-user mapping.
  • apps/wallet/src/lib.rs:6501: the legitimate CLI withdraw constructs a KernelWithdrawReq with the user's chosen sender string and posts through the operator; the same construction is reachable by any third party.

The attack path used in the sandbox PoC is exactly:

octez-client send smart rollup message "hex:[ \"<framed withdraw>\" ]" from bootstrap2

The message is sent from bootstrap2, which is not the operator source. The kernel processes it and the balance drains.

Verification

$ cargo test --test bridge_flow withdraw_poc_drains_unauthorized_sender
cargo test: 1 passed, 8 filtered out (1 suite, 0.02s)

$ TZEL_OCTEZ_SANDBOX_PRESERVE=1 ./scripts/sandbox_withdraw_auth_bypass_poc.sh
...
==========================================================
VULNERABILITY CONFIRMED: alice's 500001 mutez was drained
by a withdraw message signed by bootstrap2 (not operator).
No bearer token was needed.  No proof was needed.
==========================================================

Open question for the kernel maintainer

The design intent here needs to be stated explicitly before any further UX / multi-tenant deployment work proceeds. Specifically:

  1. Is the current model single-tenant by intent? If yes, documenting this constraint in deployment guides + operator runbooks + wallet UX would prevent misuse. In that case, these PoCs serve as a regression trap rather than a fix target.

  2. Or is sender authentication at the kernel level planned? The analysis doc sketches two families (Tezos-sig-bound-at-deposit, WOTS-leaf-per-account); both are post-quantum-compatible with the existing kernel structure. If this is the intended direction, it shapes downstream work: bridge contract changes, wallet submission flow, operator submission API, etc.

Follow-up hygiene

This branch does not propose a fix. The withdraw subcommand in octez_kernel_message.rs is a PoC helper; it should be removed once authentication lands. The Rust test and sandbox script are kept as regression traps.

🤖 Generated with Claude Code

saroupille and others added 11 commits April 21, 2026 00:13
A WOTS-signed `ConfigureVerifier` KernelInboxMessage serializes to
4923 bytes, and `ConfigureBridge` to 4835 bytes.  Both exceed the
Tezos smart-rollup protocol constant `sc_rollup_message_size_limit`
(4096 bytes), so they cannot transit through the L1 external-message
path that `octez-client send smart rollup message` uses.

This commit extends the existing DAL delivery path — already routing
Shield / Transfer / Unshield payloads too large for L1 — to cover the
two admin configuration messages.

Kernel-side:
  - `core/src/kernel_wire.rs`
      * `KernelDalPayloadKind` gains `ConfigureVerifier` (wire tag 3)
        and `ConfigureBridge` (wire tag 4).
      * `kernel_dal_payload_kind_{to,from}_wire` handle both.
      * A comment clarifies that tag numbering here is independent of
        the tags used by `WireKernelInboxMessage`.
      * `KERNEL_WIRE_VERSION` bumped to 10: older clients that read a
        `DalPointer` with `kind=3|4` now see an explicit envelope
        version mismatch rather than an opaque tag error.
  - `tezos/rollup-kernel/src/lib.rs`
      * `fetch_kernel_message_from_dal` accepts the two new kinds.
      * The dispatch match is reshaped to be exhaustive on
        `KernelDalPayloadKind`: any future variant will be a compile
        error until handled here, instead of silently hitting the old
        `_ => Err("kind mismatch")` arm.
      * Docstring explains that the kind-vs-content check is a
        defense-in-depth control (forces honest labeling for auditors)
        and that authenticity itself comes from the WOTS signature /
        STARK proof inside the payload — DAL is a public bulletin
        board with no transport-level authentication.
      * `dal_payload_kind_name` labels the two new variants.
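
A minimal sketch of the tag mapping described for `kernel_dal_payload_kind_{to,from}_wire`. Only tags 3 (`ConfigureVerifier`) and 4 (`ConfigureBridge`) are stated in this commit message; the tags 0 to 2 for the three pre-existing kinds, the function names, and the error type are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelDalPayloadKind {
    Shield,
    Transfer,
    Unshield,
    ConfigureVerifier,
    ConfigureBridge,
}

/// Exhaustive match with no wildcard arm: adding a variant to the enum
/// stops this function from compiling until a tag is assigned.
pub fn kind_to_wire(kind: KernelDalPayloadKind) -> u8 {
    match kind {
        // ASSUMPTION: tags 0..=2 are illustrative, not taken from the source.
        KernelDalPayloadKind::Shield => 0,
        KernelDalPayloadKind::Transfer => 1,
        KernelDalPayloadKind::Unshield => 2,
        // Stated in the commit message.
        KernelDalPayloadKind::ConfigureVerifier => 3,
        KernelDalPayloadKind::ConfigureBridge => 4,
    }
}

/// Decoding must keep a fallback arm because the input is an open `u8`.
pub fn kind_from_wire(tag: u8) -> Result<KernelDalPayloadKind, String> {
    match tag {
        0 => Ok(KernelDalPayloadKind::Shield),
        1 => Ok(KernelDalPayloadKind::Transfer),
        2 => Ok(KernelDalPayloadKind::Unshield),
        3 => Ok(KernelDalPayloadKind::ConfigureVerifier),
        4 => Ok(KernelDalPayloadKind::ConfigureBridge),
        other => Err(format!("unknown DAL payload kind tag {other}")),
    }
}
```

A reader that predates the new variants would hit the fallback arm for tags 3 and 4, which is the opaque tag error the `KERNEL_WIRE_VERSION` bump is meant to pre-empt.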

Tooling:
  - `tezos/rollup-kernel/src/bin/octez_kernel_message.rs`
      * New `configure-verifier-payload` and `configure-bridge-payload`
        subcommands emit the raw unframed KernelInboxMessage hex — the
        input for chunking and DAL publication.
      * `dal-pointer` accepts the `configure_verifier` and
        `configure_bridge` kind tokens.
      * When `TZEL_ROLLUP_CONFIG_ADMIN_ASK_HEX` is unset in debug
        builds, the fallback to the public dev ask now emits a
        `eprintln!` warning; silent fallback paired with a
        release-profile kernel built without admin material would be a
        footgun.
  - `tezos/rollup-kernel/build.rs` (new)
      * Emits `cargo:rerun-if-env-changed=` for the three admin
        material env vars consumed by `option_env!()` in the kernel
        source (the config-admin public seed plus the verifier and
        bridge config-admin WOTS leaves), so a rotation of the admin
        material always re-bakes the WASM.
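
As an illustration of the `build.rs` item above, the sketch below builds the three `cargo:rerun-if-env-changed=` directives. The variable names are hypothetical: the commit message only gives the pattern `TZEL_ROLLUP_CONFIG_ADMIN_*_HEX` and a description of the three values, so the exact names here are assumptions.

```rust
/// ASSUMPTION: illustrative names following the TZEL_ROLLUP_CONFIG_ADMIN_*_HEX
/// pattern; the real variable names are not given in the commit message.
const ADMIN_ENV_VARS: [&str; 3] = [
    "TZEL_ROLLUP_CONFIG_ADMIN_PUB_SEED_HEX",
    "TZEL_ROLLUP_CONFIG_ADMIN_VERIFIER_LEAF_HEX",
    "TZEL_ROLLUP_CONFIG_ADMIN_BRIDGE_LEAF_HEX",
];

/// Builds one Cargo directive per variable. Cargo re-runs the build script,
/// and therefore rebuilds the crate, when any listed variable changes.
pub fn rerun_directives() -> Vec<String> {
    ADMIN_ENV_VARS
        .iter()
        .map(|name| format!("cargo:rerun-if-env-changed={name}"))
        .collect()
}

// In an actual build.rs, `main` would print each directive to stdout:
//
//     fn main() {
//         for line in rerun_directives() {
//             println!("{line}");
//         }
//     }
```

Without such directives, values read through `option_env!()` are captured at compile time and Cargo has no reason to recompile when only the environment changes.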

Tests:
  - `core/src/kernel_wire.rs::tests`
      * Size sentinels for `ConfigureVerifier` (4923 bytes) and
        `ConfigureBridge` (4835 bytes).  The tests pass today; they
        fail loudly if a future encoding change shifts either size,
        forcing a review of DAL routing assumptions.

Operator and wallet-server are intentionally left unchanged: by
design, admin config messages flow directly from an admin's
`octez_kernel_message` + `octez-client` (with the admin's own L1 key
and WOTS ask), never through the user-facing operator API.  This keeps
the operator's interface narrow, preserves admin availability
independent of operator health, and prevents a bearer-token leak from
granting the ability to inject admin configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an "Admin configuration messages and DAL routing" section to the
rollup-kernel README describing:

  - why `ConfigureVerifier` and `ConfigureBridge` must use DAL (WOTS
    signature bloat pushes each message above 4096 bytes);
  - the delivery flow end-to-end (admin computes unframed payload,
    chunks to DAL, injects a `DalPointer` on L1; kernel reassembles
    pages, verifies the payload hash, decodes, and dispatches);
  - a checklist for adding a new oversized message type in the
    future (wire tag, dispatch arm, CLI subcommand, size sentinel).
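
The delivery flow in the second bullet can be sketched as a chunk-and-reassemble round trip. This is a simplified model under stated assumptions: the `DalPointer` layout, the function names, and the page size parameter are hypothetical, and `DefaultHasher` stands in for whatever cryptographic hash the kernel actually uses so that the example stays std-only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// ASSUMPTION: stand-in digest. A real implementation would use a
/// cryptographic hash; `DefaultHasher` is NOT collision resistant.
fn payload_digest(payload: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    payload.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical pointer layout: what the admin injects on L1.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DalPointer {
    pub payload_len: usize,
    pub payload_hash: u64,
    pub page_count: usize,
}

/// Admin side: split the unframed payload into fixed-size, zero-padded pages.
pub fn chunk_payload(payload: &[u8], page_size: usize) -> (Vec<Vec<u8>>, DalPointer) {
    assert!(page_size > 0, "page_size must be non-zero");
    let pages: Vec<Vec<u8>> = payload
        .chunks(page_size)
        .map(|chunk| {
            let mut page = chunk.to_vec();
            page.resize(page_size, 0); // pad the final page
            page
        })
        .collect();
    let pointer = DalPointer {
        payload_len: payload.len(),
        payload_hash: payload_digest(payload),
        page_count: pages.len(),
    };
    (pages, pointer)
}

/// Kernel side: reassemble pages, strip padding, and verify the hash
/// before the payload is decoded and dispatched.
pub fn reassemble_payload(pages: &[Vec<u8>], pointer: &DalPointer) -> Result<Vec<u8>, &'static str> {
    if pages.len() != pointer.page_count {
        return Err("page count mismatch");
    }
    let mut payload: Vec<u8> = pages.iter().flatten().copied().collect();
    if payload.len() < pointer.payload_len {
        return Err("payload shorter than pointer length");
    }
    payload.truncate(pointer.payload_len);
    if payload_digest(&payload) != pointer.payload_hash {
        return Err("payload hash mismatch");
    }
    Ok(payload)
}
```

The hash check only ties the pages to the pointer. As the earlier commit notes, authenticity of the content still has to come from the WOTS signature or STARK proof inside the payload.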

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five fixes bundled to get `octez_rollup_sandbox_dal_smoke.sh` through
configure / deposit on an Octez master built from recent trunk:

1. `attestation_lags` invariant
   The Alpha protocol gained a restriction (tezos master 8499ce19ac,
   2025-12-04) that the last element of `attestation_lags` must equal
   `attestation_lag`.  The mockup generates `attestation_lags =
   [1,2,3,4,5]` by default, and the script overrides `attestation_lag`
   to 2 via `DAL_ATTESTATION_LAG`, so `activate_alpha` fails.  Force
   `attestation_lags = [attestation_lag]` in `build_alpha_sandbox_params`.

2. Configure messages via DAL
   `configure-verifier` (4944 bytes framed) and `configure-bridge`
   (4856 bytes framed) exceed `sc_rollup_message_size_limit = 4096`,
   so the old direct `octez-client send smart rollup message` path
   fails at encoding.  Route both via the DAL delivery path instead,
   using the new `configure-{verifier,bridge}-payload` CLI subcommands
   and a generalized `publish_payload_via_dal_and_inject_pointer`
   helper factored out of `publish_shield_via_dal_and_inject_pointer`.

3. Admin material baked into the release kernel WASM
   The release kernel's `authenticate_{verifier,bridge}_config` only
   accepts admin-signed payloads when the admin leaves are baked in at
   compile time via `TZEL_ROLLUP_CONFIG_ADMIN_*_HEX`.  Without this,
   the kernel silently rejects every configure payload.  Call
   `scripts/prepare_rollup_config_admin.sh` before the kernel build
   and source both the runtime (secret ask) and build (public leaves)
   env files.

4. `xxd -ps -c 0` newline workaround
   On our xxd version `-c 0` still wraps at ~60 characters, inserting
   newlines that silently break string matches and URL / Michelson
   arg construction.  Pipe to `tr -d '\n'` on the affected call sites:
   `await_bridge_ticketer`, `deposit_to_bridge`, and the balance-key
   construction in `main`.  Without this, `await_bridge_ticketer`
   reports "ticketer did not appear" even after the kernel applied
   the configuration.

5. Caveat on `set -a` scope
   A comment makes explicit that exporting
   `TZEL_ROLLUP_CONFIG_ADMIN_ASK_HEX` to every descendant process is
   acceptable in sandbox (ephemeral per-workdir ask) but must not be
   copied to a production runner.

After these fixes, the smoke reaches and applies configure-verifier +
configure-bridge + the initial bridge deposit; the subsequent fixture
shield step is not within the scope of this patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bde1347 ("Add burned rollup fees and DAL producer note outputs")
made two changes the sandbox smoke script never absorbed:

1. apply_shield now debits `v + fee + producer_fee` (not just `v`)
   from the sender's public balance:

       let debit = req.v
           .checked_add(req.fee)
           .and_then(|value| value.checked_add(req.producer_fee))?;
       if bal < debit { return Err("insufficient balance"); }

   The fixture metadata still exposed only `shield_amount: fixture.shield.v`
   and the sandbox deposited exactly that.  Post-bde1347 the balance is
   short by `fee + producer_fee`, the shield fails with "insufficient
   balance", and the public drain never happens.

   For the checked-in fixture (v=400_000, fee=100_000, producer_fee=1
   mutez) the required deposit is 500_001 mutez instead of 400_000.

   Rename `shield_amount` to `shield_bridge_deposit` in `FixtureMetadata`
   and compute it as `v + fee + producer_fee`.  The sandbox script picks
   up the new field and uses the same value for both the bridge deposit
   and the pre-shield balance assertion.

2. apply_shield now appends *two* notes to the Merkle tree per shield:
   the sender's own commitment and the producer's compensation commitment.
   The smoke's post-shield assertion still expected
   `/tzel/v1/state/tree/size == 1` — the pre-fees value — which makes the
   smoke stall at line 698 even though the shield applied cleanly and the
   public balance drained to zero.

   Update the assertion to `2` and leave a comment pointing at
   apply_shield so the next person knows why.
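
The arithmetic in item 1 can be checked with a short sketch. The function names below are hypothetical helpers written for this illustration; the debit formula and the fixture values (v=400_000, fee=100_000, producer_fee=1) are the ones quoted above.

```rust
/// Total debited from the sender's public balance by a shield:
/// `v + fee + producer_fee`, with overflow reported as `None`.
pub fn shield_bridge_deposit(v: u64, fee: u64, producer_fee: u64) -> Option<u64> {
    v.checked_add(fee)?.checked_add(producer_fee)
}

/// Mirrors the balance check quoted from `apply_shield`: returns the
/// remaining balance, or an error if the deposit does not cover the debit.
pub fn balance_after_shield(
    balance: u64,
    v: u64,
    fee: u64,
    producer_fee: u64,
) -> Result<u64, &'static str> {
    let debit = shield_bridge_deposit(v, fee, producer_fee).ok_or("amount overflow")?;
    if balance < debit {
        return Err("insufficient balance");
    }
    Ok(balance - debit)
}
```

With the checked-in fixture values, a deposit of 400_000 mutez is rejected while 500_001 mutez leaves a balance of exactly zero, which is why the deposit and the pre-shield balance assertion share the one `shield_bridge_deposit` value.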

These two regressions were hidden behind an earlier one (configure
messages exceeding sc_rollup_message_size_limit, fixed in 5071e2e on
this branch): the script never got past configure-verifier, so neither
bde1347-induced break ever executed.  Once configures route through
DAL, the smoke advances through the shield and both breaks surface in
sequence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two size sentinels next to this test lock exact byte counts for
`ConfigureVerifier` and `ConfigureBridge`.  They would have caught the
specific regression they target (a WOTS signature change growing those
two messages past `sc_rollup_message_size_limit = 4096`) — but only
because someone knew to write them *after* the regression surfaced.

A more general failure mode: a new field lands on any
`KernelInboxMessage` variant, pushes its serialized size past 4096,
and nothing fails until an operator tries `octez-client send smart
rollup message` against a real node and gets rejected at the L1 inbox.
That is how commit 2c45d9c broke admin config: unit tests all passed;
the break surfaced weeks later in the sandbox smoke.

Add a third test that makes the invariant structural:

  - A `Routing` enum (`FitsL1` / `RequiresDal`) classifies every
    variant.  `required_routing` is an **exhaustive match** on
    `KernelInboxMessage` with no `_` arm — the compiler forces any
    future variant author to classify the new message before the
    crate builds.

  - `framed_len` computes the on-wire size the L1 inbox actually
    sees, i.e. `encode_kernel_inbox_message(...).len() +
    ExternalMessageFrame::Targetted` overhead (21 bytes: 1 tag + 20
    bytes of `SmartRollupHash`).  The existing sentinels measure the
    unframed envelope and under-count by 21 bytes — a message that
    lands just below 4096 unframed can still be rejected on wire.

  - The assertion is two-sided:
      * FitsL1 with `framed > 4096` fails: the L1 routing is broken.
      * RequiresDal with `framed <= 4096` fails: the DAL plumbing
        for that variant is dead code and the classification needs
        revisiting.

Representative instances:
  - `ConfigureVerifier` / `ConfigureBridge`: real WOTS-signed configs
    (same construction as the sentinels).
  - `Shield` / `Transfer` / `Unshield`: built with a 4096-byte
    `proof_bytes` stub — the cheapest size that keeps the RequiresDal
    classification unambiguous without requiring a full STARK proof
    in the test harness.
  - `Withdraw`: small string fields + `u64`, representative of
    production.
  - `DalPointer`: single-chunk pointer, representative of what the
    kernel emits.

The frame overhead is replicated as a local constant rather than
pulled from `tezos-smart-rollup-encoding` at dev-dep time: that crate
pins `tezos_data_encoding = 0.5.2` while `tzel-core` already depends
on `tezos_data_encoding = 0.6`, and introducing both majors into the
test build for a single 21-byte constant is not worth the friction.
The constant is documented with the layout it replicates and
verified empirically against `octez_kernel_message dal-pointer` output
on a real sr1 address.
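
A condensed sketch of the structural test described above. It is not the test from the crate: `MessageKind` is a fieldless stand-in for `KernelInboxMessage`, sizes are passed in rather than computed by `encode_kernel_inbox_message`, and `check_routing` is a hypothetical name. The 4096-byte limit, the 21-byte frame overhead, and the two-sided assertion follow the commit message.

```rust
/// `sc_rollup_message_size_limit`, as quoted in the commit message.
const L1_MESSAGE_SIZE_LIMIT: usize = 4096;
/// Targeted frame overhead: 1 tag byte + 20 bytes of rollup address hash.
const TARGETED_FRAME_OVERHEAD: usize = 21;

/// ASSUMPTION: fieldless stand-in for the real `KernelInboxMessage`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MessageKind {
    ConfigureVerifier,
    ConfigureBridge,
    Shield,
    Transfer,
    Unshield,
    Withdraw,
    DalPointer,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Routing {
    FitsL1,
    RequiresDal,
}

/// Exhaustive match with no `_` arm: a new variant fails to compile
/// until its author classifies it here.
pub fn required_routing(kind: MessageKind) -> Routing {
    match kind {
        MessageKind::ConfigureVerifier
        | MessageKind::ConfigureBridge
        | MessageKind::Shield
        | MessageKind::Transfer
        | MessageKind::Unshield => Routing::RequiresDal,
        MessageKind::Withdraw | MessageKind::DalPointer => Routing::FitsL1,
    }
}

/// Size the L1 inbox sees: unframed envelope plus frame overhead.
pub fn framed_len(unframed_len: usize) -> usize {
    unframed_len + TARGETED_FRAME_OVERHEAD
}

/// Two-sided check: both misclassification directions are errors.
pub fn check_routing(kind: MessageKind, unframed_len: usize) -> Result<(), String> {
    let framed = framed_len(unframed_len);
    match (required_routing(kind), framed > L1_MESSAGE_SIZE_LIMIT) {
        (Routing::FitsL1, true) => Err(format!(
            "{kind:?} is classified FitsL1 but is {framed} bytes framed"
        )),
        (Routing::RequiresDal, false) => Err(format!(
            "{kind:?} is classified RequiresDal but fits L1 at {framed} bytes framed"
        )),
        (Routing::FitsL1, false) | (Routing::RequiresDal, true) => Ok(()),
    }
}
```

The framing matters at the boundary: a `FitsL1` message of 4080 unframed bytes passes a naive `< 4096` check but is 4101 bytes on the wire, so `check_routing` rejects it.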

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After bde1347 ("Add burned rollup fees and DAL producer note outputs"),
`apply_shield` debits `v + fee + producer_fee` from the sender's public
rollup balance, not just `v + fee`.  The tutorial instructed readers to
deposit `300000` mutez before shielding `200000` mutez, which covers `v
+ burn (100000)` exactly — leaving the shield short by the configured
DAL-producer fee (`dal_fee = 1` mutez as set by the init-shadownet
example).  The shield step then fails with "insufficient balance" and
the tutorial cannot be completed as written.

Bump the deposit to `300001` mutez and update the expected post-deposit
balance line to match.  The extra paragraph explains the math so the
next reader understands why the deposit is not a round number.

This matches the sandbox smoke fix in 44adaa8, one tree up.  The
live-shadownet smoke script (`scripts/shadownet_live_e2e_smoke.sh`)
needs a similar bump plus `--dal-fee` / `--dal-fee-address` plumbing in
`init_profile`; that is more invasive (producer-address generation +
operator fee-policy alignment) and is tracked as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small corrections flagged by the 2nd / 3rd adversarial review that
are cheap enough to land in the same PR rather than trailing as issues.

1. `apps/wallet/src/lib.rs::kernel_message_kind` used to fold
   `Withdraw` / `ConfigureVerifier` / `ConfigureBridge` into the same
   `RollupSubmissionKind::Withdraw` arm.  The wallet never submits
   admin `Configure*` messages (they flow through
   `octez_kernel_message` + `octez-client` directly), so that arm was
   dead code — and silently mislabelling an admin message as a
   `Withdraw` would be a hard-to-spot footgun if some future caller
   ever reached it.  Split the arm: `Withdraw` keeps its own
   mapping, admin `Configure*` variants become an `unreachable!()`
   with a message pointing at the admin CLI.

2. `tezos/rollup-kernel/README.md` step 3 of "Adding a new oversized
   message type" still instructed the next contributor to mirror any
   new variant into `RollupSubmissionKind` and the operator's
   submission-matcher — which directly contradicts the design
   established in commit 5071e2e (admin-signed payloads bypass the
   operator on purpose, so a bearer-token leak cannot authorise
   admin injection).  The old text also named a function
   (`submission_kind_matches_message`) that no longer exists.

   Rewrite step 3 to split the decision by submission path
   (user-facing via operator, admin-signed via `octez_kernel_message`
   directly), and add a pointer to the variant-exhaustive size test
   added in 5c308b4 so the next reader understands it will compile-
   break on an unclassified variant.
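
The split described in item 1 might look roughly like the sketch below. Both enums are simplified, fieldless stand-ins, and the non-`Withdraw` variants of `RollupSubmissionKind` are assumptions; only the `Withdraw` mapping and the `unreachable!()` for admin `Configure*` variants come from the commit message.

```rust
/// ASSUMPTION: fieldless stand-in for the real message enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelInboxMessage {
    Shield,
    Transfer,
    Unshield,
    Withdraw,
    ConfigureVerifier,
    ConfigureBridge,
}

/// ASSUMPTION: only `Withdraw` is named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RollupSubmissionKind {
    Shield,
    Transfer,
    Unshield,
    Withdraw,
}

/// `Withdraw` keeps its own mapping; admin variants no longer share it.
pub fn kernel_message_kind(message: &KernelInboxMessage) -> RollupSubmissionKind {
    match message {
        KernelInboxMessage::Shield => RollupSubmissionKind::Shield,
        KernelInboxMessage::Transfer => RollupSubmissionKind::Transfer,
        KernelInboxMessage::Unshield => RollupSubmissionKind::Unshield,
        KernelInboxMessage::Withdraw => RollupSubmissionKind::Withdraw,
        // The wallet never submits admin messages; reaching this arm is a bug.
        KernelInboxMessage::ConfigureVerifier | KernelInboxMessage::ConfigureBridge => {
            unreachable!(
                "admin Configure* messages are submitted via octez_kernel_message \
                 + octez-client, not through the wallet"
            )
        }
    }
}
```

Panicking is a deliberate trade-off in this sketch: a caller that reaches the admin arm fails loudly instead of having its message labelled as a `Withdraw`.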

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`fetch_kernel_message_from_dal` had five near-identical arms that
checked a `KernelDalPayloadKind` tag against the decoded
`KernelInboxMessage` variant and returned the same "payload kind
mismatch" error when they disagreed.

Collapse the pattern into one `match` that produces a boolean
("does pointer.kind match message?") followed by a single early-
return with the shared error message.  The `match pointer.kind` arms
are still exhaustive (no `_ =>`), so any new `KernelDalPayloadKind`
variant added in the future remains a compile error here until it is
classified — the structural guarantee is preserved.

Net: -37 / +18 lines, same behaviour, same error message, same
compile-time exhaustiveness guarantee.
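
The collapsed shape can be sketched as follows. The enums are fieldless stand-ins and `check_payload_kind` is a hypothetical name for the relevant part of `fetch_kernel_message_from_dal`; the structure (an exhaustive `match` on the pointer kind producing a boolean, then a single early return) is the one described above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelDalPayloadKind {
    Shield,
    Transfer,
    Unshield,
    ConfigureVerifier,
    ConfigureBridge,
}

/// ASSUMPTION: fieldless stand-in for the decoded message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelInboxMessage {
    Shield,
    Transfer,
    Unshield,
    Withdraw,
    ConfigureVerifier,
    ConfigureBridge,
}

/// Checks that the kind announced in the pointer matches the decoded message.
pub fn check_payload_kind(
    kind: KernelDalPayloadKind,
    message: &KernelInboxMessage,
) -> Result<(), &'static str> {
    // Exhaustive on `kind` with no `_` arm, so a new kind is a compile error here.
    let kind_matches = match kind {
        KernelDalPayloadKind::Shield => matches!(message, KernelInboxMessage::Shield),
        KernelDalPayloadKind::Transfer => matches!(message, KernelInboxMessage::Transfer),
        KernelDalPayloadKind::Unshield => matches!(message, KernelInboxMessage::Unshield),
        KernelDalPayloadKind::ConfigureVerifier => {
            matches!(message, KernelInboxMessage::ConfigureVerifier)
        }
        KernelDalPayloadKind::ConfigureBridge => {
            matches!(message, KernelInboxMessage::ConfigureBridge)
        }
    };
    // Single early return with the shared error message.
    if !kind_matches {
        return Err("payload kind mismatch");
    }
    Ok(())
}
```

Note that the wildcard lives inside each `matches!` (on the message), not on the outer `match` (on the kind), which is what preserves the compile-time guarantee.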

Suggested by the second adversarial review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…per)

Emit a framed `KernelInboxMessage::Withdraw` ready to be submitted via
`octez-client send smart rollup message`, with no signature and no
proof — the `KernelWithdrawReq` struct has no such fields, and neither
the kernel nor the operator ask for them on the user withdraw path.

Used by `scripts/sandbox_withdraw_auth_bypass_poc.sh` and referenced
in `docs/analysis/withdraw-auth-gap.md` to make the auth gap
reproducible.  Kept minimal (one subcommand, three string/integer
parameters, same encoding path as the existing `configure-*` paths)
so that a later commit can remove it cleanly once authentication is
added to `KernelWithdrawReq`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a kernel-level integration test (`bridge_flow.rs`) that exercises
the current auth gap on `KernelInboxMessage::Withdraw`:

  1. Configure bridge ticketer (admin path, signed with the dev WOTS ask).
  2. Deposit 500_001 mutez to `alice` via the legitimate bridge flow.
  3. Submit a Withdraw with `sender = "alice"` and a recipient the
     attacker controls — as an external inbox message, no signature,
     no proof.  Nothing in the caller's provenance is checked by the
     kernel (the PVM has no access to the L1 tx author anyway, and the
     `KernelWithdrawReq` struct has no sig/proof field that could bind
     the withdraw to the actual owner of `alice`).
  4. Assert the withdraw applied, alice's balance is zero, and the
     outbox message credits the attacker's recipient.

The test is **positive-passing today** (the kernel accepts the
attack), which is exactly what documents the gap.  Once authentication
is added (e.g. a Tezos sig verified against an owner stored at
deposit time, or a per-account WOTS leaf registered and checked), this
test MUST be updated to expect a rejection and flip its assertions.
At that point it turns into a regression trap against accidentally
removing the auth.

Reference: `docs/analysis/withdraw-auth-gap.md` for the full analysis
and mitigation sketches.  Sandbox-level reproduction at
`scripts/sandbox_withdraw_auth_bypass_poc.sh`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two artefacts:

1. `docs/analysis/withdraw-auth-gap.md` — full write-up: what the
   gap is, evidence in the code (KernelWithdrawReq struct,
   apply_kernel_message match arm, operator's bearer-only auth,
   direct-L1-submit bypass), threat model, blast radius, and
   four mitigation sketches with tradeoffs (not an endorsement —
   the design decision belongs upstream).

2. `scripts/sandbox_withdraw_auth_bypass_poc.sh` — end-to-end
   reproduction on an octez sandbox.  Forked from the DAL smoke,
   keeps setup + originate + configure + deposit unchanged, then
   replaces the shield fixture step with:
       bootstrap2 (NOT operator) → octez_kernel_message withdraw
       → octez-client send smart rollup message from bootstrap2
       → kernel applies → alice's balance drained to 0
   Terminates with "VULNERABILITY CONFIRMED" on success.

The sandbox PoC and the kernel-level test in `bridge_flow.rs`
(previous commit) are complementary: the Rust test exercises the
kernel PVM directly (runs in CI under `cargo test`, no sandbox
needed), while the sandbox PoC demonstrates the full end-to-end
attack path including the "submit from a non-operator tz1 via
`octez-client send smart rollup message`" step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saroupille
Author

Reopening on my fork with base=fix/configure-messages-via-dal (PR #3's branch) so the diff shows only the 3 analysis commits instead of the 8 carried over from PR #3. Will link the new PR here once created.

@saroupille saroupille closed this Apr 21, 2026