Configurable native stack size for Stylus with auto-retry on overflow#4538
Configurable native stack size for Stylus with auto-retry on overflow#4538
Conversation
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4538 +/- ##
==========================================
- Coverage 35.08% 34.20% -0.89%
==========================================
Files 498 494 -4
Lines 59096 59339 +243
==========================================
- Hits 20735 20298 -437
- Misses 34651 35446 +795
+ Partials 3710 3595 -115 |
❌ 12 Tests Failed:
View the top 3 failed tests by shortest run time
📣 Thoughts on this report? Let Codecov know! | Powered by Codecov |
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
| if status == userNativeStackOverflow { | ||
| return nil, fmt.Errorf("%w (program=%v, module=%v, depth=%d, allowFallback=%v, onChain=%v)", | ||
| ErrNativeStackOverflow, address, moduleHash, depth, GetAllowFallback(), runCtx.IsExecutedOnChain()) | ||
| } |
There was a problem hiding this comment.
do we want to return error on all userNativeStackOverflow or only when status == userNativeStackOverflow && (!GetAllowFallback() || !runCtx.IsExecutedOnChain())? Or do we want to panic here?
| "program", address, "module", moduleHash) | ||
| return userNativeStackOverflow, nil | ||
| } | ||
| if !runCtx.IsExecutedOnChain() { |
There was a problem hiding this comment.
IsExecutedOnChain is true for when the call is executed in:
messageCommitMode- new synced/sequenced blockmessageRecordingMode- block re-executed to record preimages required to create ValidationInputmessageReplayMode- block/transaction is executed i.e. in trace RPC call
IsExecutedOnChain is means:
these message modes are executed onchain so cannot make any gas shortcuts
I am not sure if we want to use fallback in any other mode then messageCommitMode. Do we need to support fallback in messageRecordingMode @tsahee ?
| // doubling path. No corruption or crash results. | ||
| baseStackSize := configuredNativeStackSize.Load() | ||
| defer func() { | ||
| SetNativeStackSize(baseStackSize) |
There was a problem hiding this comment.
Not sure yet, but I think that there might be an issue with setting stack size globally.
Because IsExecutedOnChain is true at this point the concurrent execution of trace calls can possibly mess with the head block execution.
If we decide that we want to allow fallback only when IsCommitMode is true, then I think we should have some guarantee that only one thread will set the stack size. That still will mess with stack size used by block recording thread and RPC calls - not sure how serious is that.
There was a problem hiding this comment.
Let me give some more context.
SetNativeStackSize is process-wide — it modifies a global AtomicUsize (DEFAULT_STACK_SIZE) in Wasmer shared across all threads.
DrainStackPool is best-effort (lock-free queue). So yes, when retryOnStackOverflow doubles the stack, every concurrent Stylus execution in the process is affected.
Concurrent execution paths that share this global:
┌──────────────────────────────────┬───────────────────┬──────────────┬───────────────────────────────────────┐
│ Context │ IsExecutedOnChain │ IsCommitMode │ Concurrent? │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Sequencer (block production) │ true │ true │ Yes, with all below │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Block validation (non-sequencer) │ true │ true │ Yes │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Debug trace (debug_traceBlock) │ true │ false │ Highly — runtime.NumCPU() goroutines │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ Block recording (proofs) │ true │ false │ Yes │
├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
│ eth_call / gas estimation │ false │ false │ Yes, but fallback is already disabled │
└──────────────────────────────────┴───────────────────┴──────────────┴───────────────────────────────────────┘
The key issue: debug trace calls run with IsExecutedOnChain=true and spawn multiple goroutines sharing one runCtx. If the sequencer doubles the stack during block production, concurrent trace goroutines may see the doubled size (benign — they just get more stack than needed). If the sequencer's defer restores the base size while a trace goroutine is about to allocate a coroutine, that goroutine gets the smaller stack and may overflow again — but that's equivalent to the trace never seeing the doubled value. No corruption, just a duplicate overflow.
Lets take 2 threads example in a happy path:
┌──────┬──────────────────────────────────┬─────────────────────────────────────┬──────────────────────┐
│ Step │ G1 │ G2 │ DEFAULT_STACK_SIZE │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 1 │ SetNativeStackSize(2MB) │ │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 2 │ DrainStackPool() │ │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 3 │ doStylusCall() → enters Rust │ │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 4 │ catch_traps: reads 2MB, allocs │ │ 2MB │
│ │ 2MB stack │ │ │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 5 │ executing wasm... │ SetNativeStackSize(2MB) │ 2MB (no-op, same │
│ │ │ │ value) │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 6 │ executing wasm... │ DrainStackPool() │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 7 │ executing wasm... │ doStylusCall() → enters Rust │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 8 │ executing wasm... │ catch_traps: reads 2MB, │ 2MB │
│ │ │ DefaultStack::new(2MB) │ │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 9 │ returns from Rust │ executing wasm with 2MB stack... │ 2MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 10 │ defer: SetNativeStackSize(1MB) │ executing wasm with 2MB stack... │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 11 │ defer: DrainStackPool() │ executing wasm with 2MB stack... │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 12 │ done │ returns successfully │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 13 │ │ defer: SetNativeStackSize(1MB) │ 1MB │
├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
│ 14 │ │ defer: DrainStackPool() │ 1MB │
└──────┴──────────────────────────────────┴─────────────────────────────────────┴──────────────────────┘
And now the same threads in an unhappy path:
┌──────┬─────────────────────────────────┬─────────────────────────────────────────┬────────────────────┐
│ Step │ G1 │ G2 │ DEFAULT_STACK_SIZE │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 1 │ SetNativeStackSize(2MB) │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 2 │ DrainStackPool() │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 3 │ doStylusCall() → enters Rust │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 4 │ catch_traps: reads 2MB, allocs │ │ 2MB │
│ │ 2MB │ │ │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 5 │ executing wasm... │ SetNativeStackSize(2MB) │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 6 │ executing wasm... │ DrainStackPool() │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 7 │ returns from Rust │ │ 2MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 8 │ defer: SetNativeStackSize(1MB) │ │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 9 │ defer: DrainStackPool() │ │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 10 │ done │ doStylusCall() → enters Rust │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 11 │ │ catch_traps: reads 1MB, │ 1MB │
│ │ │ DefaultStack::new(1MB) │ │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 12 │ │ overflows again on 1MB stack │ 1MB │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 13 │ │ returns NativeStackOverflow → tx │ 1MB │
│ │ │ reverts │ │
├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
│ 14 │ │ defer: SetNativeStackSize(1MB) │ 1MB │
└──────┴─────────────────────────────────┴─────────────────────────────────────────┴────────────────────┘
the worst case in any race scenario is that a concurrent thread overflows again (same outcome as if the doubling never happened) which is the last scenario above. Does that help?
|
|
||
| evmApi := newApi(evm, tracingInfo, scope, memoryModel) | ||
| defer evmApi.drop() | ||
| savedGas := scope.Contract.Gas |
There was a problem hiding this comment.
Is this only field that we need to save for possible later revert?
I think that at least Contract.UseMultiGas and Contract.RetainedMultiGas might need to also be reverted.
Also what about the rest of vm.ScopeContext?
I haven't read into that deep enough, but I think that we might need to save whole vm.ScopeContext as it includes Stack and Memory context of the contract and we will need to revert those also.
There was a problem hiding this comment.
Here's a quick analysis by Claude and was able to confirm by looking further at the code:
scope.Stack— This is the EVM operand stack (used by the interpreter forPUSH,POP,ADD, etc.). Stylus programs run inside Wasmer with their own WASM stack. The CGo stylus_call never touches scope.Stack. In fact, when Stylus is invoked, the EVM interpreter isn't running — callProgram is called directly from the precompile path, so the EVM stack is dormant.scope.Memory— Same story. This is EVM memory (used byMLOAD,MSTORE, etc.). Stylus programs have their own WASM linear memory managed by Wasmer. The stylus_call FFI doesn't read or writescope.Memory.scope.Contract(other fields) — The immutable fields (caller, address, code, codehash, value) don't change during execution.RetainedMultiGasis only modified by EVM instructions, never by Stylus host I/O.
I'll save UsedMultiGas as well. Does the above make sense, or did I miss something and we need to save other fields?
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
arbos/programs/native.go
Outdated
|
|
||
| if compiled { | ||
| // Freshly compiled means localAsm was singlepass, so cranelift is different | ||
| // code worth retrying. If !compiled, the cranelift ASM was already in the |
There was a problem hiding this comment.
not necessarily true. If database has both we have previously singlepass.
I think it's worthwhile to retry with cranelift anywat
There was a problem hiding this comment.
I was thinking about the very first time this gets compiled, but for subsequent calls, when db has both, it'll have been singlepass, so yeah you're right
Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
--stylus-target.native-stack-sizenode config to set the initial Wasmer coroutine stack size for Stylus execution (default: 0 = Wasmer's 1 MB default)NativeStackOverflowvariant toUserOutcome/UserOutcomeKindto distinguish retriable native overflows from deterministic OutOfStack (DepthChecker)--stylus-target.allow-fallback=true(default):a. Recompile the program with Cranelift and retry
b. Persist the Cranelift-compiled ASM to the wasm store so subsequent overflows skip recompilation
c. If Cranelift also overflows, double the stack size once (capped at 100 MB) and retry with Cranelift
d. If still overflowing, fail gracefully as out-of-stack
--stylus-target.allow-fallback=false, no retry is attempted on overfloweth_call, gas estimation) do not trigger retries or Cranelift compilationpulls in OffchainLabs/go-ethereum#645
pulls in OffchainLabs/wasmer#37
closes NIT-4686