Configurable native stack size for Stylus with auto-retry on overflow#4538

Open
bragaigor wants to merge 29 commits into master from nm004-config-stylus-stack-size

Conversation

Contributor

@bragaigor bragaigor commented Mar 20, 2026

  • Add --stylus-target.native-stack-size node config to set the initial Wasmer coroutine stack size for Stylus execution (default: 0 = Wasmer's 1 MB default)
  • Add NativeStackOverflow variant to UserOutcome/UserOutcomeKind to distinguish retriable native overflows from deterministic OutOfStack (DepthChecker)
  • On native stack overflow with --stylus-target.allow-fallback=true (default):
    a. Recompile the program with Cranelift and retry
    b. Persist the Cranelift-compiled ASM to the wasm store so subsequent overflows skip recompilation
    c. If Cranelift also overflows, double the stack size once (capped at 100 MB) and retry with Cranelift
    d. If still overflowing, fail gracefully as out-of-stack
  • With --stylus-target.allow-fallback=false, no retry is attempted on overflow
  • Off-chain calls (eth_call, gas estimation) do not trigger retries or Cranelift compilation
  • Add new WasmTarget variants for Cranelift ASM storage (arm64-cranelift, amd64-cranelift, host-cranelift) in go-ethereum
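
The retry order in steps a to d can be sketched as below. This is a minimal illustration, not the PR's code: `callWithFallback`, `runFunc`, `simulate`, and the error values are hypothetical names, and returning the overflow unchanged when fallback is disabled or the call is off-chain is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// All names here are hypothetical stand-ins, not the PR's identifiers.
var (
	errNativeStackOverflow = errors.New("native stack overflow")
	errOutOfStack          = errors.New("out of stack")
)

const (
	defaultStackSize uint64 = 1 << 20   // a configured 0 means Wasmer's 1 MB default
	maxStackSize     uint64 = 100 << 20 // cap on the doubled size
)

type compiler string

const (
	singlepass compiler = "singlepass"
	cranelift  compiler = "cranelift"
)

// runFunc executes the program once with the given compiler and stack size.
type runFunc func(c compiler, stackSize uint64) error

// callWithFallback follows steps a-d: Cranelift retry, one doubling, then give up.
func callWithFallback(run runFunc, configured uint64, allowFallback, onChain bool) error {
	base := configured
	if base == 0 {
		base = defaultStackSize
	}
	err := run(singlepass, base)
	// Assumption: without fallback, or off-chain, the overflow is returned as-is.
	if !errors.Is(err, errNativeStackOverflow) || !allowFallback || !onChain {
		return err
	}
	// Step a: retry the same stack size with Cranelift-compiled code.
	if err = run(cranelift, base); !errors.Is(err, errNativeStackOverflow) {
		return err
	}
	// Step c: double the stack once, capped at 100 MB.
	doubled := base * 2
	if doubled > maxStackSize {
		doubled = maxStackSize
	}
	if err = run(cranelift, doubled); !errors.Is(err, errNativeStackOverflow) {
		return err
	}
	// Step d: still overflowing, so report a plain out-of-stack failure.
	return errOutOfStack
}

// simulate runs a fake program that overflows whenever the stack is below need.
func simulate(need uint64, allowFallback, onChain bool) (attempts int, err error) {
	run := func(c compiler, stackSize uint64) error {
		attempts++
		if stackSize < need {
			return errNativeStackOverflow
		}
		return nil
	}
	err = callWithFallback(run, 0, allowFallback, onChain)
	return attempts, err
}

func main() {
	attempts, err := simulate(3<<19, true, true) // fake program needs 1.5 MB
	fmt.Println(attempts, err)
}
```

Step b (persisting the Cranelift ASM) is left out of the sketch because it does not change the control flow.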

pulls in OffchainLabs/go-ethereum#645
pulls in OffchainLabs/wasmer#37
closes NIT-4686

Signed-off-by: Igor Braga <5835477+bragaigor@users.noreply.github.com>
@codecov

codecov bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 9.15179% with 407 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.20%. Comparing base (620aa42) to head (f166a5c).
⚠️ Report is 40 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4538      +/-   ##
==========================================
- Coverage   35.08%   34.20%   -0.89%     
==========================================
  Files         498      494       -4     
  Lines       59096    59339     +243     
==========================================
- Hits        20735    20298     -437     
- Misses      34651    35446     +795     
+ Partials     3710     3595     -115     

@github-actions
Contributor

github-actions bot commented Mar 20, 2026

❌ 12 Tests Failed:

Tests completed Failed Passed Skipped
4816 12 4804 0
View the top 3 failed tests by shortest run time
TestAliasingFlaky
Stack Traces | -0.000s run time
=== RUN   TestAliasingFlaky
=== PAUSE TestAliasingFlaky
=== CONT  TestAliasingFlaky
    common_test.go:777: BuildL1 deployConfig: DeployBold=true, DeployReferenceDAContracts=false
INFO [04-02|20:41:58.953] New local node record                    seq=1,775,162,518,952 id=bc2f79172e4754c0                        ip=127.0.0.1 udp=0 tcp=0
INFO [04-02|20:41:58.953] Started P2P networking                   self=enode://a75b4c0b95837768c233fcebeeb3d90736bb5d8ea2ec23fd43b2b8a97fcbaf1a4e2c3f321e59b95211c3e07ddb27918b5cfe7862d0ec3b921336c4b61998e9c1@127.0.0.1:0
WARN [04-02|20:41:58.953] Getting file info                        dir= error="stat : no such file or directory"
TestPruningDBSizeReduction
Stack Traces | 0.000s run time
=== RUN   TestPruningDBSizeReduction
--- FAIL: TestPruningDBSizeReduction (0.00s)
TestBatchPosterL1SurplusMatchesBatchGasFlaky
Stack Traces | 0.540s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
panic: runtime error: invalid memory address or nil pointer dereference [recovered, repanicked]
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2081ed2]

goroutine 36 [running]:
testing.tRunner.func1.2({0x37ec1c0, 0x620b9e0})
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1875 +0x35b
panic({0x37ec1c0?, 0x620b9e0?})
	/opt/hostedtoolcache/go/1.25.8/x64/src/runtime/panic.go:783 +0x132
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).GetBatchCount(0xfac1900?)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:210 +0x12
github.com/offchainlabs/nitro/arbnode.(*InboxTracker).FindInboxBatchContainingMessage(0x0, 0x7)
	/home/runner/work/nitro/nitro/arbnode/inbox_tracker.go:225 +0x2f
github.com/offchainlabs/nitro/system_tests.TestBatchPosterL1SurplusMatchesBatchGasFlaky(0xc000502c40)
	/home/runner/work/nitro/nitro/system_tests/batch_poster_test.go:839 +0x725
testing.tRunner(0xc000502c40, 0x41bf9a8)
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 1
	/opt/hostedtoolcache/go/1.25.8/x64/src/testing/testing.go:1997 +0x465

@bragaigor bragaigor marked this pull request as ready for review March 30, 2026 22:22
@bragaigor bragaigor marked this pull request as draft March 31, 2026 00:46
Comment on lines +496 to +499
if status == userNativeStackOverflow {
	return nil, fmt.Errorf("%w (program=%v, module=%v, depth=%d, allowFallback=%v, onChain=%v)",
		ErrNativeStackOverflow, address, moduleHash, depth, GetAllowFallback(), runCtx.IsExecutedOnChain())
}
Contributor Author

@bragaigor bragaigor Apr 1, 2026

Do we want to return an error on every userNativeStackOverflow, or only when status == userNativeStackOverflow && (!GetAllowFallback() || !runCtx.IsExecutedOnChain())? Or do we want to panic here?

Contributor

let's panic here

@bragaigor bragaigor marked this pull request as ready for review April 1, 2026 13:26
@bragaigor bragaigor requested a review from magicxyyz April 1, 2026 13:27
"program", address, "module", moduleHash)
return userNativeStackOverflow, nil
}
if !runCtx.IsExecutedOnChain() {
Contributor

IsExecutedOnChain is true when the call is executed in:

  • messageCommitMode - new synced/sequenced block
  • messageRecordingMode - block re-executed to record the preimages required to create ValidationInput
  • messageReplayMode - block/transaction is re-executed, e.g. in a trace RPC call

IsExecutedOnChain means:

these message modes are executed onchain so cannot make any gas shortcuts

https://github.com/OffchainLabs/go-ethereum/blob/461e5177cdbb8d237702dbb889ed2db19515f8c7/core/state_transition.go#L296-L299

I am not sure if we want to use fallback in any mode other than messageCommitMode. Do we need to support fallback in messageRecordingMode, @tsahee?
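
For reference, the mode classification described above, plus the stricter commit-mode-only gate being asked about, could look like the sketch below. The type, the constant order, and `fallbackAllowed` are local assumptions for illustration; they are not go-ethereum's definitions.

```go
package main

import "fmt"

// messageRunMode is a local stand-in; the constant order is an assumption.
type messageRunMode uint8

const (
	messageCommitMode messageRunMode = iota
	messageGasEstimationMode
	messageEthcallMode
	messageReplayMode
	messageRecordingMode
)

// executedOnChain is true for the three modes listed in the comment above.
func (m messageRunMode) executedOnChain() bool {
	switch m {
	case messageCommitMode, messageRecordingMode, messageReplayMode:
		return true
	}
	return false
}

func (m messageRunMode) isCommitMode() bool { return m == messageCommitMode }

// fallbackAllowed is the hypothetical stricter gate: commit mode only.
func fallbackAllowed(m messageRunMode, allowFallback bool) bool {
	return allowFallback && m.isCommitMode()
}

func main() {
	names := map[messageRunMode]string{
		messageCommitMode:        "commit",
		messageGasEstimationMode: "gas-estimation",
		messageEthcallMode:       "eth_call",
		messageReplayMode:        "replay",
		messageRecordingMode:     "recording",
	}
	for m := messageCommitMode; m <= messageRecordingMode; m++ {
		fmt.Printf("%-15s onChain=%-5v fallback=%v\n",
			names[m], m.executedOnChain(), fallbackAllowed(m, true))
	}
}
```

With this gate, recording and replay would still count as on-chain for gas purposes but would never enter the retry path.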

// doubling path. No corruption or crash results.
baseStackSize := configuredNativeStackSize.Load()
defer func() {
	SetNativeStackSize(baseStackSize)
Contributor

Not sure yet, but I think there might be an issue with setting the stack size globally.

Because IsExecutedOnChain is true at this point, concurrently executing trace calls could interfere with the head block execution.

If we decide to allow fallback only when IsCommitMode is true, then I think we should have some guarantee that only one thread sets the stack size. That would still affect the stack size used by the block recording thread and RPC calls; I'm not sure how serious that is.
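
One way to get the "only one thread sets the stack size" guarantee is to serialize the doubling path behind a lock. The sketch below is an assumption about how that could look, not code from the PR: `nativeStackSize` stands in for Wasmer's process-wide value, and `resizeMu`, `withDoubledStack`, and `retryConcurrently` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	nativeStackSize atomic.Uint64 // stand-in for the process-wide stack size
	resizeMu        sync.Mutex    // hypothetical lock serializing the doubling path
)

// withDoubledStack doubles the global size for one retry and restores it afterwards.
func withDoubledStack(run func(size uint64)) {
	resizeMu.Lock()
	defer resizeMu.Unlock() // deferred first, so it runs after the restore below
	base := nativeStackSize.Load()
	nativeStackSize.Store(2 * base)
	defer nativeStackSize.Store(base)
	run(nativeStackSize.Load())
}

// retryConcurrently launches n retries and reports whether each saw the doubled size.
func retryConcurrently(n int, base uint64) (allDoubled bool, final uint64) {
	nativeStackSize.Store(base)
	seen := make([]uint64, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			withDoubledStack(func(size uint64) { seen[i] = size })
		}(i)
	}
	wg.Wait()
	allDoubled = true
	for _, s := range seen {
		allDoubled = allDoubled && s == 2*base
	}
	return allDoubled, nativeStackSize.Load()
}

func main() {
	ok, final := retryConcurrently(8, 1<<20)
	fmt.Println(ok, final)
}
```

The lock only orders the writers. Readers that are not on the retry path (recording, RPC) can still observe the temporarily doubled value, which is the residual effect mentioned above.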

Contributor Author

Let me give some more context.

SetNativeStackSize is process-wide — it modifies a global AtomicUsize (DEFAULT_STACK_SIZE) in Wasmer shared across all threads.
DrainStackPool is best-effort (lock-free queue). So yes, when retryOnStackOverflow doubles the stack, every concurrent Stylus execution in the process is affected.

Concurrent execution paths that share this global:

  ┌──────────────────────────────────┬───────────────────┬──────────────┬───────────────────────────────────────┐
  │             Context              │ IsExecutedOnChain │ IsCommitMode │              Concurrent?              │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Sequencer (block production)     │ true              │ true         │ Yes, with all below                   │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Block validation (non-sequencer) │ true              │ true         │ Yes                                   │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Debug trace (debug_traceBlock)   │ true              │ false        │ Highly — runtime.NumCPU() goroutines  │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ Block recording (proofs)         │ true              │ false        │ Yes                                   │
  ├──────────────────────────────────┼───────────────────┼──────────────┼───────────────────────────────────────┤
  │ eth_call / gas estimation        │ false             │ false        │ Yes, but fallback is already disabled │
  └──────────────────────────────────┴───────────────────┴──────────────┴───────────────────────────────────────┘

The key issue: debug trace calls run with IsExecutedOnChain=true and spawn multiple goroutines sharing one runCtx. If the sequencer doubles the stack during block production, concurrent trace goroutines may see the doubled size (benign — they just get more stack than needed). If the sequencer's defer restores the base size while a trace goroutine is about to allocate a coroutine, that goroutine gets the smaller stack and may overflow again — but that's equivalent to the trace never seeing the doubled value. No corruption, just a duplicate overflow.

Let's take a two-goroutine example on the happy path:

  ┌──────┬──────────────────────────────────┬─────────────────────────────────────┬──────────────────────┐    
  │ Step │                G1                │                 G2                  │  DEFAULT_STACK_SIZE  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 1    │ SetNativeStackSize(2MB)          │                                     │ 2MB                  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 2    │ DrainStackPool()                 │                                     │ 2MB                  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
  │ 3    │ doStylusCall() → enters Rust     │                                     │ 2MB                  │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 4    │ catch_traps: reads 2MB, allocs   │                                     │ 2MB                  │
  │      │ 2MB stack                        │                                     │                      │    
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
  │ 5    │ executing wasm...                │ SetNativeStackSize(2MB)             │ 2MB (no-op, same     │    
  │      │                                  │                                     │ value)               │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 6    │ executing wasm...                │ DrainStackPool()                    │ 2MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 7    │ executing wasm...                │ doStylusCall() → enters Rust        │ 2MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤
  │ 8    │ executing wasm...                │ catch_traps: reads 2MB,             │ 2MB                  │
  │      │                                  │ DefaultStack::new(2MB)              │                      │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 9    │ returns from Rust                │ executing wasm with 2MB stack...    │ 2MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 10   │ defer: SetNativeStackSize(1MB)   │ executing wasm with 2MB stack...    │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 11   │ defer: DrainStackPool()          │ executing wasm with 2MB stack...    │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 12   │ done                             │ returns successfully                │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 13   │                                  │ defer: SetNativeStackSize(1MB)      │ 1MB                  │
  ├──────┼──────────────────────────────────┼─────────────────────────────────────┼──────────────────────┤    
  │ 14   │                                  │ defer: DrainStackPool()             │ 1MB                  │
  └──────┴──────────────────────────────────┴─────────────────────────────────────┴──────────────────────┘

And now the same threads in an unhappy path:

  ┌──────┬─────────────────────────────────┬─────────────────────────────────────────┬────────────────────┐
  │ Step │               G1                │                   G2                    │ DEFAULT_STACK_SIZE │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 1    │ SetNativeStackSize(2MB)         │                                         │ 2MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 2    │ DrainStackPool()                │                                         │ 2MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤   
  │ 3    │ doStylusCall() → enters Rust    │                                         │ 2MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤   
  │ 4    │ catch_traps: reads 2MB, allocs  │                                         │ 2MB                │
  │      │ 2MB                             │                                         │                    │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 5    │ executing wasm...               │ SetNativeStackSize(2MB)                 │ 2MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 6    │ executing wasm...               │ DrainStackPool()                        │ 2MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 7    │ returns from Rust               │                                         │ 2MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 8    │ defer: SetNativeStackSize(1MB)  │                                         │ 1MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 9    │ defer: DrainStackPool()         │                                         │ 1MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 10   │ done                            │ doStylusCall() → enters Rust            │ 1MB                │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 11   │                                 │ catch_traps: reads 1MB,                 │ 1MB                │   
  │      │                                 │ DefaultStack::new(1MB)                  │                    │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤   
  │ 12   │                                 │ overflows again on 1MB stack            │ 1MB                │
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 13   │                                 │ returns NativeStackOverflow → tx        │ 1MB                │
  │      │                                 │ reverts                                 │                    │   
  ├──────┼─────────────────────────────────┼─────────────────────────────────────────┼────────────────────┤
  │ 14   │                                 │ defer: SetNativeStackSize(1MB)          │ 1MB                │   
  └──────┴─────────────────────────────────┴─────────────────────────────────────────┴────────────────────┘

The worst case in any race scenario is that a concurrent thread overflows again (the same outcome as if the doubling had never happened), which is the last scenario above. Does that help?
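
The unhappy-path table can also be replayed deterministically on a single thread, which makes the worst case easy to check. This is only an illustration: `replayUnhappyPath` is a hypothetical helper, and a plain variable stands in for the shared global.

```go
package main

import "fmt"

// replayUnhappyPath replays the second table step by step on one thread.
// base is the configured stack size; need is what the program requires.
func replayUnhappyPath(base, need uint64) (g1OK, g2OK bool, final uint64) {
	global := base

	global = 2 * base // step 1: G1 sets the doubled size
	g1Stack := global // step 4: G1 reads the global and allocates its stack
	global = 2 * base // step 5: G2 sets the doubled size (same value)
	global = base     // step 8: G1's deferred restore runs
	g2Stack := global // step 11: G2 reads the restored, smaller value
	global = base     // step 14: G2's deferred restore runs

	return g1Stack >= need, g2Stack >= need, global
}

func main() {
	// A program that needs 1.5 MB with a 1 MB base size.
	g1OK, g2OK, final := replayUnhappyPath(1<<20, 3<<19)
	fmt.Println(g1OK, g2OK, final) // prints: true false 1048576
}
```

G2 ends up exactly where it would have been without any doubling, and the global always returns to the base value.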


evmApi := newApi(evm, tracingInfo, scope, memoryModel)
defer evmApi.drop()
savedGas := scope.Contract.Gas
Contributor

Is this the only field that we need to save for a possible later revert?
I think that at least Contract.UsedMultiGas and Contract.RetainedMultiGas might also need to be reverted.

Also, what about the rest of vm.ScopeContext?

I haven't looked into this deeply enough, but I think we might need to save the whole vm.ScopeContext, since it includes the Stack and Memory of the contract, and we would need to revert those as well.

Contributor Author

@bragaigor bragaigor Apr 1, 2026

Here's a quick analysis by Claude, which I was able to confirm by looking further at the code:

  • scope.Stack — This is the EVM operand stack (used by the interpreter for PUSH, POP, ADD, etc.). Stylus programs run inside Wasmer with their own WASM stack. The CGo stylus_call never touches scope.Stack. In fact, when Stylus is invoked, the EVM interpreter isn't running — callProgram is called directly from the precompile path, so the EVM stack is dormant.
  • scope.Memory — Same story. This is EVM memory (used by MLOAD, MSTORE, etc.). Stylus programs have their own WASM linear memory managed by Wasmer. The stylus_call FFI doesn't read or write scope.Memory.
  • scope.Contract (other fields) — The immutable fields (caller, address, code, codehash, value) don't change during execution. RetainedMultiGas is only modified by EVM instructions, never by Stylus host I/O.

I'll save UsedMultiGas as well. Does the above make sense, or did I miss something and we need to save other fields?
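
A minimal sketch of the save-and-restore idea, assuming only Gas and UsedMultiGas need to be rolled back between attempts. The `contract` struct is a simplified stand-in (UsedMultiGas is reduced to a single counter here), and `runWithRestore`, `charge`, and `demo` are hypothetical helpers.

```go
package main

import "fmt"

// contract is a hypothetical stand-in with only the two fields discussed above.
type contract struct {
	Gas          uint64
	UsedMultiGas uint64 // simplified to one counter for this sketch
}

type gasSnapshot struct{ gas, usedMultiGas uint64 }

func takeSnapshot(c *contract) gasSnapshot { return gasSnapshot{c.Gas, c.UsedMultiGas} }

func (s gasSnapshot) restore(c *contract) { c.Gas, c.UsedMultiGas = s.gas, s.usedMultiGas }

// attempt charges gas on c and reports whether the call finished without overflowing.
type attempt func(c *contract) bool

// runWithRestore tries each attempt in order, undoing gas changes after a failed one.
func runWithRestore(c *contract, attempts ...attempt) bool {
	saved := takeSnapshot(c)
	for _, try := range attempts {
		if try(c) {
			return true
		}
		saved.restore(c) // the failed attempt must not leak gas into the retry
	}
	return false
}

// charge builds a fake attempt that burns amount gas and then succeeds or overflows.
func charge(amount uint64, ok bool) attempt {
	return func(c *contract) bool {
		c.Gas -= amount
		c.UsedMultiGas += amount
		return ok
	}
}

// demo: the first attempt burns 100 gas and overflows, the retry burns 40 and succeeds.
func demo() contract {
	c := contract{Gas: 1000}
	runWithRestore(&c, charge(100, false), charge(40, true))
	return c
}

func main() {
	c := demo()
	fmt.Println(c.Gas, c.UsedMultiGas) // prints: 960 40
}
```

Only the successful attempt is charged, so the retry is invisible in the final gas accounting.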


if compiled {
	// Freshly compiled means localAsm was singlepass, so cranelift is different
	// code worth retrying. If !compiled, the cranelift ASM was already in the
Contributor

Not necessarily true. If the database has both, we previously used singlepass.
I think it's worthwhile to retry with Cranelift anyway.

Contributor Author

I was thinking about the very first time this gets compiled, but for subsequent calls, when the DB has both, it will have been singlepass, so yeah, you're right.

@bragaigor bragaigor self-assigned this Apr 2, 2026
4 participants