Background
We're done optimizing CFP push, for now. Next, we want to fix "double spills": stop spilling values onto the VM stack and spill them only onto the C stack. Instead of eagerly spilling VM state at every safepoint, we want to store enough information out of line in JITFrame so that the VM state can be reconstructed in the event of deoptimization.
To do that, we need to be able to lazily copy values from the C stack to the VM stack when a C function called from JIT code uses the rb_debug_inspector C API or raises. Below is my design doc for the problem.
Problem
```ruby
require 'binding_of_caller'

def update_foo
  b = binding.of_caller(1)
  b.local_variable_set(:foo, b.local_variable_get(:foo) + 1)
end

def entry(x)
  foo = x + 1
  update_foo
  foo
end

entry(0) #=> 2
```
Right now, we spill the result of x + 1 (an Opnd::VReg) onto the VM stack slot for the local variable foo at the call to update_foo. The rb_debug_inspector C API (called inside of_caller) then looks at the VM stack to escape the local variable foo into a Binding object on the heap.
We want to stop writing this VM stack slot at the update_foo call. Instead, since the "C call" for update_foo (we use asm.ccall_with_iseq_call for JIT-to-JIT calls) already spills any live registers such as the Opnd::VReg for foo, we want the C call to remember the stack index of the spilled Opnd::VReg so it can be retrieved from the C stack as needed.
Solution
We push down the pair of JITFrame pointer and FrameState (its stack and locals converted to Vec<lir::Opnd>, to be precise) into the LIR for each C call.
Whenever we have a non-leaf C call, including any JIT-to-JIT calls, we currently have codegen like this:
```rust
gen_prepare_non_leaf_call(jit, asm, state); // or its equivalent broken down into multiple calls
// ... some other instructions ...
asm_ccall!(asm, some_c_function, args...);
```
Every such gen_prepare_non_leaf_call + ccall pair could work as is if gen_prepare_non_leaf_call did:
- Allocate a JITFrame (Lightweight Frame). Write it into cfp->jit_return (instead of writing the PC into cfp->pc).
- Insert an LIR instruction that remembers stack: Vec<Opnd>, locals: Vec<Opnd>, jit_frame: *const JITFrame for the next C call. Vec<hir::InsnId> in hir::FrameState should be converted to Vec<lir::Opnd> just like build_side_exit().
- Maybe the struct should be called StackMap, and the LIR API to set it can be asm.stack_map(...).
- In the register allocator's handle_caller_saved_regs phase, when it pushes VRegs before a C call, remember the { VReg => stack index } assignments in the JITFrame for any VReg referenced in stack or locals.
- Open question: when an Opnd is a callee-save Reg, or an Opnd::Mem that references a callee-save Reg, which wouldn't be pushed by this process, what should we do?
  - I assume such Reg usages are only SP/CFP/EC and therefore don't exist.
  - For Opnd::Mem cases, I think we can either eagerly load such operands into VRegs (and prohibit leaving them in stack maps) or specially handle only the registers used for SP/CFP/EC (we know what SP/CFP/EC should be for any given frame).
- When the rb_debug_inspector API needs to query locals (or when rb_raise needs to materialize locals on the VM stack before the C stack expires via longjmp), check cfp->jit_frame, and read the locals (and the stack, for rb_raise) using the StackMap remembered in the JITFrame.
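The shape of such a StackMap and its resolution after register allocation could be sketched roughly as below. This is a minimal sketch, not the actual ZJIT types: the fields, the simplified two-variant Opnd, and the resolve helper are all assumptions for illustration.

```rust
/// A simplified LIR operand: either a virtual register or an immediate.
/// (The real lir::Opnd has more variants, e.g. Mem; see the open question above.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opnd {
    VReg(usize), // SSA value; lands in a C-stack slot when spilled at the call
    Imm(i64),    // constant; needs no spill slot, can be rematerialized
}

/// Per-C-call snapshot of VM state, kept out of line (hypothetical fields).
struct StackMap {
    stack: Vec<Opnd>,  // VM operand stack contents, as LIR operands
    locals: Vec<Opnd>, // local variables, as LIR operands
    // jit_frame: *const JITFrame would also live here in the real design
}

impl StackMap {
    /// After handle_caller_saved_regs spills VRegs, look up where each local
    /// landed on the C stack, given the { VReg => stack index } assignments.
    fn resolve(&self, spills: &[(usize, usize)]) -> Vec<Option<usize>> {
        self.locals
            .iter()
            .map(|opnd| match opnd {
                Opnd::VReg(v) => spills
                    .iter()
                    .copied()
                    .find(|&(reg, _)| reg == *v)
                    .map(|(_, idx)| idx),
                Opnd::Imm(_) => None, // immediates need no C-stack slot
            })
            .collect()
    }
}

fn main() {
    // One local `foo` held in VReg 3, spilled to C-stack index 2 at the call.
    let map = StackMap { stack: vec![], locals: vec![Opnd::VReg(3)] };
    let slots = map.resolve(&[(3, 2)]);
    println!("{:?}", slots); // [Some(2)]
}
```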
GC
When we stop spilling objects onto the VM stack, the VM stack may contain uninitialized slots, which GC will scan. To avoid crashing the GC, we need to do something about it.
- Skip gaps in rb_execution_context_mark
- Use rb_gc_mark_maybe in rb_execution_context_mark
  - Easy to implement, but
  - Risks: over-retention relative to before, longer marking time, more pinned objects (conservative marking never moves)
- Materialize the JITFrame in rb_execution_context_mark
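The first option ("skip gaps") amounts to consulting liveness information while walking the VM stack, instead of marking every slot. A minimal sketch, with hypothetical names standing in for the actual CRuby marking code:

```rust
// Stand-in bit pattern for a garbage (never-written) VM stack slot.
const UNINITIALIZED: u64 = 0xdead_beef;

/// Mark only the slots that the stack map says are materialized.
/// `mark.push(...)` stands in for rb_gc_mark() in the real code.
fn mark_vm_stack(slots: &[u64], live: &[bool], mark: &mut Vec<u64>) {
    for (slot, &is_live) in slots.iter().zip(live) {
        if is_live {
            mark.push(*slot);
        }
        // Gaps are skipped instead of being conservatively marked,
        // avoiding the over-retention/pinning risks of rb_gc_mark_maybe.
    }
}

fn main() {
    let stack = [42, UNINITIALIZED, 7];
    let live = [true, false, true];
    let mut marked = Vec::new();
    mark_vm_stack(&stack, &live, &mut marked);
    println!("{:?}", marked); // [42, 7]
}
```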
Side exits
On a JIT-to-JIT call, when a callee side-exits, it needs to materialize the VM stack of the caller.
- We could let zjit_materialize_frames materialize stack slots in jit_exec/JIT_EXEC, or
  - Slow, but smaller code size
- We could generate code that writes stack slots on callee exits
  - Faster on exit, but generates more code
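Either way, the materialization step itself is a simple copy loop: walk the stack map and move each spilled value from its C-stack slot into the caller's VM stack slot. A sketch under the assumption that the map has already been resolved to C-stack indices (the function name echoes zjit_materialize_frames, but the signature is hypothetical):

```rust
/// Copy spilled values from the C stack back into the caller's VM stack,
/// using the stack map's resolved { VM slot => C-stack index } assignments.
fn materialize(c_stack: &[u64], stack_map: &[usize], vm_stack: &mut Vec<u64>) {
    vm_stack.clear();
    for &c_index in stack_map {
        // One load + one store per live value; immediates (not modeled here)
        // would be written directly instead of read from the C stack.
        vm_stack.push(c_stack[c_index]);
    }
}

fn main() {
    let c_stack = [0, 10, 20, 30]; // registers spilled at the call site
    let stack_map = [3, 1];        // VM stack = [c_stack[3], c_stack[1]]
    let mut vm_stack = Vec::new();
    materialize(&c_stack, &stack_map, &mut vm_stack);
    println!("{:?}", vm_stack); // [30, 10]
}
```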
JITFrame
To include stack maps in JITFrame that zjit_materialize_frames can read, we need to encode lir::Opnd in C structs and bindgen them to Rust. The encoding should also differentiate immediates from stack references, and remember the stack index associated with each VReg.
Performance impact
JITFrame will carry more information, so we'll spend more memory. JIT code will be faster, because mainline code will spill less. Frame materialization will be slower because frame reconstruction takes more work, but hopefully it happens rarely enough that we are faster overall.
The speed calculus depends heavily on the side-exit rate, since we reconstruct VM state on exit. So Lightweight Frames may not deliver an immediate speed boost on benchmarks with a high exit rate. However, we believe there are long-term benefits as Lightweight Frames is combined with other optimizations and side-exit rates are reduced.
References
- HotSpot (OpenJDK)
- C2: The JIT In HotSpot: C2 tracks JVM state at safepoints as a mapping from registers, stack slots, and constants back to the JVM interpreter stack. Safepoints are themselves IR nodes with use→def edges to every JVM state value. Because safepoints have very low execution frequency, the code scheduler spills bits needed only by the interpreter (not by JIT code) and moves them off the hot path.
- JavaScriptCore
- Speculation in JavaScriptCore: This paper discusses how they do OSR exit using stack maps. In DFG IR, the MovHint instruction provides delta compression of stack maps, and MovHint instructions allow OSR exits to tell where on the stack values should reside for LLInt and the Baseline JIT.
- Introducing the B3 JIT Compiler: B3 does register allocation over Air. Later phases transform spill slots into concrete stack addresses. Their StackmapGenerationParams, parsed from a binary stackmap blob, stores the list of in-use registers at each patchpoint, the mapping from high-level operands to registers, and the information necessary to emit each patchpoint's machine code.
- V8
- Maglev - V8’s Fastest Optimizing JIT: "Maglev attaches abstract interpreter frame state to nodes that can deoptimize. This state maps interpreter registers to SSA values. This state turns into metadata during code generation, providing a mapping from optimized state to unoptimized state."
- Deoptimization in V8: TurboFan inserts Checkpoint nodes in the effect chain before potentially deoptimizing operations, with FrameState nodes capturing execution state and StateValues representing parameter, local, and accumulator states. "If a node deoptimizes, it will deoptimize to the state in the last checkpoint in the effect chain."
- LLVM
- Stack maps and patch points in LLVM:
  - "A stack map records the location of live values at a particular instruction address. These live values do not refer to all the LLVM values live across the stack map. Instead, they are only the values that the runtime requires to be live at this point. For example, they may be the values the runtime will need to resume program execution at that point independent of the compiled function containing the stack map."