
ZJIT: Stack Maps for Lightweight Frames #979

@k0kubun

Description

Background

We're done optimizing CFP push, for now. We want to look into fixing "double spills": stop spilling values onto the VM stack and only spill them onto the C stack. Instead of eagerly spilling VM state for every safepoint, we want to put enough information into JITFrame out-of-line so that the VM state can be reconstructed in the event of deoptimization.

To do that, we need to be able to lazily copy values from the C stack to the VM stack when a C function called from JIT code uses the rb_debug_inspector C API or raises an exception. Below is my design doc for the problem.

Problem

require 'binding_of_caller'

def update_foo
  b = binding.of_caller(1)
  b.local_variable_set(:foo, b.local_variable_get(:foo) + 1)
end

def entry(x)
  foo = x + 1
  update_foo
  foo
end

entry(0) #=> 2

Right now, we would spill the result of x + 1, which would be an Opnd::VReg, into the VM stack slot for the local variable foo at the call to update_foo. The rb_debug_inspector C API (called inside of_caller) then looks at the VM stack to escape the local variable foo into a Binding object on the heap.

We want to stop writing this VM stack slot on the update_foo call. Instead, since the "C call" for update_foo (we use asm.ccall_with_iseq_call for JIT-to-JIT calls) would spill any live registers, like the Opnd::VReg for foo, we want to let the C call remember the stack index of the spilled Opnd::VReg so it can be retrieved from the C stack as needed.
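A minimal sketch of that bookkeeping, with illustrative stand-in types (VReg and SpillMap here are not the actual ZJIT types): the C call records which C stack slot each live VReg was pushed to, so the value of foo can be found later without a VM stack write.

```rust
// Illustrative sketch only: VReg and SpillMap are stand-ins, not the
// actual ZJIT types. This models a C call remembering where each live
// VReg was spilled on the C stack.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct VReg(pub usize);

#[derive(Default, Debug)]
pub struct SpillMap {
    // VReg => index of the C stack slot it was pushed to
    slots: HashMap<VReg, usize>,
}

impl SpillMap {
    // Recorded while caller-saved registers are pushed before the ccall.
    pub fn record(&mut self, vreg: VReg, stack_index: usize) {
        self.slots.insert(vreg, stack_index);
    }

    // Queried later, e.g. when rb_debug_inspector needs the value of `foo`.
    pub fn lookup(&self, vreg: VReg) -> Option<usize> {
        self.slots.get(&vreg).copied()
    }
}

fn main() {
    let mut map = SpillMap::default();
    map.record(VReg(3), 0); // the VReg holding `foo` spilled to slot 0
    assert_eq!(map.lookup(VReg(3)), Some(0));
    assert_eq!(map.lookup(VReg(4)), None);
}
```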

Solution

We push the pair of a JITFrame pointer and a FrameState (its stack and locals converted to Vec<lir::Opnd>, to be precise) down into the LIR for each C call.

Whenever we have a non-leaf C call, including any JIT-to-JIT calls, we currently have codegen like this:

gen_prepare_non_leaf_call(jit, asm, state); // or its equivalent broken down into multiple calls
// ... some other instructions ...
asm_ccall!(asm, some_c_function, args...);

Every such gen_prepare_non_leaf_call + ccall pair could work as-is if gen_prepare_non_leaf_call did the following:

  1. Allocate a JITFrame (Lightweight Frame). Write it into cfp->jit_return (instead of writing PC into cfp->pc).
  2. Insert an LIR instruction that remembers stack: Vec<Opnd>, locals: Vec<Opnd>, jit_frame: *const JITFrame for the next C call.
    • Vec<hir::InsnId> in hir::FrameState should be converted to Vec<lir::Opnd> just like build_side_exit().
    • Maybe the struct should be called StackMap and the LIR API to set it can be asm.stack_map(...).
  3. In the register allocator's handle_caller_saved_regs phase, when it pushes VRegs before a C call, remember the { VReg => stack index } assignments in the JITFrame for any VReg referenced in stack or locals.
    • Open question: When any Opnd is a callee-save Reg, or an Opnd::Mem that references a callee-save Reg, which wouldn't be pushed by this process, what should we do?
      • I assume such Reg usages are only SP/CFP/EC and therefore don't exist.
      • For Opnd::Mem cases, I think we can either eagerly load such operands into VRegs (and prohibit leaving them in stack maps) or handle only registers used for SP/CFP/EC specially (we know what SP/CFP/EC should be for any given frame).
  4. When the rb_debug_inspector API needs to query locals (or when rb_raise needs to materialize locals on the VM stack before the C stack is discarded by longjmp), check cfp->jit_frame, and read locals (and, for rb_raise, the stack) using the StackMap remembered in the JITFrame.
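Steps 2 and 3 can be sketched as follows, with simplified stand-ins for the lir types (field names follow this proposal, but the real API may differ):

```rust
// Sketch of the proposed StackMap + asm.stack_map(...) API. Opnd,
// StackMap, and Assembler are simplified stand-ins for the lir types.
#[derive(Clone, Debug, PartialEq)]
enum Opnd {
    VReg(usize),
    Imm(i64),
}

#[derive(Debug)]
struct StackMap {
    stack: Vec<Opnd>,  // FrameState stack, lowered to lir::Opnd
    locals: Vec<Opnd>, // FrameState locals, lowered to lir::Opnd
    jit_frame: usize,  // stand-in for *const JITFrame
}

#[derive(Default)]
struct Assembler {
    pending_stack_map: Option<StackMap>,
}

impl Assembler {
    // Attach a stack map to the next C call (step 2).
    fn stack_map(&mut self, stack: Vec<Opnd>, locals: Vec<Opnd>, jit_frame: usize) {
        self.pending_stack_map = Some(StackMap { stack, locals, jit_frame });
    }

    // The register allocator consumes the map at the next ccall so it can
    // record { VReg => stack index } while spilling (step 3).
    fn take_stack_map(&mut self) -> Option<StackMap> {
        self.pending_stack_map.take()
    }
}

fn main() {
    let mut asm = Assembler::default();
    asm.stack_map(vec![Opnd::VReg(1)], vec![Opnd::Imm(5)], 0);
    let map = asm.take_stack_map().unwrap();
    assert_eq!(map.locals, vec![Opnd::Imm(5)]);
    assert!(asm.take_stack_map().is_none()); // consumed by the ccall
}
```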

GC

When we stop spilling objects onto the VM stack, the VM stack may contain uninitialized slots, which would still be scanned on GC. To avoid crashing the GC, we need to do something about them.

  • Skip gaps on rb_execution_context_mark
    • Most promising?
  • Use rb_gc_mark_maybe on rb_execution_context_mark
    • Easy to implement
    • Risks: over-retention relative to before, longer marking time, more pinned objects (conservative marking never moves objects)
  • Materialize JITFrame on rb_execution_context_mark
    • Most likely slowest
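The "skip gaps" option can be sketched like this; Option<u64> here is a stand-in for a slot plus whatever gap-tracking the real implementation would use (e.g. a bitmap derived from the stack maps):

```rust
// Sketch of the "skip gaps" option: in rb_execution_context_mark terms,
// walk the VM stack but only hand initialized slots to rb_gc_mark,
// skipping the uninitialized gaps left by not spilling.
fn slots_to_mark(vm_stack: &[Option<u64>]) -> Vec<u64> {
    vm_stack.iter().filter_map(|slot| *slot).collect()
}

fn main() {
    // Slots 0 and 2 were written; slot 1 is an uninitialized gap.
    let stack = [Some(10), None, Some(30)];
    assert_eq!(slots_to_mark(&stack), vec![10, 30]);
}
```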

Side exit

On a JIT-to-JIT call, when a callee side-exits, it needs to materialize the VM stack of the caller.

  • We could let zjit_materialize_frames materialize stack slots on jit_exec/JIT_EXEC, or
    • Slow, but smaller code size
  • We could generate code that writes stack slots on callee exits
    • Faster on exit, but generates more code
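The first option can be sketched as a runtime walk over the caller's stack map; Entry is a simplified stand-in for a stack map entry, not the actual ZJIT type:

```rust
// Sketch of the zjit_materialize_frames option: on a side exit, walk the
// caller's stack map and rebuild its VM stack, reading spilled values out
// of the C stack.
#[derive(Clone, Copy)]
enum Entry {
    Imm(u64),          // value known at compile time
    CStackSlot(usize), // value spilled to this C stack slot
}

fn materialize(stack_map: &[Entry], c_stack: &[u64]) -> Vec<u64> {
    stack_map
        .iter()
        .map(|entry| match entry {
            Entry::Imm(v) => *v,
            Entry::CStackSlot(i) => c_stack[*i],
        })
        .collect()
}

fn main() {
    // One immediate and one value spilled to C stack slot 1.
    let vm_stack = materialize(&[Entry::Imm(7), Entry::CStackSlot(1)], &[10, 20]);
    assert_eq!(vm_stack, vec![7, 20]);
}
```

The second option would instead emit these writes as machine code on each callee exit path, trading code size for exit speed.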

JITFrame

To include stack maps in JITFrame that zjit_materialize_frames can read, we need to encode lir::Opnd in C structs and bindgen them back into Rust. The encoding should also differentiate immediates from stack references, and remember the stack index associated with each VReg.
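A possible shape for such an entry, with an illustrative layout (the tag values and field names here are assumptions, not the actual struct):

```rust
// Sketch of a C-encodable stack map entry to be bindgen'd back into Rust.
// A tag differentiates immediates from C stack references.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct StackMapEntry {
    kind: u8,   // 0 = immediate value, 1 = C stack slot index
    value: i64, // the immediate itself, or the stack index of the spilled VReg
}

impl StackMapEntry {
    fn imm(value: i64) -> Self {
        StackMapEntry { kind: 0, value }
    }
    fn stack_ref(index: i64) -> Self {
        StackMapEntry { kind: 1, value: index }
    }
}

fn main() {
    let e = StackMapEntry::stack_ref(4);
    assert_eq!(e.kind, 1);
    assert_eq!(e.value, 4);
    assert_eq!(StackMapEntry::imm(-1).kind, 0);
}
```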

Performance impact

JITFrame will carry more information, so we'll spend more memory. JIT code will be faster, because mainline code will spill less. Frame materialization will be slower, because frame reconstruction takes more work, but hopefully it happens rarely enough that we are faster overall.

The overall speedup depends heavily on the side-exit rate, since we reconstruct VM state on exit. So Lightweight Frames may not deliver an immediate speed boost on benchmarks with a high exit rate. However, we believe there are long-term benefits as Lightweight Frames is combined with other optimizations and side-exit rates are reduced.

References

  • HotSpot (OpenJDK)
    • C2: The JIT In HotSpot: C2 tracks JVM state at safepoints as a mapping from registers, stack slots, and constants back to the JVM interpreter stack. Safepoints are themselves IR nodes with use→def edges to every JVM state value. Because safepoints execute very rarely, the code scheduler spills values needed only by the interpreter (not by JIT code) and moves those spills off the main path.
  • JavaScriptCore
    • Speculation in JavaScriptCore: This article discusses how they do OSR exit using stack maps. In DFG IR, a MovHint instruction provides delta compression of stack maps; MovHints tell OSR exit where on the stack values should reside for the LLInt and the Baseline JIT.
    • Introducing the B3 JIT Compiler: B3 does register allocation over Air, and later phases turn spill slots into concrete stack addresses. Their StackmapGenerationParams, parsed from a binary stackmap blob, stores the list of in-use registers at each patchpoint, the mapping from high-level operands to registers, and the information necessary to emit each patchpoint's machine code.
  • V8
    • Maglev - V8’s Fastest Optimizing JIT: "Maglev attaches abstract interpreter frame state to nodes that can deoptimize. This state maps interpreter registers to SSA values. This state turns into metadata during code generation, providing a mapping from optimized state to unoptimized state."
    • Deoptimization in V8: TurboFan inserts Checkpoint nodes in the effect chain before potentially deoptimizing operations, with Framestate nodes capturing execution state and StateValues representing parameter, local, and accumulator states. "If a node deoptimizes, it will deoptimize to the state in the last checkpoint in the effect chain."
  • LLVM
    • Stack maps and patch points in LLVM:
    • A stack map records the location of live values at a particular instruction address. These live values do not refer to all the LLVM values live across the stack map. Instead, they are only the values that the runtime requires to be live at this point. For example, they may be the values the runtime will need to resume program execution at that point independent of the compiled function containing the stack map.
