Background
We're done optimizing CFP push, for now. Next, we want to fix "double spills": stop spilling values onto the VM stack and spill them only onto the C stack. Instead of eagerly spilling VM state at every safepoint, we want to store enough information out of line in JITFrame so that the VM state can be reconstructed in the event of deoptimization.
To do that, we need to be able to lazily copy values from the C stack to the VM stack when a C function called from JIT code uses the rb_debug_inspector C API or raises. Below is my design doc for the problem.
Problem
```ruby
require 'binding_of_caller'

def update_foo
  b = binding.of_caller(1)
  b.local_variable_set(:foo, b.local_variable_get(:foo) + 1)
end

def entry(x)
  foo = x + 1
  update_foo
  foo
end

entry(0) #=> 2
```
Right now, we spill the result of x + 1 (an Opnd::VReg) onto the VM stack slot for the local variable foo at the call to update_foo. The rb_debug_inspector C API (called inside of_caller) then looks at the VM stack to escape the local variable foo into a Binding object on the heap.
We want to stop writing this VM stack slot at the update_foo call. Instead, since the "C call" for update_foo (we use asm.ccall_with_iseq_call for JIT-to-JIT calls) already spills any live registers such as the Opnd::VReg for foo, we want the C call to remember the stack index of the spilled Opnd::VReg so it can be retrieved from the C stack as needed.
Solution
We push down the pair of JITFrame pointer and FrameState (its stack and locals converted to Vec<lir::Opnd>, to be precise) into the LIR for each C call.
Whenever we have a non-leaf C call, including any JIT-to-JIT calls, we currently have codegen like this:
```rust
gen_prepare_non_leaf_call(jit, asm, state); // or its equivalent broken down into multiple calls
// ... some other instructions ...
asm_ccall!(asm, some_c_function, args...);
```
Every such gen_prepare_non_leaf_call + ccall pair could work as is if gen_prepare_non_leaf_call did:
- Allocate a JITFrame (Lightweight Frame). Write it into cfp->jit_return (instead of writing the PC into cfp->pc).
- Insert an LIR instruction that remembers stack: Vec<Opnd>, locals: Vec<Opnd>, jit_frame: *const JITFrame for the next C call. Vec<hir::InsnId> in hir::FrameState should be converted to Vec<lir::Opnd> just like build_side_exit().
- Maybe the struct should be called StackMap, and the LIR API to set it can be asm.stack_map(...).
- In the register allocator's handle_caller_saved_regs phase, when it pushes VRegs before a C call, remember the { VReg => stack index } assignments in the JITFrame for any VReg referenced in stack or locals.
- Open question: when an Opnd is a callee-save Reg, or an Opnd::Mem that references a callee-save Reg, which wouldn't be pushed by this process, what should we do?
  - I assume such Reg usages are only SP/CFP/EC and therefore don't exist.
  - For Opnd::Mem cases, I think we can either eagerly load such operands into VRegs (and prohibit leaving them in stack maps) or specially handle only the registers used for SP/CFP/EC (we know what SP/CFP/EC should be for any given frame).
- When the rb_debug_inspector API needs to query locals (or when rb_raise needs to materialize locals on the VM stack before the C stack expires via longjmp), check cfp->jit_frame, and read the locals (and the stack, for rb_raise) using the StackMap remembered in the JITFrame.
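The shape of such a StackMap and its resolution after register allocation could be sketched roughly as below. This is a minimal sketch, not the actual ZJIT types: the fields, the simplified two-variant Opnd, and the resolve helper are all assumptions for illustration.

```rust
/// A simplified LIR operand: either a virtual register or an immediate.
/// (The real lir::Opnd has more variants, e.g. Mem; see the open question above.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opnd {
    VReg(usize), // SSA value; lands in a C-stack slot when spilled at the call
    Imm(i64),    // constant; needs no spill slot, can be rematerialized
}

/// Per-C-call snapshot of VM state, kept out of line (hypothetical fields).
struct StackMap {
    stack: Vec<Opnd>,  // VM operand stack contents, as LIR operands
    locals: Vec<Opnd>, // local variables, as LIR operands
    // jit_frame: *const JITFrame would also live here in the real design
}

impl StackMap {
    /// After handle_caller_saved_regs spills VRegs, look up where each local
    /// landed on the C stack, given the { VReg => stack index } assignments.
    fn resolve(&self, spills: &[(usize, usize)]) -> Vec<Option<usize>> {
        self.locals
            .iter()
            .map(|opnd| match opnd {
                Opnd::VReg(v) => spills
                    .iter()
                    .copied()
                    .find(|&(reg, _)| reg == *v)
                    .map(|(_, idx)| idx),
                Opnd::Imm(_) => None, // immediates need no C-stack slot
            })
            .collect()
    }
}

fn main() {
    // One local `foo` held in VReg 3, spilled to C-stack index 2 at the call.
    let map = StackMap { stack: vec![], locals: vec![Opnd::VReg(3)] };
    let slots = map.resolve(&[(3, 2)]);
    println!("{:?}", slots); // [Some(2)]
}
```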
GC
When we stop spilling objects onto the VM stack, the VM stack may contain uninitialized slots, which GC will scan. To avoid crashing the GC, we need to do something about it.
- Skip gaps in rb_execution_context_mark
- Use rb_gc_mark_maybe in rb_execution_context_mark
  - Easy to implement, but
  - Risks: over-retention relative to before, longer marking time, more pinned objects (conservative marking never moves)
- Materialize the JITFrame in rb_execution_context_mark
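The first option ("skip gaps") amounts to consulting liveness information while walking the VM stack, instead of marking every slot. A minimal sketch, with hypothetical names standing in for the actual CRuby marking code:

```rust
// Stand-in bit pattern for a garbage (never-written) VM stack slot.
const UNINITIALIZED: u64 = 0xdead_beef;

/// Mark only the slots that the stack map says are materialized.
/// `mark.push(...)` stands in for rb_gc_mark() in the real code.
fn mark_vm_stack(slots: &[u64], live: &[bool], mark: &mut Vec<u64>) {
    for (slot, &is_live) in slots.iter().zip(live) {
        if is_live {
            mark.push(*slot);
        }
        // Gaps are skipped instead of being conservatively marked,
        // avoiding the over-retention/pinning risks of rb_gc_mark_maybe.
    }
}

fn main() {
    let stack = [42, UNINITIALIZED, 7];
    let live = [true, false, true];
    let mut marked = Vec::new();
    mark_vm_stack(&stack, &live, &mut marked);
    println!("{:?}", marked); // [42, 7]
}
```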
Side exits
On a JIT-to-JIT call, when a callee side-exits, it needs to materialize the VM stack of the caller.
- We could let zjit_materialize_frames materialize stack slots in jit_exec/JIT_EXEC, or
  - Slow, but smaller code size
- We could generate code that writes stack slots on callee exits
  - Faster on exit, but generates more code
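Either way, the materialization step itself is a simple copy loop: walk the stack map and move each spilled value from its C-stack slot into the caller's VM stack slot. A sketch under the assumption that the map has already been resolved to C-stack indices (the function name echoes zjit_materialize_frames, but the signature is hypothetical):

```rust
/// Copy spilled values from the C stack back into the caller's VM stack,
/// using the stack map's resolved { VM slot => C-stack index } assignments.
fn materialize(c_stack: &[u64], stack_map: &[usize], vm_stack: &mut Vec<u64>) {
    vm_stack.clear();
    for &c_index in stack_map {
        // One load + one store per live value; immediates (not modeled here)
        // would be written directly instead of read from the C stack.
        vm_stack.push(c_stack[c_index]);
    }
}

fn main() {
    let c_stack = [0, 10, 20, 30]; // registers spilled at the call site
    let stack_map = [3, 1];        // VM stack = [c_stack[3], c_stack[1]]
    let mut vm_stack = Vec::new();
    materialize(&c_stack, &stack_map, &mut vm_stack);
    println!("{:?}", vm_stack); // [30, 10]
}
```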
JITFrame
To include stack maps in JITFrame that zjit_materialize_frames can read, we need to encode lir::Opnd in C structs and bindgen them to Rust. The encoding should also differentiate immediates from stack references, and remember the stack index associated with each VReg.
Performance impact
JITFrame will carry more information, so we'll spend more memory. JIT code will be faster, because mainline code will spill less. Frame materialization will be slower because frame reconstruction takes more work, but hopefully it happens rarely enough that we are faster overall.
The speed calculus depends heavily on the side-exit rate, since we reconstruct VM state on exit. So Lightweight Frames may not deliver an immediate speed boost on benchmarks with a high exit rate. However, we believe there are long-term benefits as Lightweight Frames is combined with other optimizations and side-exit rates are reduced.
References
- HotSpot (OpenJDK)
- C2: The JIT In HotSpot: C2 tracks JVM state at safepoints as a mapping from registers, stack slots, and constants back to the JVM interpreter stack. Safepoints are themselves IR nodes with use→def edges to every JVM state value. Because safepoints have very low execution frequency, the code scheduler spills bits needed only by the interpreter (not by JIT code) and moves them off the hot path.
- JavaScriptCore
- Speculation in JavaScriptCore: This paper discusses how they do OSR exit using stack maps. In DFG IR, the MovHint instruction provides delta compression of stack maps, and MovHint instructions allow OSR exits to tell where on the stack values should reside for LLInt and the Baseline JIT.
- Introducing the B3 JIT Compiler: B3 does register allocation over Air. Later phases transform spill slots into concrete stack addresses. Their StackmapGenerationParams, parsed from a binary stackmap blob, stores the list of in-use registers at each patchpoint, the mapping from high-level operands to registers, and the information necessary to emit each patchpoint's machine code.
- V8
- Maglev - V8’s Fastest Optimizing JIT: "Maglev attaches abstract interpreter frame state to nodes that can deoptimize. This state maps interpreter registers to SSA values. This state turns into metadata during code generation, providing a mapping from optimized state to unoptimized state."
- Deoptimization in V8: TurboFan inserts Checkpoint nodes in the effect chain before potentially deoptimizing operations, with FrameState nodes capturing execution state and StateValues representing parameter, local, and accumulator states. "If a node deoptimizes, it will deoptimize to the state in the last checkpoint in the effect chain."
- LLVM
- Stack maps and patch points in LLVM:
  - "A stack map records the location of live values at a particular instruction address. These live values do not refer to all the LLVM values live across the stack map. Instead, they are only the values that the runtime requires to be live at this point. For example, they may be the values the runtime will need to resume program execution at that point independent of the compiled function containing the stack map."