Fixed-Step Simulation Hot-Path Improvements#21
Conversation
tuckermcclure
left a comment
There was a problem hiding this comment.
Some good stuff, some questionable stuff, and some big changes. I love the short-circuiting that was added for random subtrees and empty RatesOutputs and UpdatesOutputs.
I'm pretty unsure about the change to when discrete updates run (now: always; here: only when any model requests one). That might be a good paradigm, but it is different.
There's a failure in CI about a non-pure generated function, which spooks me. I'd need to spend more time looking at that.
| return (t_last, msd, stop, t_next_suggested) | ||
| end | ||
|
|
||
| run_discrete_update = (t_next == t_next_from_user) || (t_next == t_next_from_models) |
There was a problem hiding this comment.
This is a big difference. Before, the discrete updates always run after a continuous-time step. With this update, they only run when any model wants a discrete step. That is, at least some model has to explicitly request a discrete step for any model to actually get one. That might be fine, but it's a big switch from "discrete steps always happen at the end of continuous-time steps."
|
|
||
| # Make the discrete draws. | ||
| msd = draw_wd(t_next, ommd, msd) | ||
| msd = draw_wd(t_next, ommd, msd) |
There was a problem hiding this comment.
It's truly bizarre that the comment wasn't indented. Why is Codex so bad about comments?
| # the models requested. For fixed-step RK4 with logging/monitors disabled, let the solver | ||
| # consume its own substeps internally so this outer loop advances only at real event | ||
| # boundaries. | ||
| t_next = if mh === nothing && isempty(monitors) && Solvers.handles_internal_substepping(solver) |
There was a problem hiding this comment.
Here, the internal sub-stepping is only used when we're not logging. Sure, that makes the run faster, but I wonder how often we'll run without logging. I was running without logging primarily to help me zoom in on inefficiencies. With this change, the models actually run differently (though the results will be the same). I'm not sure how much this helps, as it is. However, there might be a feature for this, like only_log_on_discrete_samples. Did we really need the continuous-time outputs on every single point between the discrete updates? Then, this feature would help quite a bit.
| end | ||
|
|
||
| function update(msd::ModelStateDescription, updates_output::UpdatesOutput) | ||
| is_empty_updates_output(updates_output) && return msd |
There was a problem hiding this comment.
I like these little short circuits. We might also have a singleton for an empty RatesOutput and empty DiscreteOutput and simply compare to those.
| is_empty_rates_output(k1) && is_empty_rates_output(k2) && | ||
| is_empty_rates_output(k3) && is_empty_rates_output(k4) && return msd |
There was a problem hiding this comment.
If any of these is empty when the others aren't, that would be an error, so it's sufficient to check only one.
| return map( | ||
| (sm, ro1, ro2, ro3, ro4) -> propagate_rk4(sm, dt, ro1, ro2, ro3, ro4), | ||
| submodels, complete_m1, complete_m2, complete_m3, complete_m4, |
There was a problem hiding this comment.
I'm surprised about this part! The multi-input map often seems to optimize more poorly.
| end | ||
|
|
||
| function propagate_set(x::T1, dt, x_dot::T2) where {T1, T2} | ||
| @generated function propagate_set(x::T1, dt, x_dot::T2) where {T1, T2} |
There was a problem hiding this comment.
I was pretty happy that I had no allocations for RK4 without generated functions! I wonder if this is actually an improvement.
Summary
This PR speeds up fixed-step simulation in
SystemsOfSystems.jlfor theRK4(dt = 0.004 s)use case that drives the aircraft control-analysis benchmark.The main goal was to remove generic immutable-tree overhead from the inner integration loop so that the top-level simulation can run well above
100xreal time when logging is disabled.Problem
The original fixed-step
RK4path still spent a large fraction of its time in framework code rather than model physics. The main issues were:NamedTuplepropagation through nested model-state trees on every RK stageRK4could safely consume its own internal substepsThese are all hot-path problems. They matter at
250 Hzbecause a120 ssimulation requires30000solver steps, andRK4evaluates the RHS four times per step.What Changed
1. Specialized state propagation
The propagation helpers in
src/Solvers.jlwere changed from genericmap-basedNamedTuplereconstruction to generated, field-specialized builders for:This removes the runtime completion of missing submodel outputs and the repeated generic tuple plumbing from the solver hot path.
2. Deterministic random-subtree fast paths
TypedModelDescriptionnow stores:has_continuous_random_subtreehas_discrete_random_subtreeThese flags are computed once during
strip_fluff_from_model_description.draw_wcanddraw_wdnow return immediately for deterministic subtrees instead of rebuilding state descriptions that do not change.3. Empty-update fast paths
The sim loop now short-circuits when there is no real work to do:
RatesOutputpropagation returns the original stateUpdatesOutputreturns the original state4. Event-only discrete updates
step!now runs discrete updates only at actual user-requested or model-requested event boundaries. Fixed-step internal solver substeps no longer trigger discrete-update work that cannot change anything.5. Let fixed-step RK4 own its internal substeps
When logging is disabled and monitors are empty,
step!now advances the outer loop only to true event boundaries and letsRungeKutta4consume its internaldt = 0.004 ssubsteps insidesolve.This avoids forcing the top-level sim loop to re-enter framework logic for every fixed substep.
Why These Changes
The key observation from profiling was that the fixed-step benchmark was still paying framework costs that scale with solver stage count:
The physics model was not the only bottleneck. The framework itself needed to become more monomorphic and allocation-free in the inner loop.
Validation
Micro-level
During the investigation, the hot
propagate(msd, dt, ro)path on the benchmark model went from roughly:14.4 usper call28432 Ballocated per callto roughly:
0.93 usper call0 Ballocated per callafter specialization.
End-to-end
With the matching
GradientModels.jlchanges applied, the full120 sbenchmark atRK4(dt = 0.004 s)and logging disabled improved from about:3.73 swall time32xreal timeto warmed runs of about:
0.77 sto0.81 s148xto155xreal timeThis clears the
100xtarget.Compatibility
This PR does not change the public simulation API. It changes internal execution behavior only:
The new fast path is activated by runtime conditions, primarily fixed-step
RK4with logging disabled and no monitors.Risks and Follow-Ups
BasicLogpath is now the next major bottleneck for logging-enabled runs.Logs.jlandTimeSeries.jl, not the solver core.