Fixed-Step Simulation Hot-Path Improvements#21

Draft
CarlOlsson wants to merge 1 commit into
mainfrom
carl/speed_up_test

Conversation

@CarlOlsson
Member

Summary

This PR speeds up fixed-step simulation in SystemsOfSystems.jl for the RK4(dt = 0.004 s) use case that drives the aircraft control-analysis benchmark.

The main goal was to remove generic immutable-tree overhead from the inner integration loop so that the top-level simulation can run well above 100x real time when logging is disabled.

Problem

The original fixed-step RK4 path still spent a large fraction of its time in framework code rather than model physics. The main issues were:

  • generic NamedTuple propagation through nested model-state trees on every RK stage
  • repeated work for deterministic random-variable subtrees
  • unnecessary discrete-update work between actual event boundaries
  • outer-loop stepping overhead when fixed-step RK4 could safely consume its own internal substeps

These are all hot-path problems. They matter at 250 Hz because a 120 s simulation requires 30000 solver steps, and RK4 evaluates the RHS four times per step.
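As a quick check of the arithmetic above, here is a small sketch. The variable names are illustrative only and do not come from the codebase.

```julia
# Step-count arithmetic for the benchmark configuration quoted in this PR.
t_end = 120.0                      # simulated seconds
dt = 0.004                         # fixed RK4 step in seconds

rate_hz = round(Int, 1 / dt)       # solver steps per simulated second
n_steps = round(Int, t_end / dt)   # solver steps for the whole run
rhs_evals = 4 * n_steps            # RK4 evaluates the RHS four times per step

println((rate_hz = rate_hz, n_steps = n_steps, rhs_evals = rhs_evals))
```

Any per-stage framework overhead is therefore paid on the order of 10^5 times per run.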

What Changed

1. Specialized state propagation

The propagation helpers in src/Solvers.jl were changed from generic map-based NamedTuple reconstruction to generated, field-specialized builders for:

  • single-derivative propagation
  • multi-derivative propagation used by adaptive integrators
  • nested submodel propagation

This removes the runtime completion of missing submodel outputs and the repeated generic tuple plumbing from the solver hot path.
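To illustrate the idea, here is a minimal sketch of a generated, field-specialized builder. It is not the code in src/Solvers.jl: the name `propagate_fields` is hypothetical, the state is modelled as plain nested NamedTuples, and the update rule `x + dt * x_dot` is an assumption made for the example.

```julia
# Leaf case (assumed update rule for this sketch): x + dt * x_dot.
propagate_fields(x::Number, dt, x_dot::Number) = x + dt * x_dot

# Tree case: the generator runs once per concrete NamedTuple type and emits one
# propagate_fields call per field, so no generic map over the fields runs at call time.
# The generator body only builds expressions; it does not call other functions
# or read global state.
@generated function propagate_fields(x::NamedTuple{names}, dt,
                                     x_dot::NamedTuple{names}) where {names}
    calls = [:(propagate_fields(getfield(x, $(QuoteNode(n))), dt,
                                getfield(x_dot, $(QuoteNode(n)))))
             for n in names]
    return :(NamedTuple{names}($(Expr(:tuple, calls...))))
end

# Usage: nested submodel state and a matching tree of rates.
state = (p = 1.0, sub = (v = 2.0, w = 4.0))
rates = (p = 2.0, sub = (v = -1.0, w = 0.0))
new_state = propagate_fields(state, 0.5, rates)
```

Because the field names are part of the type, the recursion into `sub` is resolved at compile time in this sketch.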

2. Deterministic random-subtree fast paths

TypedModelDescription now stores:

  • has_continuous_random_subtree
  • has_discrete_random_subtree

These flags are computed once during strip_fluff_from_model_description. draw_wc and draw_wd now return immediately for deterministic subtrees instead of rebuilding state descriptions that do not change.
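The pattern can be sketched as follows. Everything here is a hypothetical stand-in: `SketchDescription`, `RandomLeaf`, `has_random`, and `draw` are illustration names, not the TypedModelDescription, draw_wc, or draw_wd implementations.

```julia
using Random

struct RandomLeaf end   # hypothetical marker for "this entry is a random variable"

# Walk the description tree once to find out whether anything in it is random.
has_random(::RandomLeaf) = true
has_random(::Any) = false
has_random(nt::NamedTuple) = any(has_random, values(nt))

# Hypothetical stand-in for a description that caches the flag at construction.
struct SketchDescription{T}
    spec::T
    has_random_subtree::Bool
end
SketchDescription(spec) = SketchDescription(spec, has_random(spec))

# Redraw only the random leaves; deterministic leaves keep their old value.
draw_leaf(::RandomLeaf, old, rng) = randn(rng)
draw_leaf(spec::NamedTuple, old::NamedTuple, rng) =
    map((s, o) -> draw_leaf(s, o, rng), spec, old)
draw_leaf(::Any, old, rng) = old

function draw(desc::SketchDescription, state, rng)
    desc.has_random_subtree || return state   # fast path: nothing to rebuild
    return draw_leaf(desc.spec, state, rng)
end

state = (a = 1.0, b = (c = 2.0,))
deterministic = SketchDescription((a = 0, b = (c = 0,)))
randomized = SketchDescription((a = RandomLeaf(), b = (c = 0,)))
unchanged = draw(deterministic, state, MersenneTwister(1))
redrawn = draw(randomized, state, MersenneTwister(1))
```

The tree walk happens once when the description is built, so each later draw on a deterministic subtree costs a single branch.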

3. Empty-update fast paths

The sim loop now short-circuits when there is no real work to do:

  • empty RatesOutput propagation returns the original state
  • empty UpdatesOutput returns the original state
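A minimal sketch of such a short circuit, assuming a hypothetical `SketchRates` wrapper around a NamedTuple of rates (the real RatesOutput and UpdatesOutput types may be laid out differently):

```julia
# Hypothetical stand-in for RatesOutput: just a NamedTuple of rates.
struct SketchRates{T<:NamedTuple}
    rates::T
end

# Emptiness of a NamedTuple is known from its type, so this check is cheap.
is_empty_rates(ro::SketchRates) = isempty(ro.rates)

function propagate_sketch(state::NamedTuple, dt, ro::SketchRates)
    is_empty_rates(ro) && return state                 # fast path: nothing to integrate
    selected = NamedTuple{keys(ro.rates)}(state)       # fields that have a rate
    stepped = map((x, r) -> x + dt * r, selected, ro.rates)
    return merge(state, stepped)                       # untouched fields pass through
end

state = (x = 1.0, v = 2.0)
same = propagate_sketch(state, 0.5, SketchRates(NamedTuple()))
moved = propagate_sketch(state, 0.5, SketchRates((x = 2.0,)))
```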

4. Event-only discrete updates

step! now runs discrete updates only at actual user-requested or model-requested event boundaries. Fixed-step internal solver substeps no longer trigger discrete-update work that cannot change anything.

5. Let fixed-step RK4 own its internal substeps

When logging is disabled and monitors are empty, step! now advances the outer loop only to true event boundaries and lets RungeKutta4 consume its internal dt = 0.004 s substeps inside solve.

This avoids forcing the top-level sim loop to re-enter framework logic for every fixed substep.
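The control flow of changes 4 and 5 can be sketched on a scalar ODE. This is an illustration under simplifying assumptions, not the step! or solve implementation: `rk4_step`, `solve_to`, and `simulate` are hypothetical names, event times are given up front, and the spacing between events is assumed to be a whole number of substeps.

```julia
# One classical RK4 step for x' = f(t, x).
function rk4_step(f, t, x, dt)
    k1 = f(t, x)
    k2 = f(t + dt / 2, x + dt / 2 * k1)
    k3 = f(t + dt / 2, x + dt / 2 * k2)
    k4 = f(t + dt, x + dt * k3)
    return x + dt / 6 * (k1 + 2k2 + 2k3 + k4)
end

# The solver consumes all fixed substeps between two boundaries on its own.
function solve_to(f, t0, x, t_end, dt)
    n = round(Int, (t_end - t0) / dt)   # assumes (t_end - t0) is a multiple of dt
    for i in 0:n-1
        x = rk4_step(f, t0 + i * dt, x, dt)
    end
    return x
end

# The outer loop wakes up only at event boundaries and runs the discrete update there.
function simulate(f, x0, event_times, dt; on_event = identity)
    t, x, visits = 0.0, x0, 0
    for t_event in event_times
        x = solve_to(f, t, x, t_event, dt)
        x = on_event(x)                 # discrete update at the boundary only
        t = t_event
        visits += 1
    end
    return x, visits
end

decay(t, x) = -x
x_final, outer_visits = simulate(decay, 1.0, 0.02:0.02:1.0, 0.004)
```

In this sketch the outer loop runs 50 times for 250 solver substeps, which is the kind of reduction in framework re-entry that the fast path aims for.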

Why These Changes

The key observation from profiling was that the fixed-step benchmark was still paying framework costs that scale with solver stage count:

  • rebuild nested immutable state trees
  • rebuild nested submodel rate trees
  • re-run boundary/event logic more often than necessary

The physics model was not the only bottleneck. The framework itself needed to become more monomorphic and allocation-free in the inner loop.

Validation

Micro-level

During the investigation, the hot propagate(msd, dt, ro) path on the benchmark model went from roughly:

  • about 14.4 us per call
  • about 28432 B allocated per call

to roughly:

  • about 0.93 us per call
  • 0 B allocated per call

after specialization.

End-to-end

With the matching GradientModels.jl changes applied, the full 120 s benchmark at RK4(dt = 0.004 s) with logging disabled improved from about:

  • 3.73 s wall time
  • about 32x real time

to warmed runs of about:

  • 0.77 s to 0.81 s
  • about 148x to 155x real time

This clears the 100x target.
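The real-time factors follow directly from the wall times quoted above. A small sketch of that arithmetic, where `real_time_factor` is a name introduced only for this example:

```julia
# Real-time factor = simulated seconds / wall-clock seconds.
real_time_factor(sim_seconds, wall_seconds) = sim_seconds / wall_seconds

before = real_time_factor(120.0, 3.73)       # original run
after_slow = real_time_factor(120.0, 0.81)   # slower warmed run
after_fast = real_time_factor(120.0, 0.77)   # faster warmed run
speedup = 3.73 / 0.81                        # conservative end-to-end speedup

println(round.((before, after_slow, after_fast, speedup); digits = 1))
```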

Compatibility

This PR does not change the public simulation API. It changes internal execution behavior only:

  • model definitions are unchanged
  • initialization contracts are unchanged
  • solver option types are unchanged
  • YAML-facing model configuration is unchanged

The new fast path is activated by runtime conditions, primarily fixed-step RK4 with logging disabled and no monitors.

Risks and Follow-Ups

  • The BasicLog path is now the next major bottleneck for logging-enabled runs.
  • If logging performance becomes the next target, the right follow-up is to optimize Logs.jl and TimeSeries.jl, not the solver core.

Member

@tuckermcclure tuckermcclure left a comment


Some good stuff, some questionable stuff, and some big changes. I love the short-circuiting that was added for random subtrees and empty RatesOutputs and UpdatesOutputs.

I'm pretty unsure about the change to when discrete updates run (on main: always; in this PR: only when any model requests one). That might be a good paradigm, but it is different.

There's a failure in CI about a non-pure generated function, which spooks me. I'd need to spend more time looking at that.

Comment thread src/SystemsOfSystems.jl
return (t_last, msd, stop, t_next_suggested)
end

run_discrete_update = (t_next == t_next_from_user) || (t_next == t_next_from_models)

This is a big difference. Before, the discrete updates always ran after a continuous-time step. With this update, they only run when any model wants a discrete step. That is, at least one model has to explicitly request a discrete step for any model to actually get one. That might be fine, but it's a big switch from "discrete steps always happen at the end of continuous-time steps."

Comment thread src/SystemsOfSystems.jl

# Make the discrete draws.
msd = draw_wd(t_next, ommd, msd)
msd = draw_wd(t_next, ommd, msd)

It's truly bizarre that the comment wasn't indented. Why is Codex so bad about comments?

Comment thread src/SystemsOfSystems.jl
# the models requested. For fixed-step RK4 with logging/monitors disabled, let the solver
# consume its own substeps internally so this outer loop advances only at real event
# boundaries.
t_next = if mh === nothing && isempty(monitors) && Solvers.handles_internal_substepping(solver)

Here, the internal sub-stepping is only used when we're not logging. Sure, that makes the run faster, but I wonder how often we'll run without logging. I was running without logging primarily to help me zoom in on inefficiencies. With this change, the models actually run differently (though the results will be the same). I'm not sure how much this helps as it is. However, there might be a feature for this, like only_log_on_discrete_samples. Do we really need the continuous-time outputs at every single point between the discrete updates? If not, this feature would help quite a bit.

Comment thread src/SystemsOfSystems.jl
end

function update(msd::ModelStateDescription, updates_output::UpdatesOutput)
is_empty_updates_output(updates_output) && return msd

I like these little short circuits. We might also have a singleton for an empty RatesOutput and empty DiscreteOutput and simply compare to those.
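One way the singleton idea could look, as a sketch only: `NoRates`, `SomeRates`, and `apply_rates` are hypothetical names, and the real RatesOutput type may not allow this split. Here the empty case is a separate type, so the short circuit is selected by dispatch instead of a runtime check.

```julia
# Hypothetical singleton type standing in for "an empty RatesOutput".
struct NoRates end
const NO_RATES = NoRates()

# Hypothetical non-empty counterpart holding a NamedTuple of rates.
struct SomeRates{T<:NamedTuple}
    rates::T
end

# Dispatch picks the short circuit; no emptiness test runs at call time.
apply_rates(state::NamedTuple, dt, ::NoRates) = state
function apply_rates(state::NamedTuple, dt, ro::SomeRates)
    selected = NamedTuple{keys(ro.rates)}(state)
    return merge(state, map((x, r) -> x + dt * r, selected, ro.rates))
end

state = (x = 1.0, v = 2.0)
kept = apply_rates(state, 0.5, NO_RATES)
stepped = apply_rates(state, 0.5, SomeRates((v = -2.0,)))
```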

Comment thread src/Solvers.jl
Comment on lines +117 to +118
is_empty_rates_output(k1) && is_empty_rates_output(k2) &&
is_empty_rates_output(k3) && is_empty_rates_output(k4) && return msd

If any of these is empty when the others aren't, that would be an error, so it's sufficient to check only one.

Comment thread src/Solvers.jl
Comment on lines +110 to +112
return map(
(sm, ro1, ro2, ro3, ro4) -> propagate_rk4(sm, dt, ro1, ro2, ro3, ro4),
submodels, complete_m1, complete_m2, complete_m3, complete_m4,

I'm surprised about this part! The multi-input map often seems to optimize more poorly.

Comment thread src/Solvers.jl
end

-function propagate_set(x::T1, dt, x_dot::T2) where {T1, T2}
+@generated function propagate_set(x::T1, dt, x_dot::T2) where {T1, T2}

I was pretty happy that I had no allocations for RK4 without generated functions! I wonder if this is actually an improvement.
