Delay WallTimeInfo SYPD estimate#4509
Conversation
16368d0 to
9914cc0
Compare
|
/agent_review |
|
✅ Pull Request Review completed successfully! |
There was a problem hiding this comment.
Scope: 2 files changed — src/callbacks/callback_helpers.jl, src/callbacks/get_callbacks.jl. No tests run locally. PR is draft; no NEWS.md entry (internal behavior only, acceptable).
medium — Logic mismatch: n_steps_per_cycle computed on partial callback set
Affected: get_callbacks.jl, common_callbacks, lines 529–543.
n_steps_per_cycle(callbacks, dt) is called after NaN-checking and conservation callbacks have been added, but before graceful-exit, checkpoint, and gc callbacks. The PR description states the goal is to delay SYPD estimation until "all non-post-simulation callbacks are called at least once," but checkpoint and gc callbacks are excluded from the LCM computation. A daily checkpoint callback with long timesteps could have a step interval larger than check_nan_every (1024), so the SYPD estimate may still be biased by JIT warmup for checkpoint compilation. In practice, check_nan_every=1024 is likely the dominant term and the risk is low, but the implementation does not match the stated invariant. Consider computing steps_per_cycle after all callbacks are assembled, or document which callbacks are intentionally excluded.
low — Duck-typing via hasproperty is fragile
Affected: callback_helpers.jl line 113.
hasproperty(cbs, :discrete_callbacks) ? cbs.discrete_callbacks : cbs distinguishes a CallbackSet from a tuple by checking for the field name. This will silently fall through to collect(cbs) for any unexpected type that also lacks discrete_callbacks. A more explicit check (cbs isa Tuple) would make the intent clearer and catch unintended call-site types with a method error rather than a runtime collect failure. No behavior impact in current call sites.
Open questions
- When
check_nan_every=0andcheck_conservation=false,callbacksis empty andn_steps_per_cyclereturns 1, giving a 1-step delay ordiv(tot_steps,4)delay — whichever is smaller. Is the intent to have no delay in this case (since there are no pre-progress callbacks to wait for)? - The callback execution order has changed (NaN and conservation now fire before progress logging in each timestep). This is arguably correct, but any test that asserts callback ordering by side-effect may need updating.
Residual testing gaps
- No test covers the
first_update_after_t_simulationpath throughWallTimeInfo. A unit test callingcommon_callbackswith a short tspan and verifying no SYPD log entry appears beforesypd_estimate_step_delaysteps would confirm the delay works correctly. - The
atmos_callbacks(tuple)code path (viahaspropertybranch) is new and has no direct test; it is exercised indirectly throughn_steps_per_cycleincommon_callbacks.
Generated by Pull Request Review for issue #4509 · ● 6.7M
|
/agent_review |
|
✅ Pull Request Review completed successfully! |
There was a problem hiding this comment.
Scope: 2 files changed — src/callbacks/callback_helpers.jl, src/callbacks/get_callbacks.jl. No tests run (PR is draft). Focus areas: correctness of the delay logic, the hasproperty guard, compat safety.
medium — Validation — No test coverage for new delay logic
Affected: common_callbacks in src/callbacks/get_callbacks.jl (lines 544–558).
The sypd_estimate_step_delay calculation (min(steps_per_cycle, div(tot_steps, 4))) and the resulting first_update_after_t_simulation are not exercised by any test. The existing test/callbacks.jl only covers n_steps_per_cycle_per_cb. Edge cases — tot_steps == 0, steps_per_cycle > tot_steps, tot_steps < 4 (where div(tot_steps, 4) == 0) — are all reachable in short simulations and could silently suppress SYPD output entirely or produce the wrong first-emission threshold. The infrastructure test group or test/callbacks.jl would be the right place for a targeted unit test.
low — Dead compat code — WallTimeInfo() constructor mismatch
Affected: src/compat.jl (lines 27–39), src/callbacks/get_callbacks.jl (line 240).
The compat WallTimeInfo() constructor (guarded by pkgversion(ClimaUtilities) < v"0.1.20") accepts no arguments, but the new call site uses WallTimeInfo(; first_update_after_t_simulation). Because the Project.toml lower bound is ClimaUtilities = "0.1.27", the compat branch is permanently dead code and will never be reached. No runtime risk, but the compat block is now misleading and could be removed or annotated to prevent future confusion.
Open questions
typemin(dt)is used as the default sentinel forfirst_update_after_t_simulation. ForITimethis would betypemin(Int64)(or similar). Confirm that theWallTimeInfoimplementation in ClimaUtilities ≥ 0.1.20 handles this sentinel correctly and does not underflow or misorder when comparing against simulation time.- The callback ordering change (conservation before checkpoint, progress after both) is intentional per the PR description. Confirm this does not affect any Buildkite jobs that assert a specific output ordering or timestamp structure in logs.
Residual testing gaps
- No Buildkite job explicitly covers the SYPD reporting path or the new delay threshold; the change is validated only by visual inspection of log output in practice.
- Short simulations where
tot_steps < 4(yieldingdiv(tot_steps, 4) == 0and hence zero delay) are not tested.
Generated by Pull Request Review for issue #4509 · ● 4.2M
|
Todo: bump ClimaUtilities compat |
mod order cap delay change kwarg improve comments
Purpose
Delays the SYPD estimate until callbacks are compiled.
To-do
Content
Delay estimating sypd until either a quarter of the tspan is finished, or all non-post-simulation callbacks are called at least once