Skip to content

Delay WallTimeInfo SYPD estimate#4509

Draft
imreddyTeja wants to merge 1 commit into
mainfrom
tr/up-deps
Draft

Delay WallTimeInfo SYPD estimate#4509
imreddyTeja wants to merge 1 commit into
mainfrom
tr/up-deps

Conversation

@imreddyTeja

@imreddyTeja imreddyTeja commented May 21, 2026

Copy link
Copy Markdown
Member

Purpose

Delays the SYPD estimate until callbacks are compiled.

To-do

Content

Delay estimating sypd until either a quarter of the tspan is finished, or all non-post-simulation callbacks are called at least once


  • I have read and checked the items on the review checklist.

@imreddyTeja imreddyTeja force-pushed the tr/up-deps branch 7 times, most recently from 16368d0 to 9914cc0 Compare May 26, 2026 23:15
@imreddyTeja

Copy link
Copy Markdown
Member Author

/agent_review

@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Pull Request Review completed successfully!

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope: 2 files changed — src/callbacks/callback_helpers.jl, src/callbacks/get_callbacks.jl. No tests run locally. PR is draft; no NEWS.md entry (internal behavior only, acceptable).


medium — Logic mismatch: n_steps_per_cycle computed on partial callback set

Affected: get_callbacks.jl, common_callbacks, lines 529–543.

n_steps_per_cycle(callbacks, dt) is called after NaN-checking and conservation callbacks have been added, but before graceful-exit, checkpoint, and gc callbacks. The PR description states the goal is to delay SYPD estimation until "all non-post-simulation callbacks are called at least once," but checkpoint and gc callbacks are excluded from the LCM computation. A daily checkpoint callback with long timesteps could have a step interval larger than check_nan_every (1024), so the SYPD estimate may still be biased by JIT warmup for checkpoint compilation. In practice, check_nan_every=1024 is likely the dominant term and the risk is low, but the implementation does not match the stated invariant. Consider computing steps_per_cycle after all callbacks are assembled, or document which callbacks are intentionally excluded.

low — Duck-typing via hasproperty is fragile

Affected: callback_helpers.jl line 113.

hasproperty(cbs, :discrete_callbacks) ? cbs.discrete_callbacks : cbs distinguishes a CallbackSet from a tuple by checking for the field name. This will silently fall through to collect(cbs) for any unexpected type that also lacks discrete_callbacks. A more explicit check (cbs isa Tuple) would make the intent clearer and catch unintended call-site types with a method error rather than a runtime collect failure. No behavior impact in current call sites.


Open questions

  • When check_nan_every=0 and check_conservation=false, callbacks is empty and n_steps_per_cycle returns 1, giving a 1-step delay or div(tot_steps,4) delay — whichever is smaller. Is the intent to have no delay in this case (since there are no pre-progress callbacks to wait for)?
  • The callback execution order has changed (NaN and conservation now fire before progress logging in each timestep). This is arguably correct, but any test that asserts callback ordering by side-effect may need updating.

Residual testing gaps

  • No test covers the first_update_after_t_simulation path through WallTimeInfo. A unit test calling common_callbacks with a short tspan and verifying no SYPD log entry appears before sypd_estimate_step_delay steps would confirm the delay works correctly.
  • The atmos_callbacks(tuple) code path (via hasproperty branch) is new and has no direct test; it is exercised indirectly through n_steps_per_cycle in common_callbacks.

Generated by Pull Request Review for issue #4509 · ● 6.7M

@imreddyTeja

Copy link
Copy Markdown
Member Author

/agent_review

@github-actions

github-actions Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Pull Request Review completed successfully!

@imreddyTeja imreddyTeja changed the title Tr/up deps Delay WallTimeInfo SYPD estimate May 27, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope: 2 files changed — src/callbacks/callback_helpers.jl, src/callbacks/get_callbacks.jl. No tests run (PR is draft). Focus areas: correctness of the delay logic, the hasproperty guard, compat safety.


medium — Validation — No test coverage for new delay logic

Affected: common_callbacks in src/callbacks/get_callbacks.jl (lines 544–558).

The sypd_estimate_step_delay calculation (min(steps_per_cycle, div(tot_steps, 4))) and the resulting first_update_after_t_simulation are not exercised by any test. The existing test/callbacks.jl only covers n_steps_per_cycle_per_cb. Edge cases — tot_steps == 0, steps_per_cycle > tot_steps, tot_steps < 4 (where div(tot_steps, 4) == 0) — are all reachable in short simulations and could silently suppress SYPD output entirely or produce the wrong first-emission threshold. The infrastructure test group or test/callbacks.jl would be the right place for a targeted unit test.


low — Dead compat code — WallTimeInfo() constructor mismatch

Affected: src/compat.jl (lines 27–39), src/callbacks/get_callbacks.jl (line 240).

The compat WallTimeInfo() constructor (guarded by pkgversion(ClimaUtilities) < v"0.1.20") accepts no arguments, but the new call site uses WallTimeInfo(; first_update_after_t_simulation). Because the Project.toml lower bound is ClimaUtilities = "0.1.27", the compat branch is permanently dead code and will never be reached. No runtime risk, but the compat block is now misleading and could be removed or annotated to prevent future confusion.


Open questions

  • typemin(dt) is used as the default sentinel for first_update_after_t_simulation. For ITime this would be typemin(Int64) (or similar). Confirm that the WallTimeInfo implementation in ClimaUtilities ≥ 0.1.20 handles this sentinel correctly and does not underflow or misorder when comparing against simulation time.
  • The callback ordering change (conservation before checkpoint, progress after both) is intentional per the PR description. Confirm this does not affect any Buildkite jobs that assert a specific output ordering or timestamp structure in logs.

Residual testing gaps

  • No Buildkite job explicitly covers the SYPD reporting path or the new delay threshold; the change is validated only by visual inspection of log output in practice.
  • Short simulations where tot_steps < 4 (yielding div(tot_steps, 4) == 0 and hence zero delay) are not tested.

Generated by Pull Request Review for issue #4509 · ● 4.2M

@imreddyTeja

Copy link
Copy Markdown
Member Author

Todo: bump ClimaUtilities compat

mod order

cap delay

change kwarg

improve comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant