-
Notifications
You must be signed in to change notification settings - Fork 26
[Graph] Support member ndarrays for qd.checkpoint and qd.graph_do_while #760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
hughperkins
wants to merge
7
commits into
main
Choose a base branch
from
hp/member-ndarray-yield-condition
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
b0101fa
[Graph] Support member ndarrays as qd.checkpoint(yield_on=) / qd.grap…
hughperkins f785888
Merge branch 'main' into hp/member-ndarray-yield-condition
hughperkins d3de129
[Graph] Add tests for member-ndarray yield_on= / graph_do_while + sch…
hughperkins 58bba23
[Graph] Preserve IntEnum identity for checkpoint labels across fast-c…
hughperkins 9c45a87
[Graph] Extract _resolve_ndarray_kernel_arg_id into ast_transformers/…
hughperkins f3c9cbe
[Graph] Rewrap PR comments/docstrings to project's 120c target
hughperkins a691735
Merge branch 'main' into hp/member-ndarray-yield-condition
hughperkins File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a checkpoint uses an
IntEnumcp_idand the kernel is restored fromsrc_ll_cache, this new cache field is serialized through JSON as plain integers and_try_load_fastcacherestorescheckpoint_user_labels_by_cp_idas[1]rather than[Stage.LOAD].maybe_build_graph_status()then returns the raw int forstatus.checkpointon cache hits, breaking the documented/API contract thatqd.checkpoint(Stage.X, ...)round-trips the enum value rather than the underlying int. Persist enough enum metadata/expression information to reconstruct the label, or avoid lossy restoration for enum labels.Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch -- confirmed and fixed in 58bba23.
Pydantic coerces
IntEnumtointatCacheValue.__init__time (the field is typedlist[int | None]), so just persisting the int column was lossy even before JSON. Schema bumped tocachevalue-v4-intenum-qualnames, which adds a parallelcheckpoint_user_label_enum_qualnamescolumn.src_hasher.storederives the per-slotmodule.ClassQualName.MEMBERstring from the live label list (still holding the originalIntEnuminstances at store time, before pydantic strips identity), and_resolve_intenum_memberre-imports the enum class viaimportlibon load. Mismatch / failed import (enum moved or renamed since the cache was written) falls back to the persisted int rather than raising, so stale caches degrade gracefully.Tests:
test_checkpoint_fastcache_preserves_intenum_label_identity-- subprocess cache miss + hit, assertsisinstance(label, _FastcacheStage)after restore (not just int equality).test_src_hasher_intenum_qualname_round_trip-- directCacheValueunit test for mixed IntEnum / None / plain-int slots, qualname derivation, and the resolver fallback.Both pass on x64 and CUDA on the cluster. Older v3 caches just invalidate via the version bump, no migration needed.