[Bugfix] update-node-ready-logic#615
Draft
ankrovv wants to merge 2 commits into
Draft
Conversation
Collaborator
|
@ankrovv This might be hard to review. Could you breakdown the changes into multiple PRs. Also include a high level design doc for the implementation plan? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Updates model-agent node Ready handling in two related ways:
storageUriandstoragePathin each node-scoped model status ConfigMap entry, then ignores stale Ready/Failed entries when the current BaseModel or ClusterBaseModel spec points at a different artifact.When integrity validation finds a concrete local artifact failure, model-agent updates both the node model-ready label and the node-scoped ConfigMap entry to
Failed, allowing BaseModel and ClusterBaseModel status to move the node fromnodesReadytonodesFailed.Why we need it
A model can keep the same BaseModel/ClusterBaseModel name while its storage source or local path changes. Without checking the storage identity, a node can continue advertising Ready for an older artifact and receive workloads for a path that has not been downloaded on that node.
Ready artifacts can also be deleted or corrupted after initial download. The periodic checker closes that gap by validating already-Ready local artifacts and stopping the node from advertising readiness for a bad model copy.
Related to #613.
How to test
Passed locally:
GOCACHE=/private/tmp/ome-go-build-cache go test ./pkg/modelagent ./pkg/ociobjectstore ./cmd/model-agent ./pkg/controller/v1beta1/basemodel git diff --check origin/mainAlso run locally:
GOCACHE=/private/tmp/ome-go-build-cache go test ./...The full suite did not complete because this local environment is missing
/usr/local/kubebuilder/bin/etcdfor envtest packages, and the existingpkg/xettest still fails withcache directory is required. The packages touched by this PR passed in the focused test command above.Checklist