[Bugfix] update-node-ready-logic #615

Draft
ankrovv wants to merge 2 commits into ome-projects:main from ankrovv:ankrovv/update-node-ready-logic

Conversation

ankrovv commented May 19, 2026

What this PR does

Updates model-agent node Ready handling in two related ways:

  • Persists storageUri and storagePath in each node-scoped model status ConfigMap entry, then ignores stale Ready/Failed entries when the current BaseModel or ClusterBaseModel spec points at a different artifact.
  • Adds periodic artifact integrity reconciliation for models that are already marked Ready on the local node. Basic checks validate required local artifacts cheaply; less frequent deep checks validate persisted SHA256 manifests and reuse OCI Object Storage metadata validation where available.

When integrity validation finds a concrete local artifact failure, model-agent updates both the node model-ready label and the node-scoped ConfigMap entry to Failed, allowing BaseModel and ClusterBaseModel status to move the node from nodesReady to nodesFailed.
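
The stale-entry rule from the first bullet above could look roughly like the following Go sketch. This is an illustration only, not the code in this PR: the types nodeModelEntry and modelSpec, the function isStale, and the example URIs are hypothetical, and the status values are assumed to be plain strings.

```go
package main

import "fmt"

// nodeModelEntry is a hypothetical shape for one node-scoped
// model status ConfigMap entry.
type nodeModelEntry struct {
	Status      string // assumed values: "Ready", "Failed", or an in-progress state
	StorageURI  string // storageUri recorded when the entry was written
	StoragePath string // storagePath recorded when the entry was written
}

// modelSpec is a hypothetical subset of a BaseModel or
// ClusterBaseModel spec.
type modelSpec struct {
	StorageURI  string
	StoragePath string
}

// isStale reports whether a terminal (Ready or Failed) entry was
// recorded for a different artifact than the one the current spec
// points at. Non-terminal entries are never treated as stale here.
func isStale(e nodeModelEntry, s modelSpec) bool {
	if e.Status != "Ready" && e.Status != "Failed" {
		return false
	}
	return e.StorageURI != s.StorageURI || e.StoragePath != s.StoragePath
}

func main() {
	// Hypothetical example values.
	spec := modelSpec{StorageURI: "oci://bucket/model-v2", StoragePath: "/models/v2"}
	old := nodeModelEntry{Status: "Ready", StorageURI: "oci://bucket/model-v1", StoragePath: "/models/v1"}
	cur := nodeModelEntry{Status: "Ready", StorageURI: "oci://bucket/model-v2", StoragePath: "/models/v2"}

	fmt.Println("old entry stale:", isStale(old, spec))
	fmt.Println("current entry stale:", isStale(cur, spec))
}
```

In this sketch an entry with empty recorded fields (for example, one written before these fields existed) compares as different from any non-empty spec and is therefore treated as stale. Whether the PR handles such entries the same way is not stated in the description.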

Why we need it

A model can keep the same BaseModel/ClusterBaseModel name while its storage source or local path changes. Without checking the storage identity, a node can continue advertising Ready for an older artifact and receive workloads for a path that has not been downloaded on that node.

Ready artifacts can also be deleted or corrupted after initial download. The periodic checker closes that gap by validating already-Ready local artifacts and stopping the node from advertising readiness for a bad model copy.

Related to #613.

How to test

Passed locally:

GOCACHE=/private/tmp/ome-go-build-cache go test ./pkg/modelagent ./pkg/ociobjectstore ./cmd/model-agent ./pkg/controller/v1beta1/basemodel
git diff --check origin/main

Also run locally:

GOCACHE=/private/tmp/ome-go-build-cache go test ./...

The full suite did not complete because this local environment is missing /usr/local/kubebuilder/bin/etcd, which the envtest packages require, and the existing pkg/xet test still fails with the error "cache directory is required". The packages touched by this PR passed in the focused test command above.

Checklist

  • Tests added/updated
  • Docs updated
  • Focused model-agent/controller/object-store tests pass locally

github-actions bot added the model-agent, controller, and tests labels May 19, 2026
github-actions bot added the documentation, helm, storage, and config labels May 20, 2026
@YouNeedCryDear (Collaborator)

@ankrovv This might be hard to review. Could you break the changes down into multiple PRs? Could you also include a high-level design doc for the implementation plan?

ankrovv marked this pull request as draft May 20, 2026 22:17