fix(node): handle ant-node auto-upgrade without parking nodes in Stopped #53
Merged
jacderida merged 1 commit into WithAutonomi:rc-2026.4.2 from Apr 22, 2026
ant-node replaces its binary on disk and later exits so a service manager can restart it. The daemon was classifying that exit as a clean stop, leaving nodes marked `Stopped` with a stale version and tempting users to restart them manually.

Nodes spawned by the daemon now run with `--stop-on-upgrade`, and the supervisor polls each running node's on-disk binary version every 60s. When the disk version drifts from the registry, the node transitions to a new `NodeStatus::UpgradeScheduled` variant (with `pending_version` on the status payload) and `NodeEvent::UpgradeScheduled` fires. On process exit, the supervisor respawns the node against the new binary, refreshes `NodeConfig.version` in the registry, and emits `NodeEvent::NodeUpgraded`. `Stopped` is now reserved for user-initiated stops only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
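The exit handling described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from `supervisor.rs`: `NodeRecord`, `ExitAction`, `classify_exit`, and `finish_upgrade` are hypothetical names, the status enum is cut down to the variants mentioned in this PR, and the on-disk version is passed in as a string rather than read from the binary.

```rust
// Simplified stand-in for the real status enum (assumption: the real one
// has more variants and lives in ant-core/src/node/types.rs).
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeStatus {
    Running,
    Stopping,
    Stopped,
    UpgradeScheduled,
}

// Hypothetical registry entry: the recorded version plus the pending one.
#[derive(Debug)]
struct NodeRecord {
    status: NodeStatus,
    version: String,
    pending_version: Option<String>,
}

// Hypothetical result of inspecting a process exit.
#[derive(Debug, PartialEq)]
enum ExitAction {
    /// The user asked for the stop; the stop path owns the `Stopped` state.
    Return,
    /// The binary changed on disk; respawn at once, without backoff.
    RespawnUpgraded { old_version: String, new_version: String },
    /// Any other exit goes through the crash/backoff path.
    CrashBackoff,
}

fn classify_exit(node: &NodeRecord, exit_code: Option<i32>, disk_version: &str) -> ExitAction {
    let upgraded = || ExitAction::RespawnUpgraded {
        old_version: node.version.clone(),
        new_version: disk_version.to_string(),
    };
    match node.status {
        NodeStatus::Stopping | NodeStatus::Stopped => ExitAction::Return,
        NodeStatus::UpgradeScheduled => upgraded(),
        // Synchronous re-check: a fast upgrade may exit before the poll sees it.
        NodeStatus::Running if exit_code == Some(0) && disk_version != node.version => upgraded(),
        NodeStatus::Running => ExitAction::CrashBackoff,
    }
}

// After the replacement process is up: persist the version, clear the
// pending marker, and return the node to `Running`.
fn finish_upgrade(node: &mut NodeRecord, new_version: &str) {
    node.version = new_version.to_string();
    node.pending_version = None;
    node.status = NodeStatus::Running;
}
```

In this sketch a clean exit never produces `Stopped`: the only way to reach that state is through the stop path, which sets `Stopping` before the process exits.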
Summary
- `ant node status` no longer leaves auto-upgraded nodes stuck in `Stopped` with a stale version. The daemon now recognises ant-node's upgrade-and-exit as an expected lifecycle event and respawns the node against the new binary.
- `NodeStatus::UpgradeScheduled` (plus a `pending_version` field on status payloads) surfaces the transient state to `ant node status` and the REST/SSE API.
- `Stopped` is now reserved strictly for user-initiated stops (`ant node stop`); all other exits either respawn (upgrade) or route through the crash/backoff path.

What changed
- `ant-core/src/node/daemon/supervisor.rs`
  - `build_node_args` now always appends `--stop-on-upgrade` so ant-node relies on the daemon to restart it rather than spawning its own grandchild (which races for the node's port during graceful shutdown).
  - `spawn_upgrade_monitor` background task: for each `Running` node, polls `<binary> --version` every 60 s; on drift from the registry, flips the node to `UpgradeScheduled` and fires `NodeEvent::UpgradeScheduled`.
  - `monitor_node` branches on status at exit: `Stopping`/`Stopped` returns; `UpgradeScheduled` respawns immediately without backoff. A synchronous version re-check on code-0 exit catches fast upgrades that slip past the poll. Clean exits are never marked `Stopped`; the intentional-stop path owns that transition.
  - `respawn_upgraded_node` helper spawns the replacement, re-reads the version, persists it to the registry, fires `NodeEvent::NodeUpgraded`, and returns the node to `Running`.
- `ant-core/src/node/types.rs`: adds `NodeStatus::UpgradeScheduled` and `pending_version: Option<String>` (serde-skipped when `None`) on `NodeStatusSummary` and `NodeInfo`.
- `ant-core/src/node/events.rs`: adds `NodeEvent::UpgradeScheduled { node_id, pending_version }` and `NodeEvent::NodeUpgraded { node_id, old_version, new_version }` plus their SSE event type strings.
- `ant-core/src/node/binary.rs`: `extract_version` is now `pub(crate)` so the supervisor can reuse it.
- `ant-core/src/node/daemon/server.rs`: `AppState.registry` is `Arc`-wrapped so the upgrade monitor can access it; handlers populate `pending_version`; `UpgradeScheduled` counts as running in node-count totals; the monitor is spawned from `start()`.
- `ant-cli/src/commands/node/status.rs`: new match arm for `UpgradeScheduled` (cyan glyph) and `current → pending` version display when `pending_version` is set.
- `.gitignore`: ignore `.claude/scheduled_tasks.lock`.

Test plan
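Two of the automated tests listed in this section are named `mark_upgrade_scheduled_only_affects_running_nodes` and `node_counts_counts_upgrade_scheduled_as_running`. The properties those names suggest can be illustrated with a small sketch. Only the two function names come from the test names; their signatures, the `NodeEntry` struct, and the reduced `NodeStatus` enum are assumptions made for illustration.

```rust
// Reduced status enum (assumption: the real enum has more variants).
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeStatus {
    Running,
    Stopped,
    UpgradeScheduled,
}

// Hypothetical per-node record holding the registry version and, while an
// upgrade is pending, the version found on disk.
#[derive(Debug)]
struct NodeEntry {
    status: NodeStatus,
    version: String,
    pending_version: Option<String>,
}

// Flip a node to `UpgradeScheduled` only if it is currently `Running` and
// the on-disk version differs from the recorded one. Returns true when the
// transition happened, which is when an event would be emitted.
fn mark_upgrade_scheduled(node: &mut NodeEntry, disk_version: &str) -> bool {
    if node.status != NodeStatus::Running || node.version == disk_version {
        return false;
    }
    node.status = NodeStatus::UpgradeScheduled;
    node.pending_version = Some(disk_version.to_string());
    true
}

// Returns (running, not_running). A node waiting for its upgrade exit is
// still a live process, so it is counted as running.
fn node_counts(nodes: &[NodeEntry]) -> (usize, usize) {
    let running = nodes
        .iter()
        .filter(|n| matches!(n.status, NodeStatus::Running | NodeStatus::UpgradeScheduled))
        .count();
    (running, nodes.len() - running)
}
```

Counting `UpgradeScheduled` as running keeps the totals stable across the transient state, so the running count does not dip while a node waits for its upgrade window.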
Automated
- `cargo test --package ant-core --lib`: 141 pass (includes new tests for `UpgradeScheduled` serialisation, `NodeEvent::UpgradeScheduled` / `NodeUpgraded` serialisation, `mark_upgrade_scheduled_only_affects_running_nodes`, and `node_counts_counts_upgrade_scheduled_as_running`)
- `cargo test --package ant-core --test daemon_integration --test node_add_integration`: 5 + 3 pass
- `cargo clippy --all-targets --all-features -- -D warnings`: clean
- `cargo fmt --all -- --check`: clean

Manual verification on DEV-02 testnet (real auto-upgrade flow)
Three nodes added against DEV-02 at v0.10.1; binary URL downloads a build configured for a 1-hour staged rollout. Each node hit its scheduled upgrade time and restarted on the new binary.
- `running:0.10.1` → `running:0.10.11-rc.1` within one 20 s monitor tick. New pid, new version in the registry.
- `upgrade_scheduled` with `pending_version=0.10.11-rc.1`; the state held for 13 minutes until ant-node's natural upgrade window (15:14) triggered the exit, after which the supervisor respawned it to `running:0.10.11-rc.1`.

Throughout the run (≈150 status snapshots at 20 s intervals), no node was ever observed in `Stopped` except when explicitly stopped by the test harness.

Representative live CLI output during the synthetic test:
Notes / known follow-ups
`UpgradeScheduled` is only reported when the binary has been replaced on disk. Between ant-node deciding "I'll upgrade at T" and physically replacing the binary, there is still a staged-rollout window (up to 1 h) that is invisible to the CLI. Extending visibility to that pre-replacement window would require either parsing ant-node's log output or adding a status-file / RPC to ant-node; out of scope here.

🤖 Generated with Claude Code