fix(node): handle ant-node auto-upgrade without parking nodes in Stopped #53

Merged
jacderida merged 1 commit into WithAutonomi:rc-2026.4.2 from jacderida:fix-node_upgrade_status
Apr 22, 2026

@jacderida
Contributor

Summary

  • ant node status no longer leaves auto-upgraded nodes stuck in Stopped with a stale version. The daemon now recognises ant-node's upgrade-and-exit as an expected lifecycle event and respawns the node against the new binary.
  • New NodeStatus::UpgradeScheduled (plus pending_version field on status payloads) surfaces the transient state to ant node status and the REST/SSE API.
  • Stopped is now reserved strictly for user-initiated stops (ant node stop); all other exits either respawn (upgrade) or route through the crash/backoff path.

What changed

  • ant-core/src/node/daemon/supervisor.rs
    • build_node_args now always appends --stop-on-upgrade so ant-node relies on the daemon to restart it rather than spawning its own grandchild (which races for the node's port during graceful shutdown).
    • New spawn_upgrade_monitor background task: for each Running node, polls <binary> --version every 60 s; on drift from the registry, flips the node to UpgradeScheduled and fires NodeEvent::UpgradeScheduled.
    • monitor_node branches on status at exit: Stopping/Stopped returns; UpgradeScheduled respawns immediately without backoff. A synchronous version re-check on code-0 exit catches fast upgrades that slip past the poll. Clean exits are never marked Stopped — the intentional-stop path owns that transition.
    • New respawn_upgraded_node helper spawns the replacement, re-reads the version, persists it to the registry, fires NodeEvent::NodeUpgraded, and returns the node to Running.
  • ant-core/src/node/types.rs: adds NodeStatus::UpgradeScheduled and pending_version: Option<String> (serde-skipped when None) on NodeStatusSummary and NodeInfo.
  • ant-core/src/node/events.rs: adds NodeEvent::UpgradeScheduled { node_id, pending_version } and NodeEvent::NodeUpgraded { node_id, old_version, new_version } plus their SSE event type strings.
  • ant-core/src/node/binary.rs: extract_version is now pub(crate) so the supervisor can reuse it.
  • ant-core/src/node/daemon/server.rs: AppState.registry is Arc-wrapped so the upgrade monitor can access it; handlers populate pending_version; UpgradeScheduled counts as running in node-count totals; the monitor is spawned from start().
  • ant-cli/src/commands/node/status.rs: new match arm for UpgradeScheduled (cyan glyph) and a current → pending version display when pending_version is set.
  • .gitignore: ignore .claude/scheduled_tasks.lock.
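
The exit-time branching described for monitor_node can be sketched as a pure decision function. This is a simplified illustration, not the code from supervisor.rs: the `Status` and `ExitAction` types and the `decide_on_exit` signature are hypothetical, and the real supervisor works with async tasks and the registry rather than plain arguments.

```rust
/// Simplified stand-in for the daemon's node status (not the real `NodeStatus`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    Stopping,
    Stopped,
    UpgradeScheduled,
}

/// What the supervisor does once the child process has exited (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitAction {
    /// User-initiated stop: the intentional-stop path owns the transition.
    Return,
    /// Respawn immediately against the new binary, with no backoff.
    RespawnUpgraded,
    /// Every other exit goes through the crash/backoff path.
    CrashBackoff,
}

/// Decide how to handle a process exit.
///
/// `disk_version` is the version re-read from the on-disk binary at exit time,
/// or `None` if it could not be read.
pub fn decide_on_exit(
    status: Status,
    exit_code: Option<i32>,
    registry_version: &str,
    disk_version: Option<&str>,
) -> ExitAction {
    match status {
        Status::Stopping | Status::Stopped => ExitAction::Return,
        Status::UpgradeScheduled => ExitAction::RespawnUpgraded,
        Status::Running => {
            // Synchronous re-check: a clean exit with a changed binary is an
            // upgrade that slipped past the periodic poll.
            let clean_exit = exit_code == Some(0);
            let drifted = matches!(disk_version, Some(v) if v != registry_version);
            if clean_exit && drifted {
                ExitAction::RespawnUpgraded
            } else {
                // A clean exit without drift is never mapped to Stopped.
                ExitAction::CrashBackoff
            }
        }
    }
}
```

Note that no branch produces a `Stopped` transition, which matches the rule that only `ant node stop` puts a node in that state.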

Test plan

Automated

  • cargo test --package ant-core --lib — 141 pass (includes new tests for UpgradeScheduled serialisation, NodeEvent::UpgradeScheduled/NodeUpgraded serialisation, mark_upgrade_scheduled_only_affects_running_nodes, and node_counts_counts_upgrade_scheduled_as_running)
  • cargo test --package ant-core --test daemon_integration --test node_add_integration — 5 + 3 pass
  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo fmt --all -- --check — clean
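
As a rough illustration of the invariants that the tests mark_upgrade_scheduled_only_affects_running_nodes and node_counts_counts_upgrade_scheduled_as_running appear to pin down, here is a std-only sketch. The `Node` struct, the reduced `Status` enum, and both function signatures are assumptions made for this example; the real types live in ant-core and carry more fields.

```rust
/// Reduced stand-in for the real status enum (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    Stopped,
    UpgradeScheduled,
}

/// Minimal node record for the sketch (hypothetical).
#[derive(Debug, Clone)]
pub struct Node {
    pub version: String,
    pub status: Status,
    pub pending_version: Option<String>,
}

/// Flip a node to `UpgradeScheduled`, but only if it is currently `Running`.
/// Returns `true` when the node was changed.
pub fn mark_upgrade_scheduled(node: &mut Node, pending: &str) -> bool {
    if node.status != Status::Running {
        return false;
    }
    node.status = Status::UpgradeScheduled;
    node.pending_version = Some(pending.to_string());
    true
}

/// Count nodes that are live processes: a node waiting for its upgrade
/// window is still running, so it is included in the total.
pub fn running_count(nodes: &[Node]) -> usize {
    nodes
        .iter()
        .filter(|n| matches!(n.status, Status::Running | Status::UpgradeScheduled))
        .count()
}
```

With one running and one stopped node, marking both as upgrade-scheduled changes only the running one, and the running total stays at one.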

Manual verification on DEV-02 testnet (real auto-upgrade flow)

Three nodes added against DEV-02 at v0.10.1; binary URL downloads a build configured for a 1-hour staged rollout. Each node hit its scheduled upgrade time and restarted on the new binary.

  • node-1 (apply 14:41:16 UTC): running:0.10.1 → running:0.10.11-rc.1 within one 20 s monitor tick. New pid, new version in the registry.
  • node-2 (apply 14:57:33 UTC): same clean transition.
  • node-3 synthetic test: atomic-renamed the new binary into its slot while the process was alive. 43 s later the poll flipped status to upgrade_scheduled with pending_version=0.10.11-rc.1; the state held for 13 minutes until ant-node's natural upgrade window (15:14) triggered the exit, after which the supervisor respawned it to running:0.10.11-rc.1.

Throughout the run (≈150 status snapshots at 20 s intervals), no node was ever observed in Stopped except when explicitly stopped by the test harness.

Representative live CLI output during the synthetic test:

ID   Name           Version                 Status
──────────────────────────────────────────────────────
 1   node1          0.10.11-rc.1            ● Running
 2   node2          0.10.11-rc.1            ● Running
 3   node3          0.10.1 → 0.10.11-rc.1   ● Upgrade scheduled
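
The Version column above could be produced by a helper along these lines. This is only a sketch: `version_cell` is a hypothetical name, and the actual rendering in ant-cli/src/commands/node/status.rs may differ.

```rust
/// Render the Version column: show "current → pending" while an upgrade is
/// scheduled, and just the current version otherwise.
pub fn version_cell(current: &str, pending: Option<&str>) -> String {
    match pending {
        Some(p) => format!("{} → {}", current, p),
        None => current.to_string(),
    }
}
```

For node3 in the table, `version_cell("0.10.1", Some("0.10.11-rc.1"))` yields `0.10.1 → 0.10.11-rc.1`.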

Notes / known follow-ups

  • This PR detects UpgradeScheduled when the binary has been replaced on disk. Between ant-node deciding "I'll upgrade at T" and physically replacing the binary, there is still a staged-rollout window (up to 1 h) that is invisible to the CLI. Extending visibility to that pre-replacement window would require either parsing ant-node's log output or adding a status-file / RPC to ant-node; out of scope here.

🤖 Generated with Claude Code

ant-node replaces its binary on disk and later exits so a service manager
can restart it. The daemon was classifying that exit as a clean stop,
leaving nodes marked `Stopped` with a stale version and tempting users to
restart them manually.

Nodes spawned by the daemon now run with `--stop-on-upgrade`, and the
supervisor polls each running node's on-disk binary version every 60s.
When the disk version drifts from the registry, the node transitions to a
new `NodeStatus::UpgradeScheduled` variant (with `pending_version` on the
status payload) and `NodeEvent::UpgradeScheduled` fires. On process exit,
the supervisor respawns the node against the new binary, refreshes
`NodeConfig.version` in the registry, and emits `NodeEvent::NodeUpgraded`.
`Stopped` is now reserved for user-initiated stops only.
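
A minimal sketch of the bookkeeping after a respawn, assuming the replacement process has already been spawned and its version re-read. `NodeRecord`, `NodeUpgraded`, and `finish_upgrade` are hypothetical names for illustration; clearing `pending_version` at this point is also an assumption, not something the description states.

```rust
/// Reduced stand-in for the real status enum (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    UpgradeScheduled,
}

/// Stand-in for the payload of `NodeEvent::NodeUpgraded`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NodeUpgraded {
    pub node_id: u32,
    pub old_version: String,
    pub new_version: String,
}

/// Minimal registry entry for the sketch (hypothetical).
#[derive(Debug, Clone)]
pub struct NodeRecord {
    pub id: u32,
    pub version: String,
    pub status: Status,
    pub pending_version: Option<String>,
}

/// Record the new version, return the node to `Running`, and build the event.
pub fn finish_upgrade(node: &mut NodeRecord, reread_version: &str) -> NodeUpgraded {
    let new_version = reread_version.to_string();
    // Swap in the new version and keep the old one for the event payload.
    let old_version = std::mem::replace(&mut node.version, new_version.clone());
    // Assumption: the pending marker is cleared once the upgrade completes.
    node.pending_version = None;
    node.status = Status::Running;
    NodeUpgraded {
        node_id: node.id,
        old_version,
        new_version,
    }
}
```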

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jacderida changed the base branch from main to rc-2026.4.2 on April 22, 2026, 22:31
jacderida merged commit 91af993 into WithAutonomi:rc-2026.4.2 on Apr 22, 2026
11 of 12 checks passed