fix(launcher+hardware): #72 hung-backend signal + #65 SF2 fallback log #83
Merged
- launcher: PR #68 made supervisor.start() non-fatal on /stats timeout, but the REST handler at /api/control/start still returned {ok: true, pid} on the timeout path, so external automation couldn't distinguish ready from alive-but-not-ready. New Supervisor.last_start_pending flag is set on the timeout branch (and cleared on success/adoption); REST returns 202 Accepted with {ok, pid, started_pending: true, message: "..."} when pending. Same treatment on /api/control/restart. Drops the now-unreachable StartupTimeout from the except tuples. Closes #72.

- hardware: explicit-device probe failures fired a tray balloon but emitted no summary log line, so headless deployments (server, supervisor-managed, agents) had no signal connecting the per-candidate probe WARNINGs to the final "we ended up on CPU" outcome. _detect() now emits one log.warning("Hardware fallback: requested=X active=cpu — ...") on the explicit-device fallback path. auto -> cpu is unchanged (normal outcome on a CPU-only box, not a fallback). Closes #65 SF2.

- Tests:
  * test_launcher_app: test_start_success / test_restart_success now assert started_pending=False; new test_start_pending_returns_202 and test_restart_pending_returns_202 pin the 202 + body shape; test_start_supervisor_error_returns_500 replaces the obsolete timeout-raises-500 test.
  * test_launcher_supervisor: existing happy-path + timeout tests now also assert last_start_pending=False/True respectively.
  * test_hardware: test_fallback_emits_summary_warning_for_headless_operators pins the new WARNING; test_auto_picker_does_not_emit_fallback_warning pins the no-fire-on-auto-to-cpu invariant.
  * Full suite: 1930 passed / 15 skipped / 21 deselected / 2 xfailed (master baseline 1924; this PR adds 4 tests + 2 assertions).

#65 SF1, SF3, SF4 are already addressed on master (per-rewrite log.info, cost_class in /health + Prometheus info metric + startup WARN, and the WAL-bloat section in docs/TROUBLESHOOTING.md + /admin/checkpoint endpoint); only SF2 needed code. #73, #74 are bench-gated and out of scope for a code-only PR. #64 is three open design questions (HELIX_DEVICE overload, /consolidate body, <helix:no_match/> freshness) that need a decision before code moves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
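For readers who want the hardware rule from the commit message in concrete form, here is a minimal sketch. It is not the project's `_detect()`: the `resolve_device` name, the `probe` callable, the `candidates` parameter, and the logger name are all hypothetical. The log text uses only the prefix quoted in the commit message; the real line continues with text this page does not show.

```python
import logging
from typing import Callable, Sequence

log = logging.getLogger("hardware_sketch")  # hypothetical logger name


def resolve_device(
    requested: str,
    probe: Callable[[str], bool],
    candidates: Sequence[str] = (),
) -> str:
    """Return the device that ends up active (hypothetical helper).

    Emits one summary WARNING only when an explicitly requested device
    is unusable and we land on CPU. auto -> cpu stays silent.
    """
    if requested == "cpu":
        return "cpu"
    if requested == "auto":
        # Try each caller-supplied candidate; landing on CPU here is the
        # normal outcome on a CPU-only box, so nothing is logged.
        for candidate in candidates:
            if probe(candidate):
                return candidate
        return "cpu"
    if probe(requested):
        return requested
    # Explicit request failed: one summary line for headless operators.
    log.warning("Hardware fallback: requested=%s active=cpu", requested)
    return "cpu"


if __name__ == "__main__":
    # A probe that always fails stands in for a CPU-only machine.
    print(resolve_device("cuda", lambda device: False))
    print(resolve_device("auto", lambda device: False, candidates=("cuda",)))
```

Both calls return `cpu`, but only the first logs a warning, which is the invariant the two new `test_hardware` tests are described as pinning.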
Summary
Two narrow fixes from the open-issues triage following PR #82.
- `/api/control/start` returned `200 + pid` on a hung backend after PR #68 (feat(observability): native Grafana telem setup script + spawn-order + port-collision fixes) made `supervisor.start()` non-fatal on `/stats` timeout. New `supervisor.last_start_pending` flag flips on the timeout branch; REST returns 202 Accepted with `{ok, pid, started_pending: true, message}` so external automation can render a yellow "starting…" badge instead of treating it as success. Same treatment on `/api/control/restart`. The dead `StartupTimeout` paths are removed from the except tuples.
- `_detect()` now emits one `log.warning("Hardware fallback: requested=X active=cpu — ...")` on the explicit-device fallback path. `auto` → cpu stays silent (not a fallback, just a CPU-only box).

Closes #72, partially closes #65 (SF2 only).
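As an illustration of how external automation might consume the new 202 signal, here is a minimal client-side sketch. The `classify_start_response` helper and the badge-state strings are hypothetical, and the example `pid` and `message` values are made up; only the status codes and the `ok` / `started_pending` field names come from this PR.

```python
from typing import Any, Mapping


def classify_start_response(status: int, body: Mapping[str, Any]) -> str:
    """Map a /api/control/start (or /restart) reply to a badge state.

    Hypothetical helper: returns "starting" for the 202 pending case,
    "ready" for a plain success, and "failed" for anything else.
    """
    if status == 202 and body.get("started_pending") is True:
        # Process is alive (pid present) but /stats did not answer in time.
        return "starting"
    if status == 200 and body.get("ok") and not body.get("started_pending", False):
        return "ready"
    return "failed"


if __name__ == "__main__":
    # Example payloads; pid and message values are invented for the demo.
    print(classify_start_response(200, {"ok": True, "pid": 4242, "started_pending": False}))
    print(classify_start_response(
        202,
        {"ok": True, "pid": 4242, "started_pending": True, "message": "backend not ready yet"},
    ))
    print(classify_start_response(500, {"ok": False}))
```

The "ready" branch treats a missing `started_pending` as `False`, so a caller written this way would also accept the older `{ok: true, pid}` body.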
Issue triage (#64, #65, #72, #73, #74)
- … `started_pending` field + 202).
- … `HELIX_SERVER_UPSTREAM` silently … `app.py:329-339` already logs both ON (`%s -> %s`) and OFF (direct upstream) transitions with both values.
- … `WARNING` line in `_detect`).
- … `server.py:677-685` emits the startup WARN; `/health.ribosome_cost_class` field exists (`server.py:2888`); `helix_ribosome_info{cost_class=...}` Prometheus info-metric exists (`telemetry.py:534-548`).
- … `docs/TROUBLESHOOTING.md` already has the WAL bloat section with detection guidance and the `/admin/checkpoint?mode=TRUNCATE` admin endpoint. The issue suggested a separate `docs/operational/wal-bloat.md`; the content is there, just under TROUBLESHOOTING. Recommend closing as effectively done unless you want the content split out.
- … `/consolidate` body / `<helix:no_match/>` freshness) …
- … `expression_tokens` 12000→6k-8k … `overnight_logs/`. Out of scope for a code-only PR.
- … `[plr] enabled = true`) … `stacked_plr.joblib`, committing it, and the standard bench delta. Out of scope for a code-only PR.

Changes
| File | Change |
| --- | --- |
| `helix_context/launcher/supervisor.py` | `_last_start_pending` attr (init to `False`), `last_start_pending` property; set on the adopt path (`False`), timeout path (`True`), and success path (`False`) |
| `helix_context/launcher/app.py` | `/api/control/start` and `/api/control/restart` return 202 + `started_pending: true` on pending; drops `StartupTimeout` from except tuples + import |
| `helix_context/hardware.py` | `_detect()` emits `log.warning("Hardware fallback: requested=X active=cpu — ...")` on the explicit-device fallback path |
| `tests/test_launcher_app.py` | `test_start_pending_returns_202_with_started_pending_field` + restart counterpart; replaced obsolete `test_start_timeout_returns_500` with `test_start_supervisor_error_returns_500` |
| `tests/test_launcher_supervisor.py` | … `last_start_pending` |
| `tests/test_hardware.py` | `test_fallback_emits_summary_warning_for_headless_operators` (pins WARNING content), `test_auto_picker_does_not_emit_fallback_warning` (pins no-fire-on-auto-to-cpu invariant) |
| `CHANGELOG.md` | |

Test plan
- `pytest tests/test_launcher_app.py tests/test_launcher_supervisor.py tests/test_hardware.py`: 101/101 pass.
- `pytest tests/ -m "not live"`: 1930 passed / 15 skipped / 21 deselected / 2 xfailed in 8m10s. Master baseline was 1924; this PR adds 4 tests + 2 assertions.

Recommendation on remaining open issues