
fix(launcher+hardware): #72 hung-backend signal + #65 SF2 fallback log #83

Merged
mbachaud merged 1 commit into master from fix/followups-72-65sf2 on May 12, 2026


@mbachaud
Owner

Summary

Two narrow fixes from the open-issues triage following PR #82.

Closes #72, partially closes #65 (SF2 only).

Issue triage (#64, #65, #72, #73, #74)

| # | Status |
|---|---|
| #72 REST hung backend | Fixed in this PR (option 2 from the issue: started_pending field + 202). |
| #65 SF1 Headroom rewrites HELIX_SERVER_UPSTREAM silently | Already fixed. app.py:329-339 already logs both ON (%s -> %s) and OFF (direct upstream) transitions with both values. |
| #65 SF2 Hardware fallback misses headless | Fixed in this PR (summary WARNING line in _detect). |
| #65 SF3 Ribosome paid WARN easy to miss | Already fixed. server.py:677-685 emits the startup WARN; the /health.ribosome_cost_class field exists (server.py:2888); the helix_ribosome_info{cost_class=...} Prometheus info-metric exists (telemetry.py:534-548). |
| #65 SF4 WAL bloat from non-canonical readers | Effectively addressed. docs/TROUBLESHOOTING.md already has the WAL bloat section with detection guidance and the /admin/checkpoint?mode=TRUNCATE admin endpoint. The issue suggested a separate docs/operational/wal-bloat.md; the content is there, just under TROUBLESHOOTING. Recommend closing as effectively done unless you want the content split out. |
| #64 Spec-vs-code design questions (HELIX_DEVICE overload / /consolidate body / <helix:no_match/> freshness) | Not actionable from a code-only PR. Each is three options that need a design call before any code moves. |
| #73 BROAD tighten expression_tokens 12000→6k-8k | Bench-gated. The issue's acceptance criteria require running the standard bench suite (diamond / needle_1000) at the candidate value, demonstrating no recall regression, and committing the bench artifact under overnight_logs/. Out of scope for a code-only PR. |
| #74 PLR gate ([plr] enabled = true) | Bench-gated + artifact-gated. Requires training a fresh stacked_plr.joblib, committing it, and the standard bench delta. Out of scope for a code-only PR. |
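For the #72 row, a rough sketch of how external automation might consume the new response shape. The helper names (`classify_start`, `wait_until_ready`, `is_ready`) are hypothetical, and the assumption that the ready case returns HTTP 200 is mine; the PR only states that the pending case returns 202 with `started_pending: true`.

```python
import time
from typing import Callable, Mapping


def classify_start(status: int, body: Mapping[str, object]) -> str:
    """Map a /api/control/start reply to 'ready', 'pending', or 'error'."""
    if status == 202 and body.get("started_pending") is True:
        return "pending"  # process is alive but not yet ready
    if status == 200 and body.get("ok") is True:
        return "ready"    # assumed success status; not confirmed by the PR
    return "error"


def wait_until_ready(
    status: int,
    body: Mapping[str, object],
    is_ready: Callable[[], bool],  # hypothetical readiness probe supplied by the caller
    attempts: int = 5,
    delay_s: float = 0.0,
) -> bool:
    """Return True once the backend is ready, polling only in the pending case."""
    state = classify_start(status, body)
    if state == "ready":
        return True
    if state == "error":
        return False
    for _ in range(attempts):
        if is_ready():
            return True
        time.sleep(delay_s)
    return False
```

The point of the 202 is that a caller no longer has to treat every `ok: true` reply as "ready"; it can branch on the status code or on `started_pending` and poll only when needed.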

Changes

| File | Diff |
|---|---|
| helix_context/launcher/supervisor.py | _last_start_pending attr (init to False), last_start_pending property; set on the adopt path (False), timeout path (True), and success path (False) |
| helix_context/launcher/app.py | /api/control/start and /api/control/restart return 202 + started_pending: true on pending; drops StartupTimeout from except tuples + import |
| helix_context/hardware.py | _detect() emits log.warning("Hardware fallback: requested=X active=cpu — ...") on the explicit-device fallback path |
| tests/test_launcher_app.py | Updated success-shape assertions; new test_start_pending_returns_202_with_started_pending_field + restart counterpart; replaced obsolete test_start_timeout_returns_500 with test_start_supervisor_error_returns_500 |
| tests/test_launcher_supervisor.py | Existing happy-path + timeout tests now also assert last_start_pending |
| tests/test_hardware.py | test_fallback_emits_summary_warning_for_headless_operators (pins WARNING content), test_auto_picker_does_not_emit_fallback_warning (pins no-fire-on-auto-to-cpu invariant) |
| CHANGELOG.md | Two new Unreleased entries |
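The launcher change can be pictured with a minimal, framework-free sketch. Everything here is a stand-in: `SketchSupervisor`, `probe_ready`, `start_response`, the placeholder PID, the message text, and the `ok: True` value in the pending body are assumptions for illustration, not the real helix_context code. Only the `last_start_pending` property, the 202 status, and the `started_pending` field come from the PR.

```python
from typing import Callable, Dict, Tuple


class SketchSupervisor:
    """Hypothetical stand-in for the launcher supervisor, not the real class."""

    def __init__(self, probe_ready: Callable[[], bool], pid: int = 4242) -> None:
        self._probe_ready = probe_ready   # assumed readiness check (e.g. polling /stats)
        self._pid = pid                   # placeholder PID for illustration
        self._last_start_pending = False  # initialised to False, as the PR describes

    @property
    def last_start_pending(self) -> bool:
        return self._last_start_pending

    def start(self) -> int:
        # Timeout path sets the flag, success path clears it; neither raises.
        self._last_start_pending = not self._probe_ready()
        return self._pid


def start_response(sup: SketchSupervisor) -> Tuple[int, Dict[str, object]]:
    """Return (HTTP status, JSON body) the way the REST handler might."""
    pid = sup.start()
    if sup.last_start_pending:
        return 202, {
            "ok": True,  # assumed value; the PR lists the key only
            "pid": pid,
            "started_pending": True,
            "message": "backend started but not ready yet",  # hypothetical wording
        }
    return 200, {"ok": True, "pid": pid, "started_pending": False}
```

Keeping the pending state on the supervisor rather than raising lets the handler choose the status code without an exception path, which is consistent with StartupTimeout being dropped from the except tuples.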

Test plan

  • pytest tests/test_launcher_app.py tests/test_launcher_supervisor.py tests/test_hardware.py — 101/101 pass.
  • Full mock suite: pytest tests/ -m "not live": 1930 passed / 15 skipped / 21 deselected / 2 xfailed in 8m10s. Master baseline was 1924; this PR adds 4 tests + 2 assertions.
  • No live tests in scope (no ribosome / no embedding model paths touched).

Recommendation on remaining open issues

- launcher: PR #68 made supervisor.start() non-fatal on /stats timeout,
  but the REST handler at /api/control/start still returned
  {ok: true, pid} on the timeout path so external automation
  couldn't distinguish ready from alive-but-not-ready. New
  Supervisor.last_start_pending flag is set on the timeout branch
  (and cleared on success/adoption); REST returns 202 Accepted with
  {ok, pid, started_pending: true, message: "..."} when pending.
  Same treatment on /api/control/restart. Drops the now-unreachable
  StartupTimeout from the except tuples. Closes #72.

- hardware: explicit-device probe failures fired a tray balloon but
  emitted no summary log line, so headless deployments (server,
  supervisor-managed, agents) had no signal connecting the
  per-candidate probe WARNINGs to the final "we ended up on CPU"
  outcome. _detect() now emits one
  log.warning("Hardware fallback: requested=X active=cpu — ...")
  on the explicit-device fallback path. auto -> cpu is unchanged
  (normal outcome on a CPU-only box, not a fallback). Closes #65 SF2.

- Tests:
  * test_launcher_app: test_start_success / test_restart_success now
    assert started_pending=False; new test_start_pending_returns_202
    and test_restart_pending_returns_202 pin the 202 + body shape;
    test_start_supervisor_error_returns_500 replaces the obsolete
    timeout-raises-500 test.
  * test_launcher_supervisor: existing happy-path + timeout tests
    now also assert last_start_pending=False/True respectively.
  * test_hardware: test_fallback_emits_summary_warning_for_headless_
    operators pins the new WARNING; test_auto_picker_does_not_emit_
    fallback_warning pins the no-fire-on-auto-to-cpu invariant.
  * Full suite: 1930 passed / 15 skipped / 21 deselected / 2 xfailed
    (master baseline 1924; this PR adds 4 tests + 2 assertions).
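The hardware bullet above can be illustrated with a small sketch of the two paths: an explicit device that fails its probe logs one summary WARNING, while auto -> cpu stays silent. The `detect` function, the `probes` mapping, the logger name, and the text after `active=cpu` are hypothetical; the real `_detect()` signature and full message are not shown in this PR.

```python
import logging
from typing import Callable, Dict

log = logging.getLogger("hardware_sketch")  # hypothetical logger name


def detect(requested: str, probes: Dict[str, Callable[[], bool]]) -> str:
    """Pick a device; warn only when an explicitly requested device is unavailable."""
    if requested == "auto":
        for name, probe in probes.items():
            if probe():
                return name
        return "cpu"  # normal outcome on a CPU-only box, not a fallback: no warning

    probe = probes.get(requested)
    if requested == "cpu" or (probe is not None and probe()):
        return requested

    # Explicit-device fallback path: one summary line for headless operators.
    log.warning(
        "Hardware fallback: requested=%s active=cpu (explicit device unavailable)",
        requested,
    )
    return "cpu"
```

A single summary line gives log-only deployments one place to look, instead of having to infer the final device from the per-candidate probe warnings.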

#65 SF1, SF3, SF4 are already addressed on master (per-rewrite
log.info, cost_class in /health + Prometheus info metric + startup
WARN, and the WAL-bloat section in docs/TROUBLESHOOTING.md +
/admin/checkpoint endpoint); only SF2 needed code.

#73, #74 are bench-gated and out of scope for a code-only PR.
#64 is three open design questions (HELIX_DEVICE overload,
/consolidate body, <helix:no_match/> freshness) that need a
decision before code moves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mbachaud mbachaud merged commit 1151a03 into master May 12, 2026
3 checks passed
@mbachaud mbachaud deleted the fix/followups-72-65sf2 branch May 12, 2026 22:59
