Skip to content

fix: detect silent filter-wheel move failures#539

Open
Alpaca233 wants to merge 5 commits into
masterfrom
fix/filter-wheel-silent-failure-detection
Open

fix: detect silent filter-wheel move failures#539
Alpaca233 wants to merge 5 commits into
masterfrom
fix/filter-wheel-silent-failure-detection

Conversation

@Alpaca233
Copy link
Copy Markdown
Collaborator

Summary

Adds firmware + host detection for a class of silent filter-wheel desync where the MCU reports a move as completed but the motor never moved. Observed in production: main_hcs.log.1 shows a MOVE_W for slot 2→1 returning "complete" in 5.9 ms when the physical move requires ~150 ms; the host then trusted the ack and updated tracked_position, leaving every subsequent filter switch off by one slot until the user manually re-homed.

Root cause is in stage_commands.cpp: each move callback only set mcu_cmd_execution_in_progress = true inside the if (tmc4361A_moveTo(...) == 0) success branch. When moveTo returned non-zero (or the early if (!enabled) return; fired), the flag stayed false, and the next periodic status broadcast reported COMPLETED_WITHOUT_ERRORS for the new cmd_id even though no motion happened. With no encoder feedback on W/W2 (HAS_ENCODER_W is False, cephla.py:114), the host had no way to notice.

Same bug pattern exists in MOVE_X / MOVE_Y / MOVE_Z and the MOVETO_* callbacks, but for those axes the firmware also broadcasts real position in every status packet (bytes 2-17), so absolute moves self-correct. The filter wheel is the one axis where this manifests as a persistent silent error.

What changed

Firmware

  • New mcu_cmd_execution_status byte global; reset to COMPLETED_WITHOUT_ERRORS on every received command and overwritten to CMD_EXECUTION_ERROR when a move callback fails. send_position_update reports it in byte 1 when the MCU is idle.
  • Move callbacks (MOVE_X/Y/Z/W/W2, MOVETO_X/Y/Z/W) restructured to set in_progress = true before tmc4361A_moveTo and call a mark_move_failed() helper on failure. The move_filterwheel and callback_move_to_w early-returns when enable_filterwheel == false now also mark failed instead of silently no-op'ing.
  • W axis microstep position broadcast in status-packet bytes 19-20 (signed int16, big-endian). Gated on enable_filterwheel because tmc4361[w] is uninitialized at boot (init.cpp:123 only loops < STAGE_AXES); reading from it would deref a NULL config pointer.
  • FIRMWARE_VERSION_MINOR bumped 1 → 2.

Host (Python)

  • Microcontroller parses bytes 19-20 into self.w_pos when firmware >= v1.2. New supports_w_pos_broadcast() helper.
  • SquidFilterWheel._move_to_position snapshots microcontroller.w_pos before each move and verifies the broadcast position changed by the commanded delta (within 50% tolerance) after wait_till_operation_is_completed returns. On mismatch it raises TimeoutError → falls into the existing re-home + retry path (cephla.py:138-159).
  • Extracted _move_and_verify helper so the initial try and the retry-after-rehome branch share one code path.

Failure-mode coverage

Scenario Before After
tmc4361A_moveTo returns non-zero Silent COMPLETED ack, host desyncs Firmware reports CMD_EXECUTION_ERROR, host times out at 5 s, re-homes
Motor stalls / misses steps mid-move Silent COMPLETED ack Host sees W position delta < expected/2, raises TimeoutError, re-homes
enable_filterwheel == false (move arrives before INITFILTERWHEEL) Silent COMPLETED ack Same fail path
Old firmware (< v1.2) n/a Host skips verification — preserves current behavior

Known follow-up: W2

The broadcast only carries W; byte 21 is still reserved. The _verify_w_move path is gated on motor_slot_index == 3, so dual-wheel setups still get the old trust-the-ack behavior for the second wheel. Adding W2 broadcast needs a design decision on encoding (raw int16 — needs another byte and a packet bump, or 1-byte slot index — needs firmware to know usteps_per_slot). Punted because no W2 silent-failure incident is on file.

Test plan

  • pytest tests/squid/test_filter_wheel.py — 8 passed
  • pytest tests/control/test_microcontroller.py — 6 passed, 1 skipped (existing skip)
  • Full suite pytest --ignore=tests/control/test_HighContentScreeningGui.py — 1206 passed, 7 skipped, 1 xfailed
  • black --config pyproject.toml --check clean on edited Python files
  • Build firmware locally and flash to a test Teensy — I can't compile firmware from this environment; please verify the C++ builds and the v1.2 status packet parses correctly on a real bench unit before merging
  • Reproduce the original incident on bench: trigger a MOVE_W while enable_filterwheel == false (boot without sending INITFILTERWHEEL) and confirm the host re-homes instead of advancing tracked_position
  • Confirm boot safety: cold-boot a unit with no filter wheel hardware connected; confirm no SPI hang or hard fault from the (now-gated) W broadcast path

🤖 Generated with Claude Code

Firmware: move callbacks (X/Y/Z/W/W2 + MOVETO variants) now set
mcu_cmd_execution_in_progress=true BEFORE tmc4361A_moveTo and report
CMD_EXECUTION_ERROR if the call fails. Previously, a moveTo failure
left mcu_cmd_execution_in_progress=false, the next status packet
broadcast COMPLETED_WITHOUT_ERRORS, and the host blindly trusted
that the move had happened. With no encoder feedback on W/W2, this
silently desynced the host's tracked filter slot from the physical
wheel (observed in main_hcs.log.1 cmd 48 returning "complete" in
5.9 ms for a move that physically requires ~150 ms).

Firmware also now broadcasts the W axis microstep position in
status-packet bytes 19-20 (gated on enable_filterwheel to avoid
reading from an uninitialized tmc4361[w] struct at boot, since
init.cpp only initializes indices < STAGE_AXES). Firmware version
bumped 1.1 -> 1.2.

Python side: SquidFilterWheel._move_to_position snapshots
microcontroller.w_pos before each move and verifies the broadcast
position changed by the commanded delta (within 50% tolerance) after
wait_till_operation_is_completed returns. On mismatch it raises
TimeoutError to trigger the existing re-home + retry path. Skipped
when the wheel isn't on W (motor_slot != 3) or the firmware doesn't
support the broadcast.

W2 verification is intentionally deferred — no W2 silent-failure
incident on file. Adding W2 to the broadcast would require either a
new SET_FILTER_WHEEL_SLOT_COUNT command (to compute slot indices in
firmware) or extending the packet, and the right encoding depends on
what failure mode we'd be catching. Tracked as follow-up.

Tests: pytest tests/squid/test_filter_wheel.py and
tests/control/test_microcontroller.py all pass; full suite (1206
tests) green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a class of silent filter-wheel desync where the firmware acked a MOVE_W as completed even though the motor never moved (e.g. tmc4361A_moveTo returned non-zero, or enable_filterwheel == false so the early-return fired). The host had no encoder feedback on W and trusted the ack, leaving every subsequent slot index off by one until the user manually re-homed. The fix bumps firmware to v1.2 to (a) report CMD_EXECUTION_ERROR from move callbacks that fail and (b) broadcast the W microstep position in status-packet bytes 19-20, then teaches the host to snapshot W before each filter-wheel move and verify the broadcast position changed by the commanded delta (within 50%). On mismatch it raises TimeoutError, falling into the existing re-home + retry path.

Changes:

  • Firmware: new mcu_cmd_execution_status byte; move callbacks set in_progress=true before tmc4361A_moveTo and call mark_move_failed() on failure; W axis position broadcast in bytes 19-20 (gated on enable_filterwheel); FIRMWARE_VERSION_MINOR 1→2.
  • Host: Microcontroller parses bytes 19-20 into self.w_pos when fw ≥ 1.2; new supports_w_pos_broadcast() helper; new RESPONSE_BYTE_W_POS_* and MIN_FW_VERSION_W_POS_BROADCAST constants.
  • Filter wheel controller: extracted _delta_to_usteps, added _verify_w_move and _move_and_verify helpers that snapshot W, command the move, wait, and verify; W2 still falls back to trust-the-ack.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
firmware/controller/src/constants.h Bumps FIRMWARE_VERSION_MINOR to 2 with version-history comment.
firmware/controller/src/globals.h / globals.cpp Declares/defines new mcu_cmd_execution_status byte global.
firmware/controller/src/serial_communication.cpp Resets execution status per command; broadcasts W position into bytes 19-20 (gated on enable_filterwheel); reports mcu_cmd_execution_status in byte 1 when idle.
firmware/controller/src/commands/stage_commands.cpp Adds mark_move_failed() helper; restructures every move/move-to callback to set in_progress before tmc4361A_moveTo and report failure on non-zero return or disabled wheel.
software/control/_def.py Adds RESPONSE_BYTE_W_POS_HI/LO and MIN_FW_VERSION_W_POS_BROADCAST = (1, 2) constants.
software/control/microcontroller.py Parses bytes 19-20 into self.w_pos when firmware ≥ v1.2; adds supports_w_pos_broadcast() helper.
software/squid/filter_wheel_controller/cephla.py Adds _delta_to_usteps, _verify_w_move, _move_and_verify; uses W broadcast to detect silent failures and trigger the existing re-home + retry path.
Comments suppressed due to low confidence (2)

software/control/microcontroller.py:1624

  • When the firmware now reports CMD_EXECUTION_ERROR in byte 1 (e.g. because tmc4361A_moveTo returned non-zero, or move_filterwheel was called with enable_filterwheel == false), the host's receive loop in microcontroller.py does not recognize that status: only COMPLETED_WITHOUT_ERRORS and CMD_CHECKSUM_ERROR are handled (see lines around 1561-1605). mcu_cmd_execution_in_progress therefore stays True until wait_till_operation_is_completed hits its 5 s timeout. The recovery still works (TimeoutError → re-home), but the host pays a full 5 s of latency for every silent-failure case the firmware now explicitly reports — defeating the point of the new status byte for the X/Y/Z move-only callbacks where there is no W-pos cross-check. Consider handling CMD_EXECUTION_ERROR explicitly so the host can fail fast (clear mcu_cmd_execution_in_progress and surface the error immediately).
                # W position broadcast added in firmware v1.2; older firmware
                # leaves bytes 19-20 as zero. Callers gate on
                # supports_w_pos_broadcast() before trusting w_pos.
                if self.firmware_version >= MIN_FW_VERSION_W_POS_BROADCAST:
                    self.w_pos = self._payload_to_int(msg[RESPONSE_BYTE_W_POS_HI : RESPONSE_BYTE_W_POS_LO + 1], 2)

software/squid/filter_wheel_controller/cephla.py:149

  • In _move_and_verify, w_pos_before and expected_usteps_delta are bound only inside the if can_verify_w_pos: branch and then read again later inside another if can_verify_w_pos: branch. This works today because the predicate is evaluated against the same local variable, but static analyzers (and human readers) flag it as "possibly unbound local". Initializing both names to None before the first branch (or restructuring as a single if can_verify_w_pos: that wraps both the snapshot, the move, the wait, and the verify) would make the data flow explicit and prevent accidental rearrangement from breaking it.
        if can_verify_w_pos:
            w_pos_before = self.microcontroller.w_pos
            expected_usteps_delta = self._delta_to_usteps(delta)
        self._move_wheel(wheel_id, delta)
        self.microcontroller.wait_till_operation_is_completed()
        if can_verify_w_pos:
            self._verify_w_move(wheel_id, w_pos_before, expected_usteps_delta)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +137 to +150
def _move_and_verify(self, wheel_id: int, delta: float, target_pos: int, can_verify_w_pos: bool) -> None:
"""Single move attempt: optionally snapshot W position, command the
move, wait for completion, verify the motor actually moved, update
the tracked position. Raises TimeoutError if either the wait or the
post-move verification fails.
"""
if can_verify_w_pos:
w_pos_before = self.microcontroller.w_pos
expected_usteps_delta = self._delta_to_usteps(delta)
self._move_wheel(wheel_id, delta)
self.microcontroller.wait_till_operation_is_completed()
if can_verify_w_pos:
self._verify_w_move(wheel_id, w_pos_before, expected_usteps_delta)
self._positions[wheel_id] = target_pos
Comment on lines +1620 to +1622
# W position broadcast added in firmware v1.2; older firmware
# leaves bytes 19-20 as zero. Callers gate on
# supports_w_pos_broadcast() before trusting w_pos.
Comment on lines +116 to +119
# If the motor moved less than half the requested microsteps we treat it
# as "didn't actually move" and trigger a re-home. Catches silent firmware
# failures, missed steps, and direction inversions.
_W_POS_TOLERANCE_FRAC = 0.5
Alpaca233 and others added 3 commits May 15, 2026 15:20
- Host: handle CMD_EXECUTION_ERROR explicitly in the receive loop via
  abort_current_command, so callers don't pay the full 5 s
  wait_till_operation_is_completed timeout for a failure the firmware
  has already reported.
- cephla.py: catch the resulting CommandAborted alongside TimeoutError
  in the re-home + retry path, via a new _RECOVERABLE_MOVE_ERRORS tuple.
- cephla.py: hoist _W_POS_TOLERANCE_FRAC next to _MOTOR_SLOT_TO_AXIS
  with the other class-level constants.
- cephla.py: restructure _move_and_verify so w_pos_before /
  expected_usteps_delta only exist in the branch that uses them
  (avoids "possibly unbound local" pattern).
- SimSerial: bumped to v1.2 and now packs the simulated W microstep
  position into bytes 19-20, so the verification path is exercisable
  in simulation. Five hardcoded "== (1, 1)" assertions in existing
  tests updated to read SimSerial's declared version.
- Added 4 new tests covering: verify passes on a real move, verify
  triggers re-home on a missed move, verify is skipped when firmware
  doesn't broadcast W, and verify is skipped for W2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Tolerance: replace 50% fractional tolerance with ±10 microsteps
  absolute. The firmware only clears mcu_cmd_execution_in_progress
  once tmc4361A_currentPosition == W_commanded_target_position
  (operations.cpp:515), so on stepper-only setups the observed delta
  is exact; ±10 ustep budget exists only to absorb PID/encoder
  settling jitter on closed-loop W setups. ±10 is ~0.6% of a slot,
  way too small to admit a missed-move false negative.
- Recovery ladder: try → software retry → re-home + retry. The most
  common failure (firmware ack glitch like the 5.9 ms incident)
  leaves the motor unmoved, so the wheel is still at the previously
  tracked position and a plain resend usually succeeds. Re-homing
  (which takes ~4 s) is now last resort, not first response.
- abort_current_command(recoverable=True): logs at WARNING instead of
  ERROR for failures the host is handling itself, so routine
  CMD_EXECUTION_ERROR recoveries don't fill operator logs with
  scary-looking "ABORTED" messages. Default unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Wire up recoverable=True at the CMD_EXECUTION_ERROR call site. The
  parameter was added in the previous commit but never actually used,
  so routine filter-wheel recoveries were still logging as ERROR.
- Unify abort_current_command message across log levels — same wording,
  level conveys severity. Keeps log-grep predictable.
- Move W_POS_TOLERANCE_USTEPS to control/_def.py next to the other
  per-axis tuning constants (HAS_ENCODER_W, ENABLE_PID_W, etc.).
- Trim _move_to_position docstring to just the rationale; the step-by-
  step list was paraphrasing the code below it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment on lines +175 to +186
try:
self._move_wheel(wheel_id, delta)
self.microcontroller.wait_till_operation_is_completed()
self._positions[wheel_id] = target_pos
except TimeoutError:
_log.warning(f"Filter wheel {wheel_id} movement timed out. " f"Re-homing to re-sync position tracking...")
# Re-home to re-synchronize position tracking
self._home_wheel(wheel_id)

# Retry the movement (position is now at min_index after homing)
current_pos = self._positions[wheel_id]
delta = (target_pos - current_pos) * step_size
try:
self._move_wheel(wheel_id, delta)
self.microcontroller.wait_till_operation_is_completed()
self._positions[wheel_id] = target_pos
_log.info(f"Filter wheel {wheel_id} recovery successful, now at position {target_pos}")
except TimeoutError:
_log.error(
f"Filter wheel {wheel_id} movement failed even after re-home. " f"Hardware may need attention."
)
raise
self._move_and_verify(wheel_id, delta, target_pos)
return
except self._RECOVERABLE_MOVE_ERRORS as e:
_log.warning(f"Filter wheel {wheel_id} movement failed ({e}); retrying without re-home...")

try:
self._move_and_verify(wheel_id, delta, target_pos)
_log.info(f"Filter wheel {wheel_id} software retry succeeded, now at position {target_pos}")
return
except self._RECOVERABLE_MOVE_ERRORS as e:
_log.warning(f"Filter wheel {wheel_id} software retry failed ({e}); re-homing to re-sync...")
2. Re-home the wheel to re-synchronize position tracking
3. Retry the movement to the target position
4. Raise an exception if retry also fails
Allows ±_W_POS_TOLERANCE_USTEPS of jitter; anything larger means the
Comment on lines +92 to +102
if (enable_filterwheel)
{
int16_t W_pos_int16 = (int16_t) tmc4361A_currentPosition(&tmc4361[w]);
buffer_tx[19] = byte((W_pos_int16 >> 8) & 0xFF);
buffer_tx[20] = byte(W_pos_int16 & 0xFF);
}
else
{
buffer_tx[19] = 0;
buffer_tx[20] = 0;
}
Previous recovery ladder did a blind software resend on ANY recoverable
failure. Copilot review correctly flagged that a partial motor stall
defeats this: 1st attempt moves +800/+1600, verification raises, resend
commands another +1600, motor lands at +2400 (one slot past target),
verification on the resend sees delta=1600 ✓ and updates tracked
position — silent desync, exactly the failure mode this PR set out to
eliminate.

Fix: split recovery by failure type. CommandAborted from CMD_EXECUTION_ERROR
means tmc4361A_moveTo returned non-zero BEFORE motor motion started, so
the wheel is provably still at the tracked position and a bare resend is
safe. TimeoutError (ack timeout or verification mismatch) means the motor
state is uncertain — go straight to re-home, no resend.

Also:
- Saturate W broadcast to INT16 bounds instead of wrapping (operator can
  see saturation in the host log if absolute position ever drifts past
  ±32k, vs. a silent wrap that would mis-verify).
- Fix docstring reference: _W_POS_TOLERANCE_USTEPS -> W_POS_TOLERANCE_USTEPS.
- Add two new tests: CommandAborted triggers software resend (no rehome),
  TimeoutError skips resend (rehomes immediately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants