fix: detect silent filter-wheel move failures#539
Conversation
Firmware: move callbacks (X/Y/Z/W/W2 + MOVETO variants) now set mcu_cmd_execution_in_progress=true BEFORE tmc4361A_moveTo and report CMD_EXECUTION_ERROR if the call fails. Previously, a moveTo failure left mcu_cmd_execution_in_progress=false, the next status packet broadcast COMPLETED_WITHOUT_ERRORS, and the host blindly trusted that the move had happened. With no encoder feedback on W/W2, this silently desynced the host's tracked filter slot from the physical wheel (observed in main_hcs.log.1 cmd 48 returning "complete" in 5.9 ms for a move that physically requires ~150 ms). Firmware also now broadcasts the W axis microstep position in status-packet bytes 19-20 (gated on enable_filterwheel to avoid reading from an uninitialized tmc4361[w] struct at boot, since init.cpp only initializes indices < STAGE_AXES). Firmware version bumped 1.1 -> 1.2. Python side: SquidFilterWheel._move_to_position snapshots microcontroller.w_pos before each move and verifies the broadcast position changed by the commanded delta (within 50% tolerance) after wait_till_operation_is_completed returns. On mismatch it raises TimeoutError to trigger the existing re-home + retry path. Skipped when the wheel isn't on W (motor_slot != 3) or the firmware doesn't support the broadcast. W2 verification is intentionally deferred — no W2 silent-failure incident on file. Adding W2 to the broadcast would require either a new SET_FILTER_WHEEL_SLOT_COUNT command (to compute slot indices in firmware) or extending the packet, and the right encoding depends on what failure mode we'd be catching. Tracked as follow-up. Tests: pytest tests/squid/test_filter_wheel.py and tests/control/test_microcontroller.py all pass; full suite (1206 tests) green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Fixes a class of silent filter-wheel desync where the firmware acked a MOVE_W as completed even though the motor never moved (e.g. tmc4361A_moveTo returned non-zero, or enable_filterwheel == false so the early-return fired). The host had no encoder feedback on W and trusted the ack, leaving every subsequent slot index off by one until the user manually re-homed. The fix bumps firmware to v1.2 to (a) report CMD_EXECUTION_ERROR from move callbacks that fail and (b) broadcast the W microstep position in status-packet bytes 19-20, then teaches the host to snapshot W before each filter-wheel move and verify the broadcast position changed by the commanded delta (within 50%). On mismatch it raises TimeoutError, falling into the existing re-home + retry path.
Changes:
- Firmware: new
mcu_cmd_execution_statusbyte; move callbacks setin_progress=truebeforetmc4361A_moveToand callmark_move_failed()on failure; W axis position broadcast in bytes 19-20 (gated onenable_filterwheel);FIRMWARE_VERSION_MINOR1→2. - Host:
Microcontrollerparses bytes 19-20 intoself.w_poswhen fw ≥ 1.2; newsupports_w_pos_broadcast()helper; newRESPONSE_BYTE_W_POS_*andMIN_FW_VERSION_W_POS_BROADCASTconstants. - Filter wheel controller: extracted
_delta_to_usteps, added_verify_w_moveand_move_and_verifyhelpers that snapshot W, command the move, wait, and verify; W2 still falls back to trust-the-ack.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
firmware/controller/src/constants.h |
Bumps FIRMWARE_VERSION_MINOR to 2 with version-history comment. |
firmware/controller/src/globals.h / globals.cpp |
Declares/defines new mcu_cmd_execution_status byte global. |
firmware/controller/src/serial_communication.cpp |
Resets execution status per command; broadcasts W position into bytes 19-20 (gated on enable_filterwheel); reports mcu_cmd_execution_status in byte 1 when idle. |
firmware/controller/src/commands/stage_commands.cpp |
Adds mark_move_failed() helper; restructures every move/move-to callback to set in_progress before tmc4361A_moveTo and report failure on non-zero return or disabled wheel. |
software/control/_def.py |
Adds RESPONSE_BYTE_W_POS_HI/LO and MIN_FW_VERSION_W_POS_BROADCAST = (1, 2) constants. |
software/control/microcontroller.py |
Parses bytes 19-20 into self.w_pos when firmware ≥ v1.2; adds supports_w_pos_broadcast() helper. |
software/squid/filter_wheel_controller/cephla.py |
Adds _delta_to_usteps, _verify_w_move, _move_and_verify; uses W broadcast to detect silent failures and trigger the existing re-home + retry path. |
Comments suppressed due to low confidence (2)
software/control/microcontroller.py:1624
- When the firmware now reports
CMD_EXECUTION_ERRORin byte 1 (e.g. becausetmc4361A_moveToreturned non-zero, ormove_filterwheelwas called withenable_filterwheel == false), the host's receive loop inmicrocontroller.pydoes not recognize that status: onlyCOMPLETED_WITHOUT_ERRORSandCMD_CHECKSUM_ERRORare handled (see lines around 1561-1605).mcu_cmd_execution_in_progresstherefore stays True untilwait_till_operation_is_completedhits its 5 s timeout. The recovery still works (TimeoutError → re-home), but the host pays a full 5 s of latency for every silent-failure case the firmware now explicitly reports — defeating the point of the new status byte for the X/Y/Z move-only callbacks where there is no W-pos cross-check. Consider handlingCMD_EXECUTION_ERRORexplicitly so the host can fail fast (clearmcu_cmd_execution_in_progressand surface the error immediately).
# W position broadcast added in firmware v1.2; older firmware
# leaves bytes 19-20 as zero. Callers gate on
# supports_w_pos_broadcast() before trusting w_pos.
if self.firmware_version >= MIN_FW_VERSION_W_POS_BROADCAST:
self.w_pos = self._payload_to_int(msg[RESPONSE_BYTE_W_POS_HI : RESPONSE_BYTE_W_POS_LO + 1], 2)
software/squid/filter_wheel_controller/cephla.py:149
- In
_move_and_verify,w_pos_beforeandexpected_usteps_deltaare bound only inside theif can_verify_w_pos:branch and then read again later inside anotherif can_verify_w_pos:branch. This works today because the predicate is evaluated against the same local variable, but static analyzers (and human readers) flag it as "possibly unbound local". Initializing both names toNonebefore the first branch (or restructuring as a singleif can_verify_w_pos:that wraps both the snapshot, the move, the wait, and the verify) would make the data flow explicit and prevent accidental rearrangement from breaking it.
if can_verify_w_pos:
w_pos_before = self.microcontroller.w_pos
expected_usteps_delta = self._delta_to_usteps(delta)
self._move_wheel(wheel_id, delta)
self.microcontroller.wait_till_operation_is_completed()
if can_verify_w_pos:
self._verify_w_move(wheel_id, w_pos_before, expected_usteps_delta)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _move_and_verify(self, wheel_id: int, delta: float, target_pos: int, can_verify_w_pos: bool) -> None: | ||
| """Single move attempt: optionally snapshot W position, command the | ||
| move, wait for completion, verify the motor actually moved, update | ||
| the tracked position. Raises TimeoutError if either the wait or the | ||
| post-move verification fails. | ||
| """ | ||
| if can_verify_w_pos: | ||
| w_pos_before = self.microcontroller.w_pos | ||
| expected_usteps_delta = self._delta_to_usteps(delta) | ||
| self._move_wheel(wheel_id, delta) | ||
| self.microcontroller.wait_till_operation_is_completed() | ||
| if can_verify_w_pos: | ||
| self._verify_w_move(wheel_id, w_pos_before, expected_usteps_delta) | ||
| self._positions[wheel_id] = target_pos |
| # W position broadcast added in firmware v1.2; older firmware | ||
| # leaves bytes 19-20 as zero. Callers gate on | ||
| # supports_w_pos_broadcast() before trusting w_pos. |
| # If the motor moved less than half the requested microsteps we treat it | ||
| # as "didn't actually move" and trigger a re-home. Catches silent firmware | ||
| # failures, missed steps, and direction inversions. | ||
| _W_POS_TOLERANCE_FRAC = 0.5 |
- Host: handle CMD_EXECUTION_ERROR explicitly in the receive loop via abort_current_command, so callers don't pay the full 5 s wait_till_operation_is_completed timeout for a failure the firmware has already reported. - cephla.py: catch the resulting CommandAborted alongside TimeoutError in the re-home + retry path, via a new _RECOVERABLE_MOVE_ERRORS tuple. - cephla.py: hoist _W_POS_TOLERANCE_FRAC next to _MOTOR_SLOT_TO_AXIS with the other class-level constants. - cephla.py: restructure _move_and_verify so w_pos_before / expected_usteps_delta only exist in the branch that uses them (avoids "possibly unbound local" pattern). - SimSerial: bumped to v1.2 and now packs the simulated W microstep position into bytes 19-20, so the verification path is exercisable in simulation. Five hardcoded "== (1, 1)" assertions in existing tests updated to read SimSerial's declared version. - Added 4 new tests covering: verify passes on a real move, verify triggers re-home on a missed move, verify is skipped when firmware doesn't broadcast W, and verify is skipped for W2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Tolerance: replace 50% fractional tolerance with ±10 microsteps absolute. The firmware only clears mcu_cmd_execution_in_progress once tmc4361A_currentPosition == W_commanded_target_position (operations.cpp:515), so on stepper-only setups the observed delta is exact; ±10 ustep budget exists only to absorb PID/encoder settling jitter on closed-loop W setups. ±10 is ~0.6% of a slot, way too small to admit a missed-move false negative. - Recovery ladder: try → software retry → re-home + retry. The most common failure (firmware ack glitch like the 5.9 ms incident) leaves the motor unmoved, so the wheel is still at the previously tracked position and a plain resend usually succeeds. Re-homing (which takes ~4 s) is now last resort, not first response. - abort_current_command(recoverable=True): logs at WARNING instead of ERROR for failures the host is handling itself, so routine CMD_EXECUTION_ERROR recoveries don't fill operator logs with scary-looking "ABORTED" messages. Default unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Wire up recoverable=True at the CMD_EXECUTION_ERROR call site. The parameter was added in the previous commit but never actually used, so routine filter-wheel recoveries were still logging as ERROR. - Unify abort_current_command message across log levels — same wording, level conveys severity. Keeps log-grep predictable. - Move W_POS_TOLERANCE_USTEPS to control/_def.py next to the other per-axis tuning constants (HAS_ENCODER_W, ENABLE_PID_W, etc.). - Trim _move_to_position docstring to just the rationale; the step-by- step list was paraphrasing the code below it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| try: | ||
| self._move_wheel(wheel_id, delta) | ||
| self.microcontroller.wait_till_operation_is_completed() | ||
| self._positions[wheel_id] = target_pos | ||
| except TimeoutError: | ||
| _log.warning(f"Filter wheel {wheel_id} movement timed out. " f"Re-homing to re-sync position tracking...") | ||
| # Re-home to re-synchronize position tracking | ||
| self._home_wheel(wheel_id) | ||
|
|
||
| # Retry the movement (position is now at min_index after homing) | ||
| current_pos = self._positions[wheel_id] | ||
| delta = (target_pos - current_pos) * step_size | ||
| try: | ||
| self._move_wheel(wheel_id, delta) | ||
| self.microcontroller.wait_till_operation_is_completed() | ||
| self._positions[wheel_id] = target_pos | ||
| _log.info(f"Filter wheel {wheel_id} recovery successful, now at position {target_pos}") | ||
| except TimeoutError: | ||
| _log.error( | ||
| f"Filter wheel {wheel_id} movement failed even after re-home. " f"Hardware may need attention." | ||
| ) | ||
| raise | ||
| self._move_and_verify(wheel_id, delta, target_pos) | ||
| return | ||
| except self._RECOVERABLE_MOVE_ERRORS as e: | ||
| _log.warning(f"Filter wheel {wheel_id} movement failed ({e}); retrying without re-home...") | ||
|
|
||
| try: | ||
| self._move_and_verify(wheel_id, delta, target_pos) | ||
| _log.info(f"Filter wheel {wheel_id} software retry succeeded, now at position {target_pos}") | ||
| return | ||
| except self._RECOVERABLE_MOVE_ERRORS as e: | ||
| _log.warning(f"Filter wheel {wheel_id} software retry failed ({e}); re-homing to re-sync...") |
| 2. Re-home the wheel to re-synchronize position tracking | ||
| 3. Retry the movement to the target position | ||
| 4. Raise an exception if retry also fails | ||
| Allows ±_W_POS_TOLERANCE_USTEPS of jitter; anything larger means the |
| if (enable_filterwheel) | ||
| { | ||
| int16_t W_pos_int16 = (int16_t) tmc4361A_currentPosition(&tmc4361[w]); | ||
| buffer_tx[19] = byte((W_pos_int16 >> 8) & 0xFF); | ||
| buffer_tx[20] = byte(W_pos_int16 & 0xFF); | ||
| } | ||
| else | ||
| { | ||
| buffer_tx[19] = 0; | ||
| buffer_tx[20] = 0; | ||
| } |
Previous recovery ladder did a blind software resend on ANY recoverable failure. Copilot review correctly flagged that a partial motor stall defeats this: 1st attempt moves +800/+1600, verification raises, resend commands another +1600, motor lands at +2400 (one slot past target), verification on the resend sees delta=1600 ✓ and updates tracked position — silent desync, exactly the failure mode this PR set out to eliminate. Fix: split recovery by failure type. CommandAborted from CMD_EXECUTION_ERROR means tmc4361A_moveTo returned non-zero BEFORE motor motion started, so the wheel is provably still at the tracked position and a bare resend is safe. TimeoutError (ack timeout or verification mismatch) means the motor state is uncertain — go straight to re-home, no resend. Also: - Saturate W broadcast to INT16 bounds instead of wrapping (operator can see saturation in the host log if absolute position ever drifts past ±32k, vs. a silent wrap that would mis-verify). - Fix docstring reference: _W_POS_TOLERANCE_USTEPS -> W_POS_TOLERANCE_USTEPS. - Add two new tests: CommandAborted triggers software resend (no rehome), TimeoutError skips resend (rehomes immediately). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds firmware + host detection for a class of silent filter-wheel desync where the MCU reports a move as completed but the motor never moved. Observed in production:
main_hcs.log.1shows aMOVE_Wfor slot 2→1 returning "complete" in 5.9 ms when the physical move requires ~150 ms; the host then trusted the ack and updatedtracked_position, leaving every subsequent filter switch off by one slot until the user manually re-homed.Root cause is in
stage_commands.cpp: each move callback only setmcu_cmd_execution_in_progress = trueinside theif (tmc4361A_moveTo(...) == 0)success branch. WhenmoveToreturned non-zero (or the earlyif (!enabled) return;fired), the flag stayed false, and the next periodic status broadcast reportedCOMPLETED_WITHOUT_ERRORSfor the newcmd_ideven though no motion happened. With no encoder feedback on W/W2 (HAS_ENCODER_W is False,cephla.py:114), the host had no way to notice.Same bug pattern exists in
MOVE_X / MOVE_Y / MOVE_Zand theMOVETO_*callbacks, but for those axes the firmware also broadcasts real position in every status packet (bytes 2-17), so absolute moves self-correct. The filter wheel is the one axis where this manifests as a persistent silent error.What changed
Firmware
mcu_cmd_execution_statusbyte global; reset toCOMPLETED_WITHOUT_ERRORSon every received command and overwritten toCMD_EXECUTION_ERRORwhen a move callback fails.send_position_updatereports it in byte 1 when the MCU is idle.MOVE_X/Y/Z/W/W2,MOVETO_X/Y/Z/W) restructured to setin_progress = truebeforetmc4361A_moveToand call amark_move_failed()helper on failure. Themove_filterwheelandcallback_move_to_wearly-returns whenenable_filterwheel == falsenow also mark failed instead of silently no-op'ing.enable_filterwheelbecausetmc4361[w]is uninitialized at boot (init.cpp:123only loops< STAGE_AXES); reading from it would deref a NULL config pointer.FIRMWARE_VERSION_MINORbumped 1 → 2.Host (Python)
Microcontrollerparses bytes 19-20 intoself.w_poswhen firmware >= v1.2. Newsupports_w_pos_broadcast()helper.SquidFilterWheel._move_to_positionsnapshotsmicrocontroller.w_posbefore each move and verifies the broadcast position changed by the commanded delta (within 50% tolerance) afterwait_till_operation_is_completedreturns. On mismatch it raisesTimeoutError→ falls into the existing re-home + retry path (cephla.py:138-159)._move_and_verifyhelper so the initial try and the retry-after-rehome branch share one code path.Failure-mode coverage
tmc4361A_moveToreturns non-zeroCMD_EXECUTION_ERROR, host times out at 5 s, re-homesenable_filterwheel == false(move arrives before INITFILTERWHEEL)Known follow-up: W2
The broadcast only carries W; byte 21 is still reserved. The
_verify_w_movepath is gated onmotor_slot_index == 3, so dual-wheel setups still get the old trust-the-ack behavior for the second wheel. Adding W2 broadcast needs a design decision on encoding (raw int16 — needs another byte and a packet bump, or 1-byte slot index — needs firmware to knowusteps_per_slot). Punted because no W2 silent-failure incident is on file.Test plan
pytest tests/squid/test_filter_wheel.py— 8 passedpytest tests/control/test_microcontroller.py— 6 passed, 1 skipped (existing skip)pytest --ignore=tests/control/test_HighContentScreeningGui.py— 1206 passed, 7 skipped, 1 xfailedblack --config pyproject.toml --checkclean on edited Python filesMOVE_Wwhileenable_filterwheel == false(boot without sendingINITFILTERWHEEL) and confirm the host re-homes instead of advancingtracked_position🤖 Generated with Claude Code