Skip to content

Clear stuck motion state for IGNORE-position devices and skip heartbeat polling during travel#688

Open
mnaggatz wants to merge 8 commits into
Julius2342:masterfrom
mnaggatz:fix/clear-stuck-motion-state
Open

Clear stuck motion state for IGNORE-position devices and skip heartbeat polling during travel#688
mnaggatz wants to merge 8 commits into
Julius2342:masterfrom
mnaggatz:fix/clear-stuck-motion-state

Conversation

@mnaggatz
Copy link
Copy Markdown
Contributor

@mnaggatz mnaggatz commented May 23, 2026

Background

After PR #665 (preserve movement state during executing frames with unavailable
positions), is_opening / is_closing can get stuck on True for actuators
whose position frames stay at IGNORE for most of their travel — typically
gates and some garage doors. The session ends only via
FrameCommandRunStatusNotification (CRSN), which the NodeUpdater previously
did not consider as a motion terminator. Downstream consumers (Home Assistant
cover entities) then remained in opening/closing until the next polled
status sweep, sometimes for a full heartbeat period.

A second, independent issue: the heartbeat sweep calls StatusRequest on
every node — including nodes currently executing a command. This can make the
gateway report the active command as completed prematurely (occasionally with
status_reply = COMMAND_OVERRULED) and on some actuators it can interrupt the
motion. Downstream consumers cannot tell that completion apart from a genuine
end-of-travel completion.

What this PR changes

Four small, complementary fixes:

  1. node_updater._update_opening_device_status — escape from the preserve
    branch when the cached position has reached the open/closed extreme matching
    the target. Defensive cleanup so a repeated EXECUTING frame at the extreme
    cannot trap motion state.
  2. node_updater._process_command_run_status_notification — treat
    run_status ∈ {EXECUTION_COMPLETED, EXECUTION_FAILED} as the
    authoritative end-of-command marker for OpeningDevices. Clear
    is_opening/is_closing and reset state_received_at /
    estimated_completion.
  3. Position sync — only on a clean completion: when run_status = EXECUTION_COMPLETED and status_reply = COMMAND_COMPLETED_OK,
    additionally sync the cached position so consumers don't briefly see a
    stale pre-move value between motion-clear and the next house-monitoring
    frame. The sync source is the CRSN payload's main-parameter value
    (frame.parameter_value when node_parameter == NodeParameter.MP and
    the raw value passes Parameter.is_valid_int + is concrete), with
    node.target as fallback. Other status_reply values (e.g.
    COMMAND_OVERRULED) and EXECUTION_FAILED leave the position untouched,
    because the run was pre-empted or failed before reaching the target.
  4. heartbeat.pulse — skip StatusRequest for any OpeningDevice whose
    is_opening or is_closing is set. House Monitoring delivers in-flight
    position updates anyway, so the sweep adds no value during motion and only
    risks the side effects above.

Tests:

  • New tests covering motion-clear + position sync paths for COMPLETED_OK
    (from CRSN payload), COMPLETED_OK with non-concrete payload (fallback to
    node.target), COMPLETED_OK with an out-of-range payload (defensive
    fallback), OVERRULED (motion clears, position stays), and FAILED (motion
    clears, position stays).
  • New test that the heartbeat skips moving OpeningDevices.
  • Existing live-sequence tests
    (test_gate_opening_live_sequence_keeps_direction_until_done and the
    closing counterpart) updated to reflect the new contract: motion is
    cleared at CRSN COMPLETED_OK with a concrete MP payload (matching what
    live traces show real gateways send at clean session end), and the
    subsequent state=DONE frame confirms the final state.

make ci is green (498 tests; mypy / pylint / pydocstyle / flake8 / isort
clean).

Trace-based context

To validate the fix in the wild, I ran a single-pass test against my live
gateway, cycling every OpeningDevice through open → close → open (or
open → close for gate/garage). The traces and the throw-away driver script
are local artifacts and are not part of this PR; the conclusions are
summarised below.

What the traces show about the gateway's behaviour:

  • Initial state on connect is unreliable. After load_nodes() many
    OpeningDevices first show up with position=UNKNOWN target=UNKNOWN.
    Concrete values arrive later via House-Monitoring frames. Consumers
    must not act on the initial snapshot as if it were authoritative.

  • state = OperatingState.UNKNOWN does NOT imply unusable position.
    The gateway emits a steady stream of passive frames such as
    state=OperatingState.UNKNOWN current=0 % target=0 % for idle nodes.
    Position is typically concrete and correct. State and position have
    to be evaluated independently.

  • Gates and garage doors differ structurally. The garage door emits
    concrete intermediate positions during travel (17 %, 3 %, 0 %). The
    gate emits current_position = IGNORE for almost the entire travel;
    only target stays concrete. The cached-position / cached-target
    fallback from Preserve movement state during executing frames with unavailable positions #665 is the only way to derive direction during that
    IGNORE phase.

  • remaining_time is not a reliable countdown. On the gate
    remaining=3 stayed pinned at the same value across many frames; on
    the garage door several remaining=0 frames arrived before the
    actual session-end. Using remaining_time only as an "is in motion"
    hint, not as a timer, is the right call.

  • CRSN is the authoritative end-of-command marker.
    FrameCommandRunStatusNotification (run_status=EXECUTION_COMPLETED, status_reply=COMMAND_COMPLETED_OK) arrived reliably at clean session
    end, either just before or together with the final state=DONE
    position frame, and carried the final main-parameter value in its
    payload. For gate-style devices that stay at IGNORE until the end
    this is the only robust completion signal — which is why fix (2)
    ties motion-clear into the CRSN path, and fix (3) prefers the CRSN
    payload over the cached node.target.

  • The KLF200 is sensitive to long/aggressive polling. During a
    long marathon trace the first ~12 devices completed cleanly;
    afterwards the gateway dropped the TCP socket and refused new
    connections ([Errno 61] Connect call failed). This PR does not
    claim to fix that, but it does reduce avoidable StatusRequest
    pressure against devices that are already busy.

Caveats on the trace evidence

  • The trace driver uses PyVLX(..., heartbeat_interval=3600), so the
    successful gate cycles (≈ 39 s / 47 s) ran with the regular 30 s
    heartbeat effectively suppressed. The traces therefore confirm the
    conditions under which mid-motion polling is risky (long travel,
    long IGNORE phase, CRSN as only end marker), but they are not a
    direct A/B demonstration of fix (4).
  • The garage door CRSN test path is visible in the traces: at
    EXECUTION_COMPLETED the cached position and motion flags update in
    the same tick, and the following state=DONE frame confirms the
    same final state.

Why a 0.2.35 release matters

The currently released pyvlx 0.2.33 does not have this regression
yet — it was introduced together with PR #665 ("Preserve movement
state during executing frames with unavailable positions") and is in
pyvlx 0.2.34. Home Assistant's stable velux integration still
pins pyvlx == 0.2.33; the Home Assistant dev branch already moved
its pin to pyvlx == 0.2.34. The next Home Assistant stable release
cut from dev will therefore ship the regression to the broader user
base unless a fixed pyvlx release replaces the pin first.

User-visible symptoms on IGNORE-position actuators (gates, certain
garage doors) with pyvlx 0.2.34:

  • Cover entities stay in opening / closing after the actuator has
    physically reached its end position, until the next heartbeat sweep
    (~30 s) or a fresh user command resets the state. Automations keyed
    on is_closed therefore see a wrong terminal state for that
    window.
  • Heartbeat-driven StatusRequest calls during an active motion can
    prematurely complete the running command — on some actuators the
    gateway additionally interrupts the motion mid-travel.

The device remains controllable in all these cases (a new command
resets the state via the usual paths), but the displayed state and
any state-driven automation logic are wrong in the meantime. Cutting
a 0.2.35 with this fix before Home Assistant freezes its next
release would avoid the regression hitting users at all.

Test plan

  • make ci green on this branch.
  • Live-tested on a private installation: gate (long travel, IGNORE),
    garage door (mixed concrete + IGNORE), Velux roof window
    (concrete throughout), and several roller shutters. All went
    through opening → open / closing → closed cleanly without
    sticking or briefly flipping to the wrong end-state.
  • Live-cycled every OpeningDevice via the trace driver in one pass
    (see traces above) — same outcome.

mnaggatz and others added 5 commits May 22, 2026 23:42
The preserve branch added in Julius2342#665 kept is_opening/is_closing set whenever
a frame still signalled motion (EXECUTING or remaining_time > 0) and the
direction could not be derived from the current frame. On garage doors
this trapped is_closing = True after the limit switch tripped: the KLF200
emits a final EXECUTING frame with stale remaining_time > 0 even though
the device is physically at the closed extreme, so downstream consumers
(Home Assistant) kept reporting the cover as "closing" forever.

Exit the preserve branch when the cached position already equals the
target and that target is an open/closed extreme. The else branch then
clears the motion flags as it did before Julius2342#665 for the end-of-travel
case, while the intermediate "passing through" preserve scenarios
covered by the existing tests stay untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gates and garage doors observed in the wild keep current_position at
IGNORE for the entire travel: every FrameNodeStatePositionChangedNotification
during the motion lacks a concrete position, so the cached node.position
never advances toward the target. The session ends with a
FrameCommandRunStatusNotification carrying
run_status = EXECUTION_COMPLETED (often paired with
status_reply = COMMAND_OVERRULED when the device hit its limit switch)
and frequently a final FrameStatusRequestNotification whose MP is still
IGNORE. Neither path used to clear is_opening/is_closing, leaving the
Home Assistant cover entity stuck in opening/closing until the next
heartbeat sweep happened to fetch a concrete position.

Treat the command-run-status COMPLETED/FAILED notification as the
authoritative end-of-command marker and clear motion there. This also
covers the garage-door cases the previous at-extreme escape in the
preserve branch cannot reach when the cached position never updated.

The two existing live-sequence tests asserted the previous behavior
(motion preserved through EXECUTION_COMPLETED, cleared only on the
final DONE position frame); they were encoding the very bug this fix
addresses and now expect motion to clear on EXECUTION_COMPLETED.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LETED

Without this follow-up, IGNORE-position devices (gates, garage doors)
that get their motion cleared by the previous commit briefly expose the
stale pre-move position to HA. The cover entity then computes the wrong
state for a heartbeat cycle: opening 100%->0% becomes "closed" before
flipping to "open" once the next sweep delivers a concrete 0%; closing
0%->100% becomes "open" before flipping to "closed".

On EXECUTION_COMPLETED the KLF200 confirms the command reached its end
(typically with COMMAND_OVERRULED at the physical limit switch), so it
is safe to snap the cached position to node.target. EXECUTION_FAILED
leaves the position untouched because the device did not actually reach
the target — the next status sweep provides the real position.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Polling a node mid-travel via StatusRequest makes the KLF200 report the
active ActivateScene/SetPosition command as run_status =
EXECUTION_COMPLETED with status_reply = COMMAND_OVERRULED. On some
actuators (gates, garage doors) the gateway actually halts the motion
when this happens.

Two regressions follow from that for downstream consumers:

  - the motion-state clear added in the previous commit fires on a
    spurious "completed" frame, so HA briefly flips to closed/open while
    the device is still travelling;

  - the position sync to node.target overrides the cached pre-move
    position with the wrong end of the travel.

House Monitoring already pushes in-flight position updates while a
command is running, so the heartbeat sweep adds no value during motion.
Skip OpeningDevice nodes whose is_opening or is_closing is set; once
the move ends and the (real) EXECUTION_COMPLETED clears the motion
flags the next heartbeat picks them up again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier comments framed COMMAND_OVERRULED as the KLF200's physical
"limit-switch completion proof". Live traces of clean gate and garage
runs never produced COMMAND_OVERRULED — only EXECUTION_COMPLETED with
COMMAND_COMPLETED_OK. The fix logic does not actually depend on the
status_reply: EXECUTION_COMPLETED is treated as the authoritative
end-of-command marker regardless of the reply value. Adjust the
comments and one test docstring to match what the code actually does
without overclaiming gateway behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 23, 2026

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  src/pyvlx
  heartbeat.py
  node_updater.py
Project Total  

This report was generated by python-coverage-comment-action

mnaggatz and others added 3 commits May 24, 2026 00:00
Address review feedback on PR Julius2342#688:

1. Only sync the cached position when run_status = EXECUTION_COMPLETED
   AND status_reply = COMMAND_COMPLETED_OK. Other status_reply values
   (e.g. COMMAND_OVERRULED) indicate that the run was pre-empted before
   reaching node.target, so the cached position must be left for the
   follow-up state frame to update. COMMAND_OVERRULED still clears the
   motion flags but no longer snaps position to target.

2. Use the CRSN payload's main-parameter (NodeParameter.MP) value as the
   primary sync source. The payload carries the gateway's authoritative
   final position for the run and is fresher than node.target, which is
   a cache of the previous state frame. Fall back to node.target only
   when the payload does not carry a concrete MP value.

Tests reflect the new contract: the OVERRULED test now asserts that
position stays at the pre-CRSN cached value; the COMPLETED_OK test now
sets node_parameter / parameter_value on the frame and verifies the
position is synced from the payload; a new fallback test covers the
case where the payload is UNKNOWN and node.target is used instead. The
two live-sequence tests now use COMMAND_COMPLETED_OK with a concrete
MP payload (which matches what live traces show real gateways send at
clean session end) so they exercise the new sync path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address review feedback on PR Julius2342#688: the previous version called
Position(position=frame.parameter_value) before falling back to
node.target. If the gateway ever emits a raw value outside the set
accepted by Parameter.is_valid_int (concrete 0..MAX or one of the
known special markers), the Position constructor would raise
PyVLXException and surface out of the frame handler instead of
falling through to the fallback path.

Guard the construction with Parameter.is_valid_int. Recognised
special markers (UNKNOWN_VALUE, IGNORE, CURRENT, TARGET,
DUAL_SHUTTER_CURTAINS) still pass validation but are filtered out by
_is_concrete_position; truly invalid values now skip straight to the
node.target fallback. A new test covers the invalid-payload path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codifies the live KLF200 sequence observed when a STOP is issued while
an OPEN/CLOSE is in flight on an IGNORE-position actuator (gate):

  1. The original command's CRSN arrives as EXECUTION_COMPLETED +
     COMMAND_OVERRULED. The OVERRULED handler clears the motion flags
     without syncing position to target (run was pre-empted).
  2. The STOP command's own CRSN then arrives as EXECUTION_COMPLETED +
     COMMAND_COMPLETED_OK with parameter_value = IGNORE.

The COMPLETED_OK in step 2 must NOT trigger a position sync, because
the motion flags are already False from step 1 — the sync block is
gated on (node.is_opening or node.is_closing) precisely for this case.
Without the guard, the STOP's clean completion would retroactively
snap the cached position to the original OPEN/CLOSE target, even
though the device is physically stopped somewhere in between.

The test pins this invariant so a future refactor of the CRSN handler
cannot accidentally start syncing on the follow-up frame.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant