Clear stuck motion state for IGNORE-position devices and skip heartbeat polling during travel#688
Open
mnaggatz wants to merge 8 commits into
Open
Clear stuck motion state for IGNORE-position devices and skip heartbeat polling during travel#688mnaggatz wants to merge 8 commits into
mnaggatz wants to merge 8 commits into
Conversation
The preserve branch added in Julius2342#665 kept is_opening/is_closing set whenever a frame still signalled motion (EXECUTING or remaining_time > 0) and the direction could not be derived from the current frame. On garage doors this trapped is_closing = True after the limit switch tripped: the KLF200 emits a final EXECUTING frame with stale remaining_time > 0 even though the device is physically at the closed extreme, so downstream consumers (Home Assistant) kept reporting the cover as "closing" forever. Exit the preserve branch when the cached position already equals the target and that target is an open/closed extreme. The else branch then clears the motion flags as it did before Julius2342#665 for the end-of-travel case, while the intermediate "passing through" preserve scenarios covered by the existing tests stay untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Gates and garage doors observed in the wild keep current_position at IGNORE for the entire travel: every FrameNodeStatePositionChangedNotification during the motion lacks a concrete position, so the cached node.position never advances toward the target. The session ends with a FrameCommandRunStatusNotification carrying run_status = EXECUTION_COMPLETED (often paired with status_reply = COMMAND_OVERRULED when the device hit its limit switch) and frequently a final FrameStatusRequestNotification whose MP is still IGNORE. Neither path used to clear is_opening/is_closing, leaving the Home Assistant cover entity stuck in opening/closing until the next heartbeat sweep happened to fetch a concrete position. Treat the command-run-status COMPLETED/FAILED notification as the authoritative end-of-command marker and clear motion there. This also covers the garage-door cases the previous at-extreme escape in the preserve branch cannot reach when the cached position never updated. The two existing live-sequence tests asserted the previous behavior (motion preserved through EXECUTION_COMPLETED, cleared only on the final DONE position frame); they were encoding the very bug this fix addresses and now expect motion to clear on EXECUTION_COMPLETED. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LETED Without this follow-up, IGNORE-position devices (gates, garage doors) that get their motion cleared by the previous commit briefly expose the stale pre-move position to HA. The cover entity then computes the wrong state for a heartbeat cycle: opening 100%->0% becomes "closed" before flipping to "open" once the next sweep delivers a concrete 0%; closing 0%->100% becomes "open" before flipping to "closed". On EXECUTION_COMPLETED the KLF200 confirms the command reached its end (typically with COMMAND_OVERRULED at the physical limit switch), so it is safe to snap the cached position to node.target. EXECUTION_FAILED leaves the position untouched because the device did not actually reach the target — the next status sweep provides the real position. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Polling a node mid-travel via StatusRequest makes the KLF200 report the
active ActivateScene/SetPosition command as run_status =
EXECUTION_COMPLETED with status_reply = COMMAND_OVERRULED. On some
actuators (gates, garage doors) the gateway actually halts the motion
when this happens.
Two regressions follow from that for downstream consumers:
- the motion-state clear added in the previous commit fires on a
spurious "completed" frame, so HA briefly flips to closed/open while
the device is still travelling;
- the position sync to node.target overrides the cached pre-move
position with the wrong end of the travel.
House Monitoring already pushes in-flight position updates while a
command is running, so the heartbeat sweep adds no value during motion.
Skip OpeningDevice nodes whose is_opening or is_closing is set; once
the move ends and the (real) EXECUTION_COMPLETED clears the motion
flags the next heartbeat picks them up again.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier comments framed COMMAND_OVERRULED as the KLF200's physical "limit-switch completion proof". Live traces of clean gate and garage runs never produced COMMAND_OVERRULED — only EXECUTION_COMPLETED with COMMAND_COMPLETED_OK. The fix logic does not actually depend on the status_reply: EXECUTION_COMPLETED is treated as the authoritative end-of-command marker regardless of the reply value. Adjust the comments and one test docstring to match what the code actually does without overclaiming gateway behaviour. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||||||||
Address review feedback on PR Julius2342#688: 1. Only sync the cached position when run_status = EXECUTION_COMPLETED AND status_reply = COMMAND_COMPLETED_OK. Other status_reply values (e.g. COMMAND_OVERRULED) indicate that the run was pre-empted before reaching node.target, so the cached position must be left for the follow-up state frame to update. COMMAND_OVERRULED still clears the motion flags but no longer snaps position to target. 2. Use the CRSN payload's main-parameter (NodeParameter.MP) value as the primary sync source. The payload carries the gateway's authoritative final position for the run and is fresher than node.target, which is a cache of the previous state frame. Fall back to node.target only when the payload does not carry a concrete MP value. Tests reflect the new contract: the OVERRULED test now asserts that position stays at the pre-CRSN cached value; the COMPLETED_OK test now sets node_parameter / parameter_value on the frame and verifies the position is synced from the payload; a new fallback test covers the case where the payload is UNKNOWN and node.target is used instead. The two live-sequence tests now use COMMAND_COMPLETED_OK with a concrete MP payload (which matches what live traces show real gateways send at clean session end) so they exercise the new sync path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address review feedback on PR Julius2342#688: the previous version called Position(position=frame.parameter_value) before falling back to node.target. If the gateway ever emits a raw value outside the set accepted by Parameter.is_valid_int (concrete 0..MAX or one of the known special markers), the Position constructor would raise PyVLXException and surface out of the frame handler instead of falling through to the fallback path. Guard the construction with Parameter.is_valid_int. Recognised special markers (UNKNOWN_VALUE, IGNORE, CURRENT, TARGET, DUAL_SHUTTER_CURTAINS) still pass validation but are filtered out by _is_concrete_position; truly invalid values now skip straight to the node.target fallback. A new test covers the invalid-payload path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codifies the live KLF200 sequence observed when a STOP is issued while
an OPEN/CLOSE is in flight on an IGNORE-position actuator (gate):
1. The original command's CRSN arrives as EXECUTION_COMPLETED +
COMMAND_OVERRULED. The OVERRULED handler clears the motion flags
without syncing position to target (run was pre-empted).
2. The STOP command's own CRSN then arrives as EXECUTION_COMPLETED +
COMMAND_COMPLETED_OK with parameter_value = IGNORE.
The COMPLETED_OK in step 2 must NOT trigger a position sync, because
the motion flags are already False from step 1 — the sync block is
gated on (node.is_opening or node.is_closing) precisely for this case.
Without the guard, the STOP's clean completion would retroactively
snap the cached position to the original OPEN/CLOSE target, even
though the device is physically stopped somewhere in between.
The test pins this invariant so a future refactor of the CRSN handler
cannot accidentally start syncing on the follow-up frame.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
After PR #665 (preserve movement state during executing frames with unavailable
positions),
is_opening/is_closingcan get stuck onTruefor actuatorswhose position frames stay at
IGNOREfor most of their travel — typicallygates and some garage doors. The session ends only via
FrameCommandRunStatusNotification(CRSN), which the NodeUpdater previouslydid not consider as a motion terminator. Downstream consumers (Home Assistant
cover entities) then remained in
opening/closinguntil the next polledstatus sweep, sometimes for a full heartbeat period.
A second, independent issue: the heartbeat sweep calls
StatusRequestonevery node — including nodes currently executing a command. This can make the
gateway report the active command as completed prematurely (occasionally with
status_reply = COMMAND_OVERRULED) and on some actuators it can interrupt themotion. Downstream consumers cannot tell that completion apart from a genuine
end-of-travel completion.
What this PR changes
Four small, complementary fixes:
node_updater._update_opening_device_status— escape from the preservebranch when the cached position has reached the open/closed extreme matching
the target. Defensive cleanup so a repeated EXECUTING frame at the extreme
cannot trap motion state.
node_updater._process_command_run_status_notification— treatrun_status∈ {EXECUTION_COMPLETED,EXECUTION_FAILED} as theauthoritative end-of-command marker for OpeningDevices. Clear
is_opening/is_closingand resetstate_received_at/estimated_completion.run_status = EXECUTION_COMPLETEDandstatus_reply = COMMAND_COMPLETED_OK,additionally sync the cached position so consumers don't briefly see a
stale pre-move value between motion-clear and the next house-monitoring
frame. The sync source is the CRSN payload's main-parameter value
(
frame.parameter_valuewhennode_parameter == NodeParameter.MPandthe raw value passes
Parameter.is_valid_int+ is concrete), withnode.targetas fallback. Otherstatus_replyvalues (e.g.COMMAND_OVERRULED) andEXECUTION_FAILEDleave the position untouched,because the run was pre-empted or failed before reaching the target.
heartbeat.pulse— skipStatusRequestfor anyOpeningDevicewhoseis_openingoris_closingis set. House Monitoring delivers in-flightposition updates anyway, so the sweep adds no value during motion and only
risks the side effects above.
Tests:
(from CRSN payload), COMPLETED_OK with non-concrete payload (fallback to
node.target), COMPLETED_OK with an out-of-range payload (defensivefallback), OVERRULED (motion clears, position stays), and FAILED (motion
clears, position stays).
(
test_gate_opening_live_sequence_keeps_direction_until_doneand theclosing counterpart) updated to reflect the new contract: motion is
cleared at CRSN COMPLETED_OK with a concrete MP payload (matching what
live traces show real gateways send at clean session end), and the
subsequent
state=DONEframe confirms the final state.make ciis green (498 tests; mypy / pylint / pydocstyle / flake8 / isortclean).
Trace-based context
To validate the fix in the wild, I ran a single-pass test against my live
gateway, cycling every OpeningDevice through
open → close → open(oropen → closefor gate/garage). The traces and the throw-away driver scriptare local artifacts and are not part of this PR; the conclusions are
summarised below.
What the traces show about the gateway's behaviour:
Initial state on connect is unreliable. After
load_nodes()manyOpeningDevices first show up with
position=UNKNOWN target=UNKNOWN.Concrete values arrive later via House-Monitoring frames. Consumers
must not act on the initial snapshot as if it were authoritative.
state = OperatingState.UNKNOWNdoes NOT imply unusable position.The gateway emits a steady stream of passive frames such as
state=OperatingState.UNKNOWN current=0 % target=0 %for idle nodes.Position is typically concrete and correct. State and position have
to be evaluated independently.
Gates and garage doors differ structurally. The garage door emits
concrete intermediate positions during travel (17 %, 3 %, 0 %). The
gate emits
current_position = IGNOREfor almost the entire travel;only
targetstays concrete. The cached-position / cached-targetfallback from Preserve movement state during executing frames with unavailable positions #665 is the only way to derive direction during that
IGNORE phase.
remaining_timeis not a reliable countdown. On the gateremaining=3stayed pinned at the same value across many frames; onthe garage door several
remaining=0frames arrived before theactual session-end. Using
remaining_timeonly as an "is in motion"hint, not as a timer, is the right call.
CRSN is the authoritative end-of-command marker.
FrameCommandRunStatusNotification (run_status=EXECUTION_COMPLETED, status_reply=COMMAND_COMPLETED_OK)arrived reliably at clean sessionend, either just before or together with the final
state=DONEposition frame, and carried the final main-parameter value in its
payload. For gate-style devices that stay at
IGNOREuntil the endthis is the only robust completion signal — which is why fix (2)
ties motion-clear into the CRSN path, and fix (3) prefers the CRSN
payload over the cached
node.target.The KLF200 is sensitive to long/aggressive polling. During a
long marathon trace the first ~12 devices completed cleanly;
afterwards the gateway dropped the TCP socket and refused new
connections (
[Errno 61] Connect call failed). This PR does notclaim to fix that, but it does reduce avoidable
StatusRequestpressure against devices that are already busy.
Caveats on the trace evidence
PyVLX(..., heartbeat_interval=3600), so thesuccessful gate cycles (≈ 39 s / 47 s) ran with the regular 30 s
heartbeat effectively suppressed. The traces therefore confirm the
conditions under which mid-motion polling is risky (long travel,
long IGNORE phase, CRSN as only end marker), but they are not a
direct A/B demonstration of fix (4).
EXECUTION_COMPLETEDthe cached position and motion flags update inthe same tick, and the following
state=DONEframe confirms thesame final state.
Why a 0.2.35 release matters
The currently released
pyvlx 0.2.33does not have this regressionyet — it was introduced together with PR #665 ("Preserve movement
state during executing frames with unavailable positions") and is in
pyvlx 0.2.34. Home Assistant's stableveluxintegration stillpins
pyvlx == 0.2.33; the Home Assistantdevbranch already movedits pin to
pyvlx == 0.2.34. The next Home Assistant stable releasecut from
devwill therefore ship the regression to the broader userbase unless a fixed
pyvlxrelease replaces the pin first.User-visible symptoms on IGNORE-position actuators (gates, certain
garage doors) with
pyvlx 0.2.34:opening/closingafter the actuator hasphysically reached its end position, until the next heartbeat sweep
(~30 s) or a fresh user command resets the state. Automations keyed
on
is_closedtherefore see a wrong terminal state for thatwindow.
StatusRequestcalls during an active motion canprematurely complete the running command — on some actuators the
gateway additionally interrupts the motion mid-travel.
The device remains controllable in all these cases (a new command
resets the state via the usual paths), but the displayed state and
any state-driven automation logic are wrong in the meantime. Cutting
a
0.2.35with this fix before Home Assistant freezes its nextrelease would avoid the regression hitting users at all.
Test plan
make cigreen on this branch.garage door (mixed concrete + IGNORE), Velux roof window
(concrete throughout), and several roller shutters. All went
through
opening → open/closing → closedcleanly withoutsticking or briefly flipping to the wrong end-state.
(see traces above) — same outcome.