OCPBUGS-86500: Fix BC holdover timer on SLAVE→MASTER transition.#684
OCPBUGS-86500: Fix BC holdover timer on SLAVE→MASTER transition.#684vitus133 wants to merge 1 commit into
Conversation
|
@vitus133: This pull request references Jira Issue OCPBUGS-86500, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (hhassid@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vitus133 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Caution Review failedFailed to post review comments Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds non-blocking config watcher notifications and buffered UpdateCh; introduces EnsureProcessOptions and lookup helpers for lazy PtpProcessOpts population; hardens threshold lifecycle with SafeClose usage; refactors holdover timer startup/expiry into startHoldoverTimer/handleHoldOverState; and expands tests covering holdover transitions, cancellation, recovery, and options loading. ChangesPTP Holdover Timer & Config Handling
Sequence DiagramsequenceDiagram
participant ParsePTP4l
participant startHoldoverTimer
participant handleHoldOverState
participant PTPEventManager
ParsePTP4l->>startHoldoverTimer: fetch PtpProcessOpts (profile/config) & delegate timer
startHoldoverTimer->>handleHoldOverState: spawn goroutine with timeout and stop channel
handleHoldOverState->>handleHoldOverState: wait on Close channel or timeout
alt Timer cancelled
handleHoldOverState->>handleHoldOverState: log "holdover timer cancelled" and return
else Timeout expires
handleHoldOverState->>handleHoldOverState: validate config stats, master, alias, and HOLDOVER state
handleHoldOverState->>PTPEventManager: transition to FREERUN & publish PTP state-change
handleHoldOverState->>PTPEventManager: update sync-state metrics and publish OS-clock sync change using ptpOpts
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@vitus133: This pull request references Jira Issue OCPBUGS-86500, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (hhassid@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
plugins/ptp_operator/metrics/ptp4lParse_test.go (1)
40-40: ⚡ Quick winReplace fixed sleeps with eventually-based assertions.
Fixed
time.Sleepwindows can make these tests flaky under slower CI conditions.💡 Suggested refactor pattern
- time.Sleep(1500 * time.Millisecond) - assert.Equal(t, ptp.FREERUN, ptpStats[master].LastSyncState()) + assert.Eventually(t, func() bool { + return ptpStats[master].LastSyncState() == ptp.FREERUN + }, 2*time.Second, 25*time.Millisecond)Also applies to: 68-68, 89-89, 109-109, 171-171
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ptp_operator/metrics/ptp4lParse_test.go` at line 40, Tests in ptp4lParse_test.go use fixed time.Sleep(...) (e.g., time.Sleep(1500 * time.Millisecond) and the other occurrences) which makes them flaky; replace each fixed sleep with an eventually-style assertion that polls until the expected condition is met (for example use testify/require.Eventually or a small for-loop with time.After and time.Sleep intervals) and assert the actual parse/metric state (the same condition the test expects after the sleep) within the predicate. Update the specific locations where time.Sleep is used (the instances in the file at the shown lines and the other occurrences) to call require.Eventually(t, func() bool { /* check parsed output / metric state from ptp4lParse or the related variable */ }, timeout, pollInterval) or equivalent, so the test waits up to a timeout with short intervals rather than sleeping a fixed duration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/ptp_operator/metrics/ptp4lParse_test.go`:
- Around line 30-31: Extract the repeated literals "ptp4l.0.config" and
"boundary" into shared test constants (e.g., const testConfigName =
"ptp4l.0.config" and const testProfileName = "boundary") at the top of the file
and replace every occurrence in test fixtures, output/fields literals, and calls
to handleHoldOverState and ParsePTP4l with those constants; additionally,
replace fixed time.Sleep(...) waits in the holdover transition tests with
polling assertions (use assert.Eventually or an equivalent polling loop) that
repeatedly check the expected state from handleHoldOverState/ParsePTP4l until a
timeout to avoid CI flakiness.
In `@plugins/ptp_operator/metrics/ptp4lParse.go`:
- Line 145: Remove the unused blank identifier in the map lookup: change the
assignment that reads "ptpOpts, _ :=
p.PtpConfigMapUpdates.PtpProcessOpts[profileName]" to a single-value assignment
so only ptpOpts is captured (e.g., "ptpOpts :=
p.PtpConfigMapUpdates.PtpProcessOpts[profileName]"); ensure you reference the
same variable name (ptpOpts) wherever it’s used later and do not introduce an
unused boolean.
---
Nitpick comments:
In `@plugins/ptp_operator/metrics/ptp4lParse_test.go`:
- Line 40: Tests in ptp4lParse_test.go use fixed time.Sleep(...) (e.g.,
time.Sleep(1500 * time.Millisecond) and the other occurrences) which makes them
flaky; replace each fixed sleep with an eventually-style assertion that polls
until the expected condition is met (for example use testify/require.Eventually
or a small for-loop with time.After and time.Sleep intervals) and assert the
actual parse/metric state (the same condition the test expects after the sleep)
within the predicate. Update the specific locations where time.Sleep is used
(the instances in the file at the shown lines and the other occurrences) to call
require.Eventually(t, func() bool { /* check parsed output / metric state from
ptp4lParse or the related variable */ }, timeout, pollInterval) or equivalent,
so the test waits up to a timeout with short intervals rather than sleeping a
fixed duration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Enterprise
Run ID: 1b5b2342-eb42-4fcb-9e81-62de565c3bb0
📒 Files selected for processing (4)
plugins/ptp_operator/config/config.goplugins/ptp_operator/metrics/manager.goplugins/ptp_operator/metrics/ptp4lParse.goplugins/ptp_operator/metrics/ptp4lParse_test.go
f703f11 to
ff21fbb
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/ptp_operator/config/config.go`:
- Around line 391-397: The notifyUpdate() currently does a non-blocking send on
an unbuffered channel (LinuxPTPConfigMapUpdate.notifyUpdate -> UpdateCh) which
drops the signal if no receiver is waiting, allowing UpdateConfig to replace
NodeProfiles while EventThreshold/PtpSettings stay stale; fix by making UpdateCh
a buffered channel (buffer size 1) or otherwise ensuring the send is reliable
(e.g., create UpdateCh = make(chan bool, 1) where UpdateCh is initialized) so
notifyUpdate() will succeed even when no goroutine is immediately receiving;
keep the existing notifyUpdate() call-site and behavior but ensure UpdateCh is
created as a buffered channel to prevent lost updates.
In `@plugins/ptp_operator/metrics/manager.go`:
- Around line 549-569: GetPtpProcessOpts currently may call EnsureProcessOptions
which mutates PtpConfigMapUpdates (PtpProcessOpts, EventThreshold, PtpSettings)
while lookupPtpProcessOpts and other reads access those maps unsynchronized;
wrap the read-checks and the lazy-population call in the same
LinuxPTPConfigMapUpdate.lock (i.e., take the PtpConfigMapUpdates.lock before
calling lookupPtpProcessOpts and EnsureProcessOptions, and release after the
second lookup) so readers and the EnsureProcessOptions writer use the same
mutex, or alternatively create and use an immutable snapshot/copy-on-write of
PtpProcessOpts/EventThreshold/PtpSettings and read that snapshot here instead of
accessing the maps directly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 9e3fdd75-6ac0-4d10-991a-e7726681597c
📒 Files selected for processing (4)
plugins/ptp_operator/config/config.goplugins/ptp_operator/metrics/manager.goplugins/ptp_operator/metrics/ptp4lParse.goplugins/ptp_operator/metrics/ptp4lParse_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- plugins/ptp_operator/metrics/ptp4lParse.go
- plugins/ptp_operator/metrics/ptp4lParse_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (1)
plugins/ptp_operator/config/config.go (1)
479-529:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftUnify locking for
LinuxPTPConfigMapUpdatecaches.These helpers now use
l.lock, but other paths still bypass it:plugins/ptp_operator/ptp_operator_plugin.go:214-217writesPtpProcessOpts/EventThreshold/PtpSettingsdirectly, whileplugins/ptp_operator/metrics/manager.go:68-95and619-645read the same state withoutl.lock. Under config reload plus event processing, this can still race intoconcurrent map read and map writeand leave holdover state reading partially refreshed caches. Please make all reads/writes of these caches follow one synchronization strategy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ptp_operator/config/config.go` around lines 479 - 529, The code currently mixes locked and unlocked access to the caches PtpProcessOpts, EventThreshold and PtpSettings causing races; unify by guarding all reads and writes with l.lock: update plugins/ptp_operator/ptp_operator_plugin.go (where PtpProcessOpts, EventThreshold, PtpSettings are written around lines referenced) to perform writes inside l.lock.Lock()/Unlock() (or call existing methods like UpdatePTPProcessOptions/UpdatePTPThreshold/UpdatePTPSetting while holding the lock), and update readers in plugins/ptp_operator/metrics/manager.go to use l.lock.RLock()/RUnlock() (or the existing LookupPtpProcessOpts/LookupOrEnsurePtpProcessOpts helpers) instead of accessing maps directly; when replacing entire maps, build the new map off-lock if doing heavy work but swap the pointer/assignment under l.lock to ensure atomic replacement and avoid concurrent map read/write panics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@plugins/ptp_operator/config/config.go`:
- Around line 479-529: The code currently mixes locked and unlocked access to
the caches PtpProcessOpts, EventThreshold and PtpSettings causing races; unify
by guarding all reads and writes with l.lock: update
plugins/ptp_operator/ptp_operator_plugin.go (where PtpProcessOpts,
EventThreshold, PtpSettings are written around lines referenced) to perform
writes inside l.lock.Lock()/Unlock() (or call existing methods like
UpdatePTPProcessOptions/UpdatePTPThreshold/UpdatePTPSetting while holding the
lock), and update readers in plugins/ptp_operator/metrics/manager.go to use
l.lock.RLock()/RUnlock() (or the existing
LookupPtpProcessOpts/LookupOrEnsurePtpProcessOpts helpers) instead of accessing
maps directly; when replacing entire maps, build the new map off-lock if doing
heavy work but swap the pointer/assignment under l.lock to ensure atomic
replacement and avoid concurrent map read/write panics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: afe285c1-9c50-4169-9cd1-805457c757c0
📒 Files selected for processing (2)
plugins/ptp_operator/config/config.goplugins/ptp_operator/metrics/manager.go
194c41b to
0b272fa
Compare
Currently holdover timer is not started if the corresponding interface can't be associated with a PTP profile. When a TR is lost, the holdover state is published as event and set in metrics, but the timer start is skipped due to the unresolved ptpOpts. 1. Fix nil PtpProcessOpts preventing holdover timer recovery. Populate process options after config file create and resolve profile lookups on holdover entry so the timer always starts even when the one-shot configmap watcher missed the node profile at startup. 2. Ensure the holdover goroutine always starts after HOLDOVER state is published and set in metrics. Close orphaned timer channels on reset. Add lifecycle logging for cluster diagnosis. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Vitaly Grinberg <vgrinber@redhat.com>
0b272fa to
460216c
Compare
|
/verified later @dpopsuev |
|
@vitus133: This PR has been marked to be verified later by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Ensure the holdover goroutine always starts after HOLDOVER entry, close orphaned timer channels on reset, and add lifecycle logging for cluster diagnosis.