Skip to content

OCPBUGS-86500: Fix BC holdover timer on SLAVE→MASTER transition.#684

Open
vitus133 wants to merge 1 commit into
redhat-cne:mainfrom
vitus133:bc-holdover-clean-main
Open

OCPBUGS-86500: Fix BC holdover timer on SLAVE→MASTER transition.#684
vitus133 wants to merge 1 commit into
redhat-cne:mainfrom
vitus133:bc-holdover-clean-main

Conversation

@vitus133
Copy link
Copy Markdown
Member

@vitus133 vitus133 commented May 26, 2026

Ensure the holdover goroutine always starts after HOLDOVER entry, close orphaned timer channels on reset, and add lifecycle logging for cluster diagnosis.

@openshift-ci-robot
Copy link
Copy Markdown

@vitus133: This pull request references Jira Issue OCPBUGS-86500, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (hhassid@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Ensure the holdover goroutine always starts after HOLDOVER entry, close orphaned timer channels on reset, and add lifecycle logging for cluster diagnosis.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from josephdrichard and jzding May 26, 2026 10:17
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vitus133

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Review Change Stack

Caution

Review failed

Failed to post review comments

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • HOLDOVER timing mechanism for master clock state management.
    • Profile process-options retrieval and on-demand profile reload for config-driven behavior.
  • Bug Fixes

    • Improved logging around holdover/lock transitions and timer cancellation.
    • Safer timer/channel handling and coalesced, non-blocking config update signaling.
    • More consistent concurrency protection and corrected metric emission gating.
  • Tests

    • Expanded tests for holdover/freerun flows, timers, parsing, and config-cache plumbing.

Walkthrough

Adds non-blocking config watcher notifications and buffered UpdateCh; introduces EnsureProcessOptions and lookup helpers for lazy PtpProcessOpts population; hardens threshold lifecycle with SafeClose usage; refactors holdover timer startup/expiry into startHoldoverTimer/handleHoldOverState; and expands tests covering holdover transitions, cancellation, recovery, and options loading.

Changes

PTP Holdover Timer & Config Handling

Layer / File(s) Summary
SafeClose channel closure logging
plugins/ptp_operator/config/config.go
PtpClockThreshold.SafeClose() now logs whether the close channel was already closed or is being cancelled, preserving double-close prevention.
Non-blocking config notify & buffered UpdateCh
plugins/ptp_operator/config/config.go
NewLinuxPTPConfUpdate() makes UpdateCh a buffered chan bool (1); added notifyUpdate() and rewired update paths to call it, coalescing notifications via non-blocking sends.
Locking consistency for thresholds
plugins/ptp_operator/config/config.go
Standardized lock use (Lock/Unlock, RLock/RUnlock) around EventThreshold mutations/reads.
EnsureProcessOptions & lookup helpers
plugins/ptp_operator/config/config.go
Added EnsureProcessOptions(nodeName string), LookupPtpProcessOpts(profileName), and LookupOrEnsurePtpProcessOpts(nodeName, profileName) to lazily populate and retrieve per-profile PtpProcessOpts under lock with filesystem I/O outside the lock.
Threshold reset & GetPtpProcessOpts
plugins/ptp_operator/metrics/manager.go
PtpThreshold now uses LookupEventThreshold; when resetting, previous Close channels are SafeClose()d before replacement; added GetPtpProcessOpts(profileName, configName string) with resolveProfileName; debug log added before SafeClose in GenPTPEvent.
Holdover timer startup integration
plugins/ptp_operator/metrics/ptp4lParse.go
ParsePTP4l fetches PtpProcessOpts and delegates HOLDOVER timer startup to startHoldoverTimer only when master process matches expected source; logs/warns and skips when mismatched or opts missing.
handleHoldOverState expiry/cancel behavior
plugins/ptp_operator/metrics/ptp4lParse.go
Cancellation logs a structured message and returns; on expiry validates config stats/master/alias/current state, transitions to FREERUN, publishes PTP state-change, updates sync-state metrics, and triggers OS-clock sync change using stored ptpOpts.
Holdover & options tests
plugins/ptp_operator/metrics/ptp4lParse_test.go
Adds tests for holdover expiry→FREERUN, fault expiry, cancellation, non-HOLDOVER no-op, threshold reset channel close, SLAVE→MASTER recovery, clock-class handling, GetPtpProcessOpts cache population, and startHoldoverTimer recovery with nil opts.
T-BC test formatting updates
plugins/ptp_operator/metrics/tbc_test.go
Tests import fmt and use fmt.Sprintf with configName to construct ptp4l/T-BC log lines dynamically.
Metrics minor change
plugins/ptp_operator/metrics/metrics.go
processDownEvent now uses PtpConfigMapUpdates.LookupPtpProcessOpts(profileName) (nil-checked) instead of direct map access when deciding ClockRealTime emission.

Sequence Diagram

sequenceDiagram
  participant ParsePTP4l
  participant startHoldoverTimer
  participant handleHoldOverState
  participant PTPEventManager
  ParsePTP4l->>startHoldoverTimer: fetch PtpProcessOpts (profile/config) & delegate timer
  startHoldoverTimer->>handleHoldOverState: spawn goroutine with timeout and stop channel
  handleHoldOverState->>handleHoldOverState: wait on Close channel or timeout
  alt Timer cancelled
    handleHoldOverState->>handleHoldOverState: log "holdover timer cancelled" and return
  else Timeout expires
    handleHoldOverState->>handleHoldOverState: validate config stats, master, alias, and HOLDOVER state
    handleHoldOverState->>PTPEventManager: transition to FREERUN & publish PTP state-change
    handleHoldOverState->>PTPEventManager: update sync-state metrics and publish OS-clock sync change using ptpOpts
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly references the bug fix (OCPBUGS-86500) and accurately describes the main change: fixing BC holdover timer behavior during SLAVE→MASTER transition.
Description check ✅ Passed The description is clearly related to the changeset, explaining three key objectives: ensuring holdover goroutine starts, closing orphaned timer channels, and adding lifecycle logging.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot
Copy link
Copy Markdown

@vitus133: This pull request references Jira Issue OCPBUGS-86500, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (hhassid@redhat.com), skipping review request.

Details

In response to this:

Ensure the holdover goroutine always starts after HOLDOVER entry, close orphaned timer channels on reset, and add lifecycle logging for cluster diagnosis.

Summary by CodeRabbit

  • Chores

  • Enhanced debug logging for holdover timer operations, including channel closure and cancellation events

  • Added structured logging for holdover timer startup and state transition details

  • Tests

  • Added comprehensive test coverage for holdover timer expiry, cancellation, and PTP state recovery scenarios

Review Change Stack

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
plugins/ptp_operator/metrics/ptp4lParse_test.go (1)

40-40: ⚡ Quick win

Replace fixed sleeps with eventually-based assertions.

Fixed time.Sleep windows can make these tests flaky under slower CI conditions.

💡 Suggested refactor pattern
-	time.Sleep(1500 * time.Millisecond)
-	assert.Equal(t, ptp.FREERUN, ptpStats[master].LastSyncState())
+	assert.Eventually(t, func() bool {
+		return ptpStats[master].LastSyncState() == ptp.FREERUN
+	}, 2*time.Second, 25*time.Millisecond)

Also applies to: 68-68, 89-89, 109-109, 171-171

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ptp_operator/metrics/ptp4lParse_test.go` at line 40, Tests in
ptp4lParse_test.go use fixed time.Sleep(...) (e.g., time.Sleep(1500 *
time.Millisecond) and the other occurrences) which makes them flaky; replace
each fixed sleep with an eventually-style assertion that polls until the
expected condition is met (for example use testify/require.Eventually or a small
for-loop with time.After and time.Sleep intervals) and assert the actual
parse/metric state (the same condition the test expects after the sleep) within
the predicate. Update the specific locations where time.Sleep is used (the
instances in the file at the shown lines and the other occurrences) to call
require.Eventually(t, func() bool { /* check parsed output / metric state from
ptp4lParse or the related variable */ }, timeout, pollInterval) or equivalent,
so the test waits up to a timeout with short intervals rather than sleeping a
fixed duration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/ptp_operator/metrics/ptp4lParse_test.go`:
- Around line 30-31: Extract the repeated literals "ptp4l.0.config" and
"boundary" into shared test constants (e.g., const testConfigName =
"ptp4l.0.config" and const testProfileName = "boundary") at the top of the file
and replace every occurrence in test fixtures, output/fields literals, and calls
to handleHoldOverState and ParsePTP4l with those constants; additionally,
replace fixed time.Sleep(...) waits in the holdover transition tests with
polling assertions (use assert.Eventually or an equivalent polling loop) that
repeatedly check the expected state from handleHoldOverState/ParsePTP4l until a
timeout to avoid CI flakiness.

In `@plugins/ptp_operator/metrics/ptp4lParse.go`:
- Line 145: Remove the unused blank identifier in the map lookup: change the
assignment that reads "ptpOpts, _ :=
p.PtpConfigMapUpdates.PtpProcessOpts[profileName]" to a single-value assignment
so only ptpOpts is captured (e.g., "ptpOpts :=
p.PtpConfigMapUpdates.PtpProcessOpts[profileName]"); ensure you reference the
same variable name (ptpOpts) wherever it’s used later and do not introduce an
unused boolean.

---

Nitpick comments:
In `@plugins/ptp_operator/metrics/ptp4lParse_test.go`:
- Line 40: Tests in ptp4lParse_test.go use fixed time.Sleep(...) (e.g.,
time.Sleep(1500 * time.Millisecond) and the other occurrences) which makes them
flaky; replace each fixed sleep with an eventually-style assertion that polls
until the expected condition is met (for example use testify/require.Eventually
or a small for-loop with time.After and time.Sleep intervals) and assert the
actual parse/metric state (the same condition the test expects after the sleep)
within the predicate. Update the specific locations where time.Sleep is used
(the instances in the file at the shown lines and the other occurrences) to call
require.Eventually(t, func() bool { /* check parsed output / metric state from
ptp4lParse or the related variable */ }, timeout, pollInterval) or equivalent,
so the test waits up to a timeout with short intervals rather than sleeping a
fixed duration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Enterprise

Run ID: 1b5b2342-eb42-4fcb-9e81-62de565c3bb0

📥 Commits

Reviewing files that changed from the base of the PR and between 8d9f603 and ab0414e.

📒 Files selected for processing (4)
  • plugins/ptp_operator/config/config.go
  • plugins/ptp_operator/metrics/manager.go
  • plugins/ptp_operator/metrics/ptp4lParse.go
  • plugins/ptp_operator/metrics/ptp4lParse_test.go

Comment thread plugins/ptp_operator/metrics/ptp4lParse_test.go Outdated
Comment thread plugins/ptp_operator/metrics/ptp4lParse.go Outdated
@vitus133 vitus133 force-pushed the bc-holdover-clean-main branch from f703f11 to ff21fbb Compare May 27, 2026 07:53
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/ptp_operator/config/config.go`:
- Around line 391-397: The notifyUpdate() currently does a non-blocking send on
an unbuffered channel (LinuxPTPConfigMapUpdate.notifyUpdate -> UpdateCh) which
drops the signal if no receiver is waiting, allowing UpdateConfig to replace
NodeProfiles while EventThreshold/PtpSettings stay stale; fix by making UpdateCh
a buffered channel (buffer size 1) or otherwise ensuring the send is reliable
(e.g., create UpdateCh = make(chan bool, 1) where UpdateCh is initialized) so
notifyUpdate() will succeed even when no goroutine is immediately receiving;
keep the existing notifyUpdate() call-site and behavior but ensure UpdateCh is
created as a buffered channel to prevent lost updates.

In `@plugins/ptp_operator/metrics/manager.go`:
- Around line 549-569: GetPtpProcessOpts currently may call EnsureProcessOptions
which mutates PtpConfigMapUpdates (PtpProcessOpts, EventThreshold, PtpSettings)
while lookupPtpProcessOpts and other reads access those maps unsynchronized;
wrap the read-checks and the lazy-population call in the same
LinuxPTPConfigMapUpdate.lock (i.e., take the PtpConfigMapUpdates.lock before
calling lookupPtpProcessOpts and EnsureProcessOptions, and release after the
second lookup) so readers and the EnsureProcessOptions writer use the same
mutex, or alternatively create and use an immutable snapshot/copy-on-write of
PtpProcessOpts/EventThreshold/PtpSettings and read that snapshot here instead of
accessing the maps directly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9e3fdd75-6ac0-4d10-991a-e7726681597c

📥 Commits

Reviewing files that changed from the base of the PR and between ab0414e and ff21fbb.

📒 Files selected for processing (4)
  • plugins/ptp_operator/config/config.go
  • plugins/ptp_operator/metrics/manager.go
  • plugins/ptp_operator/metrics/ptp4lParse.go
  • plugins/ptp_operator/metrics/ptp4lParse_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • plugins/ptp_operator/metrics/ptp4lParse.go
  • plugins/ptp_operator/metrics/ptp4lParse_test.go

Comment thread plugins/ptp_operator/config/config.go
Comment thread plugins/ptp_operator/metrics/manager.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
plugins/ptp_operator/config/config.go (1)

479-529: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Unify locking for LinuxPTPConfigMapUpdate caches.

These helpers now use l.lock, but other paths still bypass it: plugins/ptp_operator/ptp_operator_plugin.go:214-217 writes PtpProcessOpts/EventThreshold/PtpSettings directly, while plugins/ptp_operator/metrics/manager.go:68-95 and 619-645 read the same state without l.lock. Under config reload plus event processing, this can still race into concurrent map read and map write and leave holdover state reading partially refreshed caches. Please make all reads/writes of these caches follow one synchronization strategy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ptp_operator/config/config.go` around lines 479 - 529, The code
currently mixes locked and unlocked access to the caches PtpProcessOpts,
EventThreshold and PtpSettings causing races; unify by guarding all reads and
writes with l.lock: update plugins/ptp_operator/ptp_operator_plugin.go (where
PtpProcessOpts, EventThreshold, PtpSettings are written around lines referenced)
to perform writes inside l.lock.Lock()/Unlock() (or call existing methods like
UpdatePTPProcessOptions/UpdatePTPThreshold/UpdatePTPSetting while holding the
lock), and update readers in plugins/ptp_operator/metrics/manager.go to use
l.lock.RLock()/RUnlock() (or the existing
LookupPtpProcessOpts/LookupOrEnsurePtpProcessOpts helpers) instead of accessing
maps directly; when replacing entire maps, build the new map off-lock if doing
heavy work but swap the pointer/assignment under l.lock to ensure atomic
replacement and avoid concurrent map read/write panics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@plugins/ptp_operator/config/config.go`:
- Around line 479-529: The code currently mixes locked and unlocked access to
the caches PtpProcessOpts, EventThreshold and PtpSettings causing races; unify
by guarding all reads and writes with l.lock: update
plugins/ptp_operator/ptp_operator_plugin.go (where PtpProcessOpts,
EventThreshold, PtpSettings are written around lines referenced) to perform
writes inside l.lock.Lock()/Unlock() (or call existing methods like
UpdatePTPProcessOptions/UpdatePTPThreshold/UpdatePTPSetting while holding the
lock), and update readers in plugins/ptp_operator/metrics/manager.go to use
l.lock.RLock()/RUnlock() (or the existing
LookupPtpProcessOpts/LookupOrEnsurePtpProcessOpts helpers) instead of accessing
maps directly; when replacing entire maps, build the new map off-lock if doing
heavy work but swap the pointer/assignment under l.lock to ensure atomic
replacement and avoid concurrent map read/write panics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: afe285c1-9c50-4169-9cd1-805457c757c0

📥 Commits

Reviewing files that changed from the base of the PR and between ff21fbb and 573bfe0.

📒 Files selected for processing (2)
  • plugins/ptp_operator/config/config.go
  • plugins/ptp_operator/metrics/manager.go

@vitus133 vitus133 force-pushed the bc-holdover-clean-main branch from 194c41b to 0b272fa Compare May 27, 2026 19:48
Currently holdover timer is not started if the corresponding interface
can't be associated with a PTP profile. When a TR is lost, the holdover
state is published as event and set in metrics, but the timer start is
skipped due to the unresolved ptpOpts.
1. Fix nil PtpProcessOpts preventing holdover timer recovery.
Populate process options after config file create and resolve profile
lookups on holdover entry so the timer always starts even when the
one-shot configmap watcher missed the node profile at startup.
2. Ensure the holdover goroutine always starts after HOLDOVER state is
published and set in metrics. Close orphaned timer channels on reset.
Add lifecycle logging for cluster diagnosis.

Co-authored-by: Cursor <cursoragent@cursor.com>

Signed-off-by: Vitaly Grinberg <vgrinber@redhat.com>
@vitus133 vitus133 force-pushed the bc-holdover-clean-main branch from 0b272fa to 460216c Compare May 28, 2026 07:22
@vitus133
Copy link
Copy Markdown
Member Author

/verified later @dpopsuev

@openshift-ci-robot
Copy link
Copy Markdown

@vitus133: This PR has been marked to be verified later by @dpopsuev.

Details

In response to this:

/verified later @dpopsuev

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants