Skip to content

OCPBUGS-85092: make TriggerLogs synchronous and handle CMD LIVE_START#683

Open
edcdavid wants to merge 2 commits into
redhat-cne:mainfrom
edcdavid:replay-fix
Open

OCPBUGS-85092: make TriggerLogs synchronous and handle CMD LIVE_START#683
edcdavid wants to merge 2 commits into
redhat-cne:mainfrom
edcdavid:replay-fix

Conversation

@edcdavid
Copy link
Copy Markdown
Contributor

@edcdavid edcdavid commented May 20, 2026

Summary

  • Make TriggerLogs() call synchronous in processMessages() (remove go func() wrapper)
  • Handle CMD LIVE_START marker in the scanner loop (log and skip, don't pass to ExtractMetrics)

Problem

The async TriggerLogs allowed live data to race with replay data on separate processMessages goroutines. With the daemon-side live gate (linuxptp-daemon PR), no data arrives on ptp4l connections until replay completes, making synchronous TriggerLogs safe.

Fix

TriggerLogs is now called synchronously before the scanner starts. The CMD LIVE_START marker provides a clear demarcation between replay and live data phases, enabling observability and correctness guarantees.

Dependencies

Test plan

  • go vet ./plugins/ptp_operator/... passes
  • e2e test with replay delay verifies clock recovery
  • CMD LIVE_START marker logged in CEP output after reconfiguration

@openshift-ci-robot
Copy link
Copy Markdown

@edcdavid: This pull request references Jira Issue OCPBUGS-85092, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Make TriggerLogs() call synchronous in processMessages() (remove go func() wrapper)
  • Handle CMD LIVE_START marker in the scanner loop (log and skip, don't pass to ExtractMetrics)

Problem

The async TriggerLogs allowed live data to race with replay data on separate processMessages goroutines. With the daemon-side live gate (linuxptp-daemon PR), no data arrives on ptp4l connections until replay completes, making synchronous TriggerLogs safe.

Fix

TriggerLogs is now called synchronously before the scanner starts. The CMD LIVE_START marker provides a clear demarcation between replay and live data phases, enabling observability and correctness guarantees.

Dependencies

Test plan

  • go vet ./plugins/ptp_operator/... passes
  • e2e test with replay delay verifies clock recovery
  • CMD LIVE_START marker logged in CEP output after reconfiguration

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: edcdavid
Once this PR has been reviewed and has the lgtm label, please assign jzding for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for live data stream handling with improved socket communication.
  • Bug Fixes

    • Enhanced error handling for log triggering to prevent connection issues.
  • Tests

    • Added comprehensive test coverage for new live data functionality and edge cases.
  • Chores

    • Expanded debug logging throughout the metrics pipeline for improved observability and troubleshooting.

Walkthrough

Adds CMD LIVE_START, gates connection startup on aliasReady with extra debug logs, triggers eventManager.TriggerLogs() asynchronously on new connections, treats LIVE_START marker as a skip for metric extraction, and adds diagnostic logs across metrics/registry. Multiple unit tests and test helpers validate live-start and gating behavior.

Changes

PTP operator socket & metrics updates

Layer / File(s) Summary
Socket live-start handling and tests
plugins/ptp_operator/ptp_operator_plugin.go, plugins/ptp_operator/ptp_operator_plugin_test.go
Adds CMD LIVE_START, logs around aliasReady, calls eventManager.TriggerLogs() in a goroutine (when non-nil) during connection setup, updates the scanner to log and skip LIVE_START marker lines, and adds multiple processMessages tests and helpers including a mock logs server.
Metrics and registry diagnostic logs
plugins/ptp_operator/metrics/manager.go, plugins/ptp_operator/metrics/metrics.go, plugins/ptp_operator/metrics/registry.go
Insert additional debug/info logs in GenPTPEvent, TriggerLogs, ExtractMetrics, and UpdateSyncStateMetrics to record parsed fields, HTTP request status, offsets, and derived clock/sync state.

Sequence Diagram

sequenceDiagram
  participant Daemon
  participant PluginSocket
  participant EventManager
  participant Scanner

  Daemon->>PluginSocket: connect
  PluginSocket->>PluginSocket: wait aliasReady (debug)
  PluginSocket->>EventManager: TriggerLogs() (goroutine)
  EventManager-->>PluginSocket: emit replayed logs
  PluginSocket->>Scanner: scan incoming lines
  Scanner->>PluginSocket: "CMD LIVE_START" received
  PluginSocket->>Scanner: log marker, skip metrics extraction
  Scanner->>PluginSocket: continue processing live data
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main changes: making TriggerLogs synchronous and handling the CMD LIVE_START marker, which are the primary focus of the changeset.
Description check ✅ Passed The description comprehensively explains the problem, fix, dependencies, and test plan, all of which directly relate to the actual code changes in the pull request.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/ptp_operator/ptp_operator_plugin.go`:
- Around line 527-530: The code currently logs errors from
eventManager.TriggerLogs() and continues into the scan loop, allowing the
connection to proceed without the required replay; change the error path so that
after eventManager.TriggerLogs() returns an error you abort processing the
connection (e.g., log the error with log.Errorf and then close/terminate the
connection handler and return) instead of continuing into the scan loop,
ensuring no further scanning or consumption occurs when TriggerLogs() fails;
update the block around eventManager.TriggerLogs(), log.Warnf, and the
subsequent scan loop to explicitly stop the handler on error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Enterprise

Run ID: 71a600e7-6947-4270-bb37-e7ac0e776b94

📥 Commits

Reviewing files that changed from the base of the PR and between 2e0f1ae and c3ee442.

📒 Files selected for processing (1)
  • plugins/ptp_operator/ptp_operator_plugin.go

Comment thread plugins/ptp_operator/ptp_operator_plugin.go Outdated
@edcdavid edcdavid force-pushed the replay-fix branch 3 times, most recently from 4a78ab5 to e8eb108 Compare May 20, 2026 23:48
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
plugins/ptp_operator/ptp_operator_plugin_test.go (2)

509-529: 💤 Low value

Test correctness depends on port 8081 being free.

The negative-path test infers TriggerLogs() failure purely from the absence of a listener on :8081. If anything else is bound to that port on the test host (e.g. another package's leftover mock, a dev process), TriggerLogs() will silently succeed and this test will hang until the 10s timeout, hiding the actual failure mode.

Consider either pre-asserting the port is free (and t.Skip/fail with a clear message if not) or — preferably — making the logs endpoint injectable so the test can target a private listener it controls.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ptp_operator/ptp_operator_plugin_test.go` around lines 509 - 529, The
test TestProcessMessages_TriggerLogsFailureClosesConn relies on TriggerLogs
using hardcoded :8081 which makes the negative-path flaky; modify
processMessages/TriggerLogs to accept the logs endpoint (host:port or an
http.Client/URL) as an injectable dependency and update the test to create a
private listener (or a dummy URL) that you control so you can deterministically
force TriggerLogs to fail, or alternatively assert the port is free at test
start and t.Skip/fail with a clear message; update references to processMessages
and TriggerLogs to accept and propagate the injectable endpoint so the test no
longer depends on port 8081 being globally available.

416-429: 💤 Low value

Bind the test mock to loopback and drop redundant Addr.

srv.Addr is ignored when you call srv.Serve(ln) — the listener determines the bind — so it can be removed. Also, listening on ":8081" binds to all interfaces, unnecessarily exposing the test mock on dev machines/CI runners (matches the avoid-bind-to-all-interfaces-go static analysis hint). Loopback is sufficient here since TriggerLogs() connects locally.

♻️ Proposed change
-	srv := &http.Server{Addr: ":8081", Handler: mux}
-	ln, err := net.Listen("tcp", ":8081")
+	srv := &http.Server{Handler: mux}
+	ln, err := net.Listen("tcp", "127.0.0.1:8081")
 	if err != nil {
 		t.Fatalf("failed to listen on :8081 for mock logs server: %v", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ptp_operator/ptp_operator_plugin_test.go` around lines 416 - 429, The
mock server binds to all interfaces and sets srv.Addr needlessly; update
startMockLogsServer to listen on loopback only (use net.Listen("tcp",
"127.0.0.1:8081")) and remove the redundant srv.Addr field (create srv :=
&http.Server{Handler: mux}) since srv.Serve(ln) uses the listener for the bind;
keep the rest of the logic (ln, srv.Serve, and srv.Close) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@plugins/ptp_operator/ptp_operator_plugin_test.go`:
- Around line 509-529: The test TestProcessMessages_TriggerLogsFailureClosesConn
relies on TriggerLogs using hardcoded :8081 which makes the negative-path flaky;
modify processMessages/TriggerLogs to accept the logs endpoint (host:port or an
http.Client/URL) as an injectable dependency and update the test to create a
private listener (or a dummy URL) that you control so you can deterministically
force TriggerLogs to fail, or alternatively assert the port is free at test
start and t.Skip/fail with a clear message; update references to processMessages
and TriggerLogs to accept and propagate the injectable endpoint so the test no
longer depends on port 8081 being globally available.
- Around line 416-429: The mock server binds to all interfaces and sets srv.Addr
needlessly; update startMockLogsServer to listen on loopback only (use
net.Listen("tcp", "127.0.0.1:8081")) and remove the redundant srv.Addr field
(create srv := &http.Server{Handler: mux}) since srv.Serve(ln) uses the listener
for the bind; keep the rest of the logic (ln, srv.Serve, and srv.Close)
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Enterprise

Run ID: 04a95807-d0e2-4608-8c04-cddcf6d78abe

📥 Commits

Reviewing files that changed from the base of the PR and between c3ee442 and 1327022.

📒 Files selected for processing (2)
  • plugins/ptp_operator/ptp_operator_plugin.go
  • plugins/ptp_operator/ptp_operator_plugin_test.go

Recognizes the CMD LIVE_START control command sent by linuxptp-daemon
after the replay gate opens. The marker is silently consumed so it
does not reach ExtractMetrics.

TriggerLogs remains asynchronous (goroutine) to prevent a recursive
deadlock: the /emit-logs handler writes replay data through the
EventHandler's socket connection, which CEP accepts as a new
processMessages goroutine that also calls TriggerLogs — blocking the
reader while the handler writes back to the same connection. The
daemon's liveGate independently ensures replay ordering.

Includes unit tests covering CMD LIVE_START handling, TriggerLogs
failure resilience, aliasReady gating, and constant validation.
Temporary debug logs to trace the sequence of events during socket
reconnection and clock state metrics updates. Remove before merge.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/ptp_operator/metrics/metrics.go`:
- Line 124: The ExtractMetrics hot path is using log.Infof for per-message parse
details which will flood logs; change those per-message Infof calls in
ExtractMetrics (the lines logging processName, configName, status and the other
per-message parse log around line 228) to a debug-level log (e.g., log.Debugf)
or wrap them in a debug-enabled guard (e.g., if log.IsDebugEnabled() {
log.Debugf(...) }) so normal operation uses no- or low-overhead logging while
keeping the detailed messages available when debug is enabled.

In `@plugins/ptp_operator/ptp_operator_plugin.go`:
- Around line 533-540: The current async launch of TriggerLogs in
processMessages causes race between replay and live scanning; change it so
TriggerLogs is executed synchronously (or block until it finishes) before the
scanner starts: locate the processMessages function where
eventManager.TriggerLogs() is called (currently inside a go func) and remove the
goroutine, or replace it with explicit synchronization (e.g., a channel or
WaitGroup) that waits for TriggerLogs to complete and surface any error before
proceeding to start the scanner; ensure the log.Warnf/log.Info handling remains
but runs after TriggerLogs returns so replay completes before live consumption
begins.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 3db52ec2-e8a4-4e82-80e6-5f3458f5d31e

📥 Commits

Reviewing files that changed from the base of the PR and between 1327022 and 36aa1b8.

📒 Files selected for processing (5)
  • plugins/ptp_operator/metrics/manager.go
  • plugins/ptp_operator/metrics/metrics.go
  • plugins/ptp_operator/metrics/registry.go
  • plugins/ptp_operator/ptp_operator_plugin.go
  • plugins/ptp_operator/ptp_operator_plugin_test.go
✅ Files skipped from review due to trivial changes (2)
  • plugins/ptp_operator/metrics/manager.go
  • plugins/ptp_operator/metrics/registry.go

// Process status messages should always be processed regardless of profile configuration
if strings.Contains(output, ptpProcessStatusIdentifier) {
if status, err := p.parsePTPStatus(output, fields); err == nil {
log.Infof("DEBUG ExtractMetrics: process status detected: process=%s config=%s status=%d", processName, configName, status)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Infof in the hot parser path will flood logs and hurt throughput.

Line 124 and Line 228 log per-message parse details at info level inside ExtractMetrics, which is a high-frequency path. This can create heavy I/O and noisy operational logs under load.

Suggested fix
-			log.Infof("DEBUG ExtractMetrics: process status detected: process=%s config=%s status=%d", processName, configName, status)
+			log.Debugf("DEBUG ExtractMetrics: process status detected: process=%s config=%s status=%d", processName, configName, status)
@@
-			log.Infof("DEBUG ExtractMetrics: offset parsed: process=%s iface=%s offset=%.0f syncState=%s", processName, interfaceName, ptpOffset, syncState)
+			log.Debugf("DEBUG ExtractMetrics: offset parsed: process=%s iface=%s offset=%.0f syncState=%s", processName, interfaceName, ptpOffset, syncState)

Also applies to: 228-228

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ptp_operator/metrics/metrics.go` at line 124, The ExtractMetrics hot
path is using log.Infof for per-message parse details which will flood logs;
change those per-message Infof calls in ExtractMetrics (the lines logging
processName, configName, status and the other per-message parse log around line
228) to a debug-level log (e.g., log.Debugf) or wrap them in a debug-enabled
guard (e.g., if log.IsDebugEnabled() { log.Debugf(...) }) so normal operation
uses no- or low-overhead logging while keeping the detailed messages available
when debug is enabled.

Comment on lines +533 to 540
log.Info("DEBUG processMessages: firing async TriggerLogs")
go func() {
if err := eventManager.TriggerLogs(); err != nil {
log.Warnf("failed to trigger logs on new connection: %v", err)
} else {
log.Info("DEBUG processMessages: TriggerLogs completed successfully")
}
}()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replay/live ordering is still racy due to async TriggerLogs().

At Line 534, TriggerLogs() is launched in a goroutine and the scanner starts immediately, so live lines can be consumed before replay is triggered/completed. This defeats replay-before-live ordering guarantees.

Suggested fix
-	if eventManager != nil {
-		log.Info("DEBUG processMessages: firing async TriggerLogs")
-		go func() {
-			if err := eventManager.TriggerLogs(); err != nil {
-				log.Warnf("failed to trigger logs on new connection: %v", err)
-			} else {
-				log.Info("DEBUG processMessages: TriggerLogs completed successfully")
-			}
-		}()
-	}
+	if eventManager != nil {
+		log.Info("DEBUG processMessages: triggering logs before scan")
+		if err := eventManager.TriggerLogs(); err != nil {
+			log.Warnf("failed to trigger logs on new connection: %v; closing connection", err)
+			_ = c.Close()
+			return
+		}
+		log.Info("DEBUG processMessages: TriggerLogs completed successfully")
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
log.Info("DEBUG processMessages: firing async TriggerLogs")
go func() {
if err := eventManager.TriggerLogs(); err != nil {
log.Warnf("failed to trigger logs on new connection: %v", err)
} else {
log.Info("DEBUG processMessages: TriggerLogs completed successfully")
}
}()
if eventManager != nil {
log.Info("DEBUG processMessages: triggering logs before scan")
if err := eventManager.TriggerLogs(); err != nil {
log.Warnf("failed to trigger logs on new connection: %v; closing connection", err)
_ = c.Close()
return
}
log.Info("DEBUG processMessages: TriggerLogs completed successfully")
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/ptp_operator/ptp_operator_plugin.go` around lines 533 - 540, The
current async launch of TriggerLogs in processMessages causes race between
replay and live scanning; change it so TriggerLogs is executed synchronously (or
block until it finishes) before the scanner starts: locate the processMessages
function where eventManager.TriggerLogs() is called (currently inside a go func)
and remove the goroutine, or replace it with explicit synchronization (e.g., a
channel or WaitGroup) that waits for TriggerLogs to complete and surface any
error before proceeding to start the scanner; ensure the log.Warnf/log.Info
handling remains but runs after TriggerLogs returns so replay completes before
live consumption begins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants