OCPBUGS-85092: make TriggerLogs synchronous and handle CMD LIVE_START#683
OCPBUGS-85092: make TriggerLogs synchronous and handle CMD LIVE_START#683edcdavid wants to merge 2 commits into
Conversation
|
@edcdavid: This pull request references Jira Issue OCPBUGS-85092, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: edcdavid The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughAdds ChangesPTP operator socket & metrics updates
Sequence DiagramsequenceDiagram
participant Daemon
participant PluginSocket
participant EventManager
participant Scanner
Daemon->>PluginSocket: connect
PluginSocket->>PluginSocket: wait aliasReady (debug)
PluginSocket->>EventManager: TriggerLogs() (goroutine)
EventManager-->>PluginSocket: emit replayed logs
PluginSocket->>Scanner: scan incoming lines
Scanner->>PluginSocket: "CMD LIVE_START" received
PluginSocket->>Scanner: log marker, skip metrics extraction
Scanner->>PluginSocket: continue processing live data
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/ptp_operator/ptp_operator_plugin.go`:
- Around line 527-530: The code currently logs errors from
eventManager.TriggerLogs() and continues into the scan loop, allowing the
connection to proceed without the required replay; change the error path so that
after eventManager.TriggerLogs() returns an error you abort processing the
connection (e.g., log the error with log.Errorf and then close/terminate the
connection handler and return) instead of continuing into the scan loop,
ensuring no further scanning or consumption occurs when TriggerLogs() fails;
update the block around eventManager.TriggerLogs(), log.Warnf, and the
subsequent scan loop to explicitly stop the handler on error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Enterprise
Run ID: 71a600e7-6947-4270-bb37-e7ac0e776b94
📒 Files selected for processing (1)
plugins/ptp_operator/ptp_operator_plugin.go
4a78ab5 to
e8eb108
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
plugins/ptp_operator/ptp_operator_plugin_test.go (2)
509-529: 💤 Low valueTest correctness depends on port 8081 being free.
The negative-path test infers
TriggerLogs()failure purely from the absence of a listener on:8081. If anything else is bound to that port on the test host (e.g. another package's leftover mock, a dev process),TriggerLogs()will silently succeed and this test will hang until the 10s timeout, hiding the actual failure mode.Consider either pre-asserting the port is free (and
t.Skip/fail with a clear message if not) or — preferably — making the logs endpoint injectable so the test can target a private listener it controls.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ptp_operator/ptp_operator_plugin_test.go` around lines 509 - 529, The test TestProcessMessages_TriggerLogsFailureClosesConn relies on TriggerLogs using hardcoded :8081 which makes the negative-path flaky; modify processMessages/TriggerLogs to accept the logs endpoint (host:port or an http.Client/URL) as an injectable dependency and update the test to create a private listener (or a dummy URL) that you control so you can deterministically force TriggerLogs to fail, or alternatively assert the port is free at test start and t.Skip/fail with a clear message; update references to processMessages and TriggerLogs to accept and propagate the injectable endpoint so the test no longer depends on port 8081 being globally available.
416-429: 💤 Low valueBind the test mock to loopback and drop redundant
Addr.
srv.Addris ignored when you callsrv.Serve(ln)— the listener determines the bind — so it can be removed. Also, listening on":8081"binds to all interfaces, unnecessarily exposing the test mock on dev machines/CI runners (matches theavoid-bind-to-all-interfaces-gostatic analysis hint). Loopback is sufficient here sinceTriggerLogs()connects locally.♻️ Proposed change
- srv := &http.Server{Addr: ":8081", Handler: mux} - ln, err := net.Listen("tcp", ":8081") + srv := &http.Server{Handler: mux} + ln, err := net.Listen("tcp", "127.0.0.1:8081") if err != nil { t.Fatalf("failed to listen on :8081 for mock logs server: %v", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/ptp_operator/ptp_operator_plugin_test.go` around lines 416 - 429, The mock server binds to all interfaces and sets srv.Addr needlessly; update startMockLogsServer to listen on loopback only (use net.Listen("tcp", "127.0.0.1:8081")) and remove the redundant srv.Addr field (create srv := &http.Server{Handler: mux}) since srv.Serve(ln) uses the listener for the bind; keep the rest of the logic (ln, srv.Serve, and srv.Close) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@plugins/ptp_operator/ptp_operator_plugin_test.go`:
- Around line 509-529: The test TestProcessMessages_TriggerLogsFailureClosesConn
relies on TriggerLogs using hardcoded :8081 which makes the negative-path flaky;
modify processMessages/TriggerLogs to accept the logs endpoint (host:port or an
http.Client/URL) as an injectable dependency and update the test to create a
private listener (or a dummy URL) that you control so you can deterministically
force TriggerLogs to fail, or alternatively assert the port is free at test
start and t.Skip/fail with a clear message; update references to processMessages
and TriggerLogs to accept and propagate the injectable endpoint so the test no
longer depends on port 8081 being globally available.
- Around line 416-429: The mock server binds to all interfaces and sets srv.Addr
needlessly; update startMockLogsServer to listen on loopback only (use
net.Listen("tcp", "127.0.0.1:8081")) and remove the redundant srv.Addr field
(create srv := &http.Server{Handler: mux}) since srv.Serve(ln) uses the listener
for the bind; keep the rest of the logic (ln, srv.Serve, and srv.Close)
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Enterprise
Run ID: 04a95807-d0e2-4608-8c04-cddcf6d78abe
📒 Files selected for processing (2)
plugins/ptp_operator/ptp_operator_plugin.goplugins/ptp_operator/ptp_operator_plugin_test.go
Recognizes the CMD LIVE_START control command sent by linuxptp-daemon after the replay gate opens. The marker is silently consumed so it does not reach ExtractMetrics. TriggerLogs remains asynchronous (goroutine) to prevent a recursive deadlock: the /emit-logs handler writes replay data through the EventHandler's socket connection, which CEP accepts as a new processMessages goroutine that also calls TriggerLogs — blocking the reader while the handler writes back to the same connection. The daemon's liveGate independently ensures replay ordering. Includes unit tests covering CMD LIVE_START handling, TriggerLogs failure resilience, aliasReady gating, and constant validation.
Temporary debug logs to trace the sequence of events during socket reconnection and clock state metrics updates. Remove before merge.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/ptp_operator/metrics/metrics.go`:
- Line 124: The ExtractMetrics hot path is using log.Infof for per-message parse
details which will flood logs; change those per-message Infof calls in
ExtractMetrics (the lines logging processName, configName, status and the other
per-message parse log around line 228) to a debug-level log (e.g., log.Debugf)
or wrap them in a debug-enabled guard (e.g., if log.IsDebugEnabled() {
log.Debugf(...) }) so normal operation uses no- or low-overhead logging while
keeping the detailed messages available when debug is enabled.
In `@plugins/ptp_operator/ptp_operator_plugin.go`:
- Around line 533-540: The current async launch of TriggerLogs in
processMessages causes race between replay and live scanning; change it so
TriggerLogs is executed synchronously (or block until it finishes) before the
scanner starts: locate the processMessages function where
eventManager.TriggerLogs() is called (currently inside a go func) and remove the
goroutine, or replace it with explicit synchronization (e.g., a channel or
WaitGroup) that waits for TriggerLogs to complete and surface any error before
proceeding to start the scanner; ensure the log.Warnf/log.Info handling remains
but runs after TriggerLogs returns so replay completes before live consumption
begins.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 3db52ec2-e8a4-4e82-80e6-5f3458f5d31e
📒 Files selected for processing (5)
plugins/ptp_operator/metrics/manager.goplugins/ptp_operator/metrics/metrics.goplugins/ptp_operator/metrics/registry.goplugins/ptp_operator/ptp_operator_plugin.goplugins/ptp_operator/ptp_operator_plugin_test.go
✅ Files skipped from review due to trivial changes (2)
- plugins/ptp_operator/metrics/manager.go
- plugins/ptp_operator/metrics/registry.go
| // Process status messages should always be processed regardless of profile configuration | ||
| if strings.Contains(output, ptpProcessStatusIdentifier) { | ||
| if status, err := p.parsePTPStatus(output, fields); err == nil { | ||
| log.Infof("DEBUG ExtractMetrics: process status detected: process=%s config=%s status=%d", processName, configName, status) |
There was a problem hiding this comment.
Infof in the hot parser path will flood logs and hurt throughput.
Line 124 and Line 228 log per-message parse details at info level inside ExtractMetrics, which is a high-frequency path. This can create heavy I/O and noisy operational logs under load.
Suggested fix
- log.Infof("DEBUG ExtractMetrics: process status detected: process=%s config=%s status=%d", processName, configName, status)
+ log.Debugf("DEBUG ExtractMetrics: process status detected: process=%s config=%s status=%d", processName, configName, status)
@@
- log.Infof("DEBUG ExtractMetrics: offset parsed: process=%s iface=%s offset=%.0f syncState=%s", processName, interfaceName, ptpOffset, syncState)
+ log.Debugf("DEBUG ExtractMetrics: offset parsed: process=%s iface=%s offset=%.0f syncState=%s", processName, interfaceName, ptpOffset, syncState)Also applies to: 228-228
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/ptp_operator/metrics/metrics.go` at line 124, The ExtractMetrics hot
path is using log.Infof for per-message parse details which will flood logs;
change those per-message Infof calls in ExtractMetrics (the lines logging
processName, configName, status and the other per-message parse log around line
228) to a debug-level log (e.g., log.Debugf) or wrap them in a debug-enabled
guard (e.g., if log.IsDebugEnabled() { log.Debugf(...) }) so normal operation
uses no- or low-overhead logging while keeping the detailed messages available
when debug is enabled.
| log.Info("DEBUG processMessages: firing async TriggerLogs") | ||
| go func() { | ||
| if err := eventManager.TriggerLogs(); err != nil { | ||
| log.Warnf("failed to trigger logs on new connection: %v", err) | ||
| } else { | ||
| log.Info("DEBUG processMessages: TriggerLogs completed successfully") | ||
| } | ||
| }() |
There was a problem hiding this comment.
Replay/live ordering is still racy due to async TriggerLogs().
At Line 534, TriggerLogs() is launched in a goroutine and the scanner starts immediately, so live lines can be consumed before replay is triggered/completed. This defeats replay-before-live ordering guarantees.
Suggested fix
- if eventManager != nil {
- log.Info("DEBUG processMessages: firing async TriggerLogs")
- go func() {
- if err := eventManager.TriggerLogs(); err != nil {
- log.Warnf("failed to trigger logs on new connection: %v", err)
- } else {
- log.Info("DEBUG processMessages: TriggerLogs completed successfully")
- }
- }()
- }
+ if eventManager != nil {
+ log.Info("DEBUG processMessages: triggering logs before scan")
+ if err := eventManager.TriggerLogs(); err != nil {
+ log.Warnf("failed to trigger logs on new connection: %v; closing connection", err)
+ _ = c.Close()
+ return
+ }
+ log.Info("DEBUG processMessages: TriggerLogs completed successfully")
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| log.Info("DEBUG processMessages: firing async TriggerLogs") | |
| go func() { | |
| if err := eventManager.TriggerLogs(); err != nil { | |
| log.Warnf("failed to trigger logs on new connection: %v", err) | |
| } else { | |
| log.Info("DEBUG processMessages: TriggerLogs completed successfully") | |
| } | |
| }() | |
| if eventManager != nil { | |
| log.Info("DEBUG processMessages: triggering logs before scan") | |
| if err := eventManager.TriggerLogs(); err != nil { | |
| log.Warnf("failed to trigger logs on new connection: %v; closing connection", err) | |
| _ = c.Close() | |
| return | |
| } | |
| log.Info("DEBUG processMessages: TriggerLogs completed successfully") | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/ptp_operator/ptp_operator_plugin.go` around lines 533 - 540, The
current async launch of TriggerLogs in processMessages causes race between
replay and live scanning; change it so TriggerLogs is executed synchronously (or
block until it finishes) before the scanner starts: locate the processMessages
function where eventManager.TriggerLogs() is called (currently inside a go func)
and remove the goroutine, or replace it with explicit synchronization (e.g., a
channel or WaitGroup) that waits for TriggerLogs to complete and surface any
error before proceeding to start the scanner; ensure the log.Warnf/log.Info
handling remains but runs after TriggerLogs returns so replay completes before
live consumption begins.
Summary
TriggerLogs()call synchronous inprocessMessages()(removego func()wrapper)CMD LIVE_STARTmarker in the scanner loop (log and skip, don't pass to ExtractMetrics)Problem
The async TriggerLogs allowed live data to race with replay data on separate processMessages goroutines. With the daemon-side live gate (linuxptp-daemon PR), no data arrives on ptp4l connections until replay completes, making synchronous TriggerLogs safe.
Fix
TriggerLogs is now called synchronously before the scanner starts. The
CMD LIVE_STARTmarker provides a clear demarcation between replay and live data phases, enabling observability and correctness guarantees.Dependencies
Test plan
go vet ./plugins/ptp_operator/...passesCMD LIVE_STARTmarker logged in CEP output after reconfiguration