WIP: Pending commits #3
Conversation
| try: | ||
| namespace = arguments.get("namespace", "openshift-ptp") | ||
| command = arguments.get("command", "PARENT_DATA_SET") | ||
| config_file = arguments.get("config_file", "/var/run/ptp4l.0.config") |
There was a problem hiding this comment.
In case we don't have one We will want a tool to lookup the ptpconfig for a given profile as T-BC has two profiles and config files.
Its pretty easy to do as the profile name is embedded as comment in the config file when we generate it.
There was a problem hiding this comment.
OLS has access to resources so it should be able to do that on its own without a specific tool to do it.
There was a problem hiding this comment.
You have to do an exec to find that unfortunatly. We should probably add to a status though
There was a problem hiding this comment.
[Comment authored by Claude]
Agreed — T-BC with two profiles means two config files and the current default of ptp4l.0.config only covers the first. For now the caller can pass the explicit config_file path, but auto-discovery of profiles (by listing /var/run/ptp4l.*.config or reading the embedded profile comment) is a good follow-up. Noted as a known limitation in PR #5.
| first_half_avg = sum(frequencies[:half]) / half | ||
| second_half_avg = sum(frequencies[half:]) / (len(frequencies) - half) | ||
| diff = second_half_avg - first_half_avg | ||
| if abs(diff) < 100: |
There was a problem hiding this comment.
I don't have much of an intuition of if this value is reasonable or not.
I think I'd like to see tools that return a window of all the values we calculate stats for. That way we can workaround somewhat arbitrary values like this if they prove off the mark.
There was a problem hiding this comment.
I'll be honest ... this is a bit of a Claude'ism so possibly half-baked.
| """Determine stability rating from offset stats and clockcheck event count""" | ||
| if offset_stats["std_dev"] is None: | ||
| return "unknown" | ||
| if offset_stats["std_dev"] < 50 and clockcheck_count == 0: |
There was a problem hiding this comment.
50 seems a little high. Although that might depend on hardware. I'm thinking about on WPC I would expect to see single digits for stdev at most 10-15 if threes a couple of hops to the GM and its a bit wonky.
There was a problem hiding this comment.
I think that using standard deviation on a system that analyzes log of an arbitrary length is wrong in general. I noticed that the default is 1000 lines of log, but it can be changed, right? So the standard deviation number we get here is a function of the observation window length. It also depends on the system being perfectly converged and free from any wander, which might not be the case (the global mean you are calculating the RMS error against might move during the observation time).
If it's all about stability ratings as the help message suggests, why not use MTIE (maximum time interval error) over a fixed window? There are other industry accepted metrics to consider and expose to the user / LLM, not a s"stable" / "degraded" / "unstable": MTIE, ADEV. This implementation can be considered:
https://github.com/aewallin/allantools
There was a problem hiding this comment.
This might have some useful stuff around using allantools as this is where we use it for the PTP performance tests.
https://github.com/redhat-partner-solutions/vse-sync-test/tree/main/postprocess/src/vse_sync_pp/analyzers
There was a problem hiding this comment.
I'll be honest ... this is a bit of a Claude'ism so possibly half-baked.
| clockcheck_events = [] | ||
|
|
||
| for log in logs: | ||
| if log.component not in ("ptp4l", "phc2sys"): |
There was a problem hiding this comment.
We will also need to have per profile if that isn't handled at a higher level already.
| overall_servo_state = "unknown" | ||
| for comp in ("ptp4l", "phc2sys"): | ||
| if comp in components and components[comp]["servo_state"] != "unknown": | ||
| overall_servo_state = components[comp]["servo_state"] |
There was a problem hiding this comment.
We probably want to do something like a worse of them for the overall state not just the last one.
| "pmc_priority2": r"grandmasterPriority2\s+(\d+)", | ||
| "pmc_parent_port": r"parentPortIdentity\s+([a-f0-9.-]+)", | ||
| "clockcheck": r"clockcheck:\s*(.+)", | ||
| "servo_state": r"offset\s+(-?\d+)\s+(s[0-2])\s+freq\s+(-?\d+)", |
There was a problem hiding this comment.
We will want to handle switching PHCs to handle HA events as well.
| recommendations.append("Check grandmaster stability") | ||
|
|
||
| if servo_stats["clockcheck_events"]: | ||
| recommendations.append(f"Detected {len(servo_stats['clockcheck_events'])} clockcheck events - investigate frequency changes") |
There was a problem hiding this comment.
Clock check events are also a sign of two source updating the same lock e.g two profiles with both running phc2sys for example.
| result["current_states"][port_num] = to_state | ||
|
|
||
| if port_num not in result["ports"]: | ||
| result["ports"][port_num] = { |
There was a problem hiding this comment.
Port number can be translated in to the name by looking at the ptpconfig would be useful for user understanding.
| return ListToolsResult(tools=tools) | ||
|
|
||
| @self.server.call_tool() | ||
| async def handle_call_tool(name: str, arguments: Dict[str, Any]) -> CallToolResult: |
There was a problem hiding this comment.
I'm kinda surprised the server framework hasn't abstracted this dispatch function away. Seems like some low hanging fruit.
| async def get_port_status(self, arguments: Dict[str, Any]) -> Dict[str, Any]: | ||
| """Get PTP port states and transition history""" | ||
| try: | ||
| namespace = arguments.get("namespace", "openshift-ptp") |
There was a problem hiding this comment.
this could probably be a bit nicer just by ** unpacking the args in the dispatch function.
| port_data = self.log_parser.extract_port_transitions(logs) | ||
|
|
||
| if interface: | ||
| filtered_transitions = [t for t in port_data["transitions"] if t.get("port") == interface] |
There was a problem hiding this comment.
Port name to port num transtion is probably wanted here or in the extraction function.
| return result | ||
|
|
||
| def extract_gnss_status(self, logs: List[LogEntry]) -> Dict[str, Any]: | ||
| """Extract detailed GNSS status from logs""" |
There was a problem hiding this comment.
We might want to think about using ubxtool for this or having it as a back up option like we have with PMC and port data.
a8994d2 to
c794b67
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add analyze_servo_stability MCP tool that analyzes PI servo controller behavior by extracting per-component (ptp4l, phc2sys) offset and frequency statistics, detecting clockcheck events, and providing stability ratings with actionable recommendations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_port_status MCP tool that extracts port state transitions from logs and enriches with real-time PMC PORT_DATA_SET data for current port state, delay mechanism, and sync/announce intervals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_gnss_status MCP tool that extracts GNSS receiver status from ts2phc and gnss daemon logs including fix quality, NMEA delay, signal stability assessment, and loss event tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add analyze_holdover MCP tool that tracks holdover entry/exit events, calculates current and accumulated holdover duration, and reports configured holdover timeout from PtpConfig thresholds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add analyze_frequency_drift MCP tool that extracts frequency adjustment trends from phc2sys logs, detects sudden changes (>1000 ppb), calculates drift rate in ppb/hour, and classifies oscillator stability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Enrich existing tools with deeper analysis: - analyze_sync_status: add path delay and asymmetry analysis option - get_clock_hierarchy: PMC enrichment, hierarchy chain display with grandmaster/parent/node levels, human-readable summary - extract_sync_status: port state awareness, grandmaster loss detection, DPLL/servo state correlation with timestamp ordering - extract_clock_hierarchy: parse GrandmasterIdentity, parent clock, steps_removed from daemon logs and PMC output Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Comprehensive troubleshooting reference covering sync loss, clock class changes, GNSS issues, holdover behavior, and ITU-T G.8275.1 compliance. Intended as RAG content for OpenShift Lightspeed integration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_ptp_hardware_info MCP tool that exec's into PTP daemon pods to inspect network interface hardware capabilities via ethtool. Detects hw timestamping support, PHC devices, driver info, and provides a PTP readiness assessment per interface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add map_hardware_to_config MCP tool that cross-references PtpConfig profiles against actual interface hardware capabilities to identify misconfigurations such as PTP assigned to interfaces without hardware timestamping or PHC device support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_ptp_metrics MCP tool that fetches Prometheus metrics from port 9091 inside PTP daemon pods, parses the text exposition format, filters to PTP-related metrics, and computes summary statistics (min/max/avg) grouped by metric name with label breakdowns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Allain Legacy <alegacy@redhat.com>
c794b67 to
1fc8d79
Compare
No description provided.