Skip to content

WIP: Pending commits #3

Open
alegacy wants to merge 11 commits into
redhat-cne:mainfrom
alegacy:feature/ai-investigation
Open

WIP: Pending commits #3
alegacy wants to merge 11 commits into
redhat-cne:mainfrom
alegacy:feature/ai-investigation

Conversation

@alegacy
Copy link
Copy Markdown
Contributor

@alegacy alegacy commented Apr 2, 2026

No description provided.

Comment thread ptp_tools.py Outdated
try:
namespace = arguments.get("namespace", "openshift-ptp")
command = arguments.get("command", "PARENT_DATA_SET")
config_file = arguments.get("config_file", "/var/run/ptp4l.0.config")
Copy link
Copy Markdown
Collaborator

@nocturnalastro nocturnalastro Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case we don't have one We will want a tool to lookup the ptpconfig for a given profile as T-BC has two profiles and config files.

Its pretty easy to do as the profile name is embedded as comment in the config file when we generate it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OLS has access to resources so it should be able to do that on its own without a specific tool to do it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to do an exec to find that unfortunatly. We should probably add to a status though

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Comment authored by Claude]

Agreed — T-BC with two profiles means two config files and the current default of ptp4l.0.config only covers the first. For now the caller can pass the explicit config_file path, but auto-discovery of profiles (by listing /var/run/ptp4l.*.config or reading the embedded profile comment) is a good follow-up. Noted as a known limitation in PR #5.

Comment thread ptp_log_parser.py
first_half_avg = sum(frequencies[:half]) / half
second_half_avg = sum(frequencies[half:]) / (len(frequencies) - half)
diff = second_half_avg - first_half_avg
if abs(diff) < 100:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much of an intuition of if this value is reasonable or not.

I think I'd like to see tools that return a window of all the values we calculate stats for. That way we can workaround somewhat arbitrary values like this if they prove off the mark.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest ... this is a bit of a Claude'ism so possibly half-baked.

Comment thread ptp_log_parser.py
"""Determine stability rating from offset stats and clockcheck event count"""
if offset_stats["std_dev"] is None:
return "unknown"
if offset_stats["std_dev"] < 50 and clockcheck_count == 0:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

50 seems a little high. Although that might depend on hardware. I'm thinking about on WPC I would expect to see single digits for stdev at most 10-15 if threes a couple of hops to the GM and its a bit wonky.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vitus133 WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that using standard deviation on a system that analyzes log of an arbitrary length is wrong in general. I noticed that the default is 1000 lines of log, but it can be changed, right? So the standard deviation number we get here is a function of the observation window length. It also depends on the system being perfectly converged and free from any wander, which might not be the case (the global mean you are calculating the RMS error against might move during the observation time).
If it's all about stability ratings as the help message suggests, why not use MTIE (maximum time interval error) over a fixed window? There are other industry accepted metrics to consider and expose to the user / LLM, not a s"stable" / "degraded" / "unstable": MTIE, ADEV. This implementation can be considered:
https://github.com/aewallin/allantools

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might have some useful stuff around using allantools as this is where we use it for the PTP performance tests.
https://github.com/redhat-partner-solutions/vse-sync-test/tree/main/postprocess/src/vse_sync_pp/analyzers

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest ... this is a bit of a Claude'ism so possibly half-baked.

Comment thread ptp_log_parser.py
clockcheck_events = []

for log in logs:
if log.component not in ("ptp4l", "phc2sys"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will also need to have per profile if that isn't handled at a higher level already.

Comment thread ptp_log_parser.py
overall_servo_state = "unknown"
for comp in ("ptp4l", "phc2sys"):
if comp in components and components[comp]["servo_state"] != "unknown":
overall_servo_state = components[comp]["servo_state"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to do something like a worse of them for the overall state not just the last one.

Comment thread ptp_log_parser.py
"pmc_priority2": r"grandmasterPriority2\s+(\d+)",
"pmc_parent_port": r"parentPortIdentity\s+([a-f0-9.-]+)",
"clockcheck": r"clockcheck:\s*(.+)",
"servo_state": r"offset\s+(-?\d+)\s+(s[0-2])\s+freq\s+(-?\d+)",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want to handle switching PHCs to handle HA events as well.

Comment thread ptp_tools.py
recommendations.append("Check grandmaster stability")

if servo_stats["clockcheck_events"]:
recommendations.append(f"Detected {len(servo_stats['clockcheck_events'])} clockcheck events - investigate frequency changes")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clock check events are also a sign of two source updating the same lock e.g two profiles with both running phc2sys for example.

Comment thread ptp_log_parser.py
result["current_states"][port_num] = to_state

if port_num not in result["ports"]:
result["ports"][port_num] = {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Port number can be translated in to the name by looking at the ptpconfig would be useful for user understanding.

Comment thread ptp_mcp_server.py
return ListToolsResult(tools=tools)

@self.server.call_tool()
async def handle_call_tool(name: str, arguments: Dict[str, Any]) -> CallToolResult:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kinda surprised the server framework hasn't abstracted this dispatch function away. Seems like some low hanging fruit.

Comment thread ptp_tools.py
async def get_port_status(self, arguments: Dict[str, Any]) -> Dict[str, Any]:
"""Get PTP port states and transition history"""
try:
namespace = arguments.get("namespace", "openshift-ptp")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could probably be a bit nicer just by ** unpacking the args in the dispatch function.

Comment thread ptp_tools.py
port_data = self.log_parser.extract_port_transitions(logs)

if interface:
filtered_transitions = [t for t in port_data["transitions"] if t.get("port") == interface]
Copy link
Copy Markdown
Collaborator

@nocturnalastro nocturnalastro Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Port name to port num transtion is probably wanted here or in the extraction function.

Comment thread ptp_log_parser.py
return result

def extract_gnss_status(self, logs: List[LogEntry]) -> Dict[str, Any]:
"""Extract detailed GNSS status from logs"""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to think about using ubxtool for this or having it as a back up option like we have with PMC and port data.

@alegacy alegacy force-pushed the feature/ai-investigation branch from a8994d2 to c794b67 Compare April 16, 2026 17:59
alegacy and others added 11 commits May 13, 2026 20:21
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add analyze_servo_stability MCP tool that analyzes PI servo controller
behavior by extracting per-component (ptp4l, phc2sys) offset and
frequency statistics, detecting clockcheck events, and providing
stability ratings with actionable recommendations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_port_status MCP tool that extracts port state transitions from
logs and enriches with real-time PMC PORT_DATA_SET data for current
port state, delay mechanism, and sync/announce intervals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_gnss_status MCP tool that extracts GNSS receiver status from
ts2phc and gnss daemon logs including fix quality, NMEA delay, signal
stability assessment, and loss event tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add analyze_holdover MCP tool that tracks holdover entry/exit events,
calculates current and accumulated holdover duration, and reports
configured holdover timeout from PtpConfig thresholds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add analyze_frequency_drift MCP tool that extracts frequency adjustment
trends from phc2sys logs, detects sudden changes (>1000 ppb), calculates
drift rate in ppb/hour, and classifies oscillator stability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Enrich existing tools with deeper analysis:
- analyze_sync_status: add path delay and asymmetry analysis option
- get_clock_hierarchy: PMC enrichment, hierarchy chain display with
  grandmaster/parent/node levels, human-readable summary
- extract_sync_status: port state awareness, grandmaster loss detection,
  DPLL/servo state correlation with timestamp ordering
- extract_clock_hierarchy: parse GrandmasterIdentity, parent clock,
  steps_removed from daemon logs and PMC output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Comprehensive troubleshooting reference covering sync loss, clock class
changes, GNSS issues, holdover behavior, and ITU-T G.8275.1 compliance.
Intended as RAG content for OpenShift Lightspeed integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_ptp_hardware_info MCP tool that exec's into PTP daemon pods to
inspect network interface hardware capabilities via ethtool. Detects hw
timestamping support, PHC devices, driver info, and provides a PTP
readiness assessment per interface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add map_hardware_to_config MCP tool that cross-references PtpConfig
profiles against actual interface hardware capabilities to identify
misconfigurations such as PTP assigned to interfaces without hardware
timestamping or PHC device support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
Add get_ptp_metrics MCP tool that fetches Prometheus metrics from port
9091 inside PTP daemon pods, parses the text exposition format, filters
to PTP-related metrics, and computes summary statistics (min/max/avg)
grouped by metric name with label breakdowns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Allain Legacy <alegacy@redhat.com>
@alegacy alegacy force-pushed the feature/ai-investigation branch from c794b67 to 1fc8d79 Compare May 14, 2026 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants