Skaha Platform Metrics #1082

Open
shinybrar wants to merge 22 commits into main from CADC-15588/skaha-platform-metrics

shinybrar (Contributor) commented May 19, 2026

Summary

CADC-15588: GET /v1/session?view=stats now takes cluster totals from the Metrics API instead of deriving them from nodes and pods.

  • MetricsDAO seam plus HttpMetricsDAO, which targets SKAHA_METRICS_BACKEND_URL + /api/v1/metrics/platform
  • Cluster capacity/allocated come from Metrics; session maxCPUCores/maxRAM come from LimitRange (spec.max) or the k8s-resources.json defaultLimit
  • 503 fail-closed on stats only: "Platform statistics unavailable" / "Session resource limits unavailable"
  • NodeDAO is no longer used for stats; PodResourceUsage stays for session listings
  • HttpMetricsDAO is created lazily, so a missing metrics URL no longer breaks /v1/session, /{id}, or view=interactive
  • RAM ceilings use the memory-count formatter (24G/20Gi), not the bytes overload (which caused the 0G bug)
  • Non-numeric CPU strings from Metrics map to 0.0 cores instead of throwing NumberFormatException
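
For reviewers, the session-limit selection described in the summary might look roughly like the sketch below. It is not taken from the diff: the class, the `Limits` type, the `resolve` signature, and the exception are hypothetical stand-ins, and it assumes that the k8s-resources.json defaultLimit applies only when LimitRange is disabled.

```java
import java.util.Optional;

/** Hypothetical sketch; none of these type or method names come from the PR. */
final class SessionLimitsSketch {

    /** Session ceilings as reported on view=stats (CPU cores, RAM in GiB). */
    static final class Limits {
        final double maxCpuCores;
        final long maxRamGiB;

        Limits(double maxCpuCores, long maxRamGiB) {
            this.maxCpuCores = maxCpuCores;
            this.maxRamGiB = maxRamGiB;
        }
    }

    /** Stand-in for an error that the servlet layer would render as HTTP 503. */
    static final class StatsUnavailableException extends RuntimeException {
        StatsUnavailableException(String message) {
            super(message);
        }
    }

    /**
     * Picks the session ceilings.
     * Assumption: with LimitRange disabled, the configured defaultLimit is used;
     * with LimitRange enabled, spec.max is required and its absence fails closed.
     */
    static Limits resolve(boolean limitRangeEnabled,
                          Optional<Limits> limitRangeMax,
                          Limits defaultLimit) {
        if (!limitRangeEnabled) {
            return defaultLimit;
        }
        return limitRangeMax.orElseThrow(
            () -> new StatsUnavailableException("Session resource limits unavailable"));
    }
}
```

Failing closed here means a client never sees a guessed ceiling: it gets either the real limit or the stable 503 message.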

Test plan

  • ./gradlew clean check (skaha)
  • GetActionResourceStatsTest — 503 paths, lazy DAO, literal 24G/20Gi
  • PlatformStatsIntTest vs live Skaha (optional SKAHA_METRICS_INTTEST_MODE=fixture)

Deploy note

view=stats needs a co-deployed Metrics service and SKAHA_METRICS_BACKEND_URL set. Other session GETs work without it.

shinybrar added 13 commits May 19, 2026 15:51
When session LimitRange is enabled, max* fields on view=stats come from
LimitRange spec.max; missing LimitRange returns 503 with a stable message.
CADC-15749
Fetch platform metrics from SKAHA_METRICS_BACKEND_URL for view=stats
cluster totals; wire production GetAction to HttpMetricsDAO.
CADC-15750
Map Metrics fetch failures on view=stats to TransientException with a
stable client message; other session endpoints are unaffected.
CADC-15751
Platform stats no longer derive cluster totals from nodes or running
pod sums; PodResourceUsage remains for session listings.
CADC-15752
Validate view=stats JSON schema against live Skaha; optional fixture
mode asserts cluster totals when SKAHA_METRICS_INTTEST_MODE=fixture.
CADC-15753
Session create/delete coverage no longer calls view=stats; platform
stats are covered by PlatformStatsIntTest.
CADC-15754
Complete removal of node-based capacity for platform stats.
CADC-15752
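
The fail-closed mapping for Metrics fetch failures (CADC-15751) could be sketched as follows. This is an illustration under assumptions, not the code in this PR: `fetchPlatformTotals`, `ClusterTotals`, and `TransientStatsException` are hypothetical names, and the exception is a local stand-in for the TransientException mentioned in the commit message.

```java
import java.io.IOException;

/** Hypothetical sketch of mapping Metrics failures to one stable client message. */
final class StatsFailClosedSketch {

    /** Cluster totals as they might be read from the Metrics backend. */
    static final class ClusterTotals {
        final double cpuCapacityCores;
        final double cpuAllocatedCores;

        ClusterTotals(double cpuCapacityCores, double cpuAllocatedCores) {
            this.cpuCapacityCores = cpuCapacityCores;
            this.cpuAllocatedCores = cpuAllocatedCores;
        }
    }

    /** The seam: the method name is an assumption, only the interface name is from the PR. */
    interface MetricsDAO {
        ClusterTotals fetchPlatformTotals() throws IOException;
    }

    /** Local stand-in for a transient (HTTP 503) error type. */
    static final class TransientStatsException extends RuntimeException {
        TransientStatsException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Any fetch failure becomes the same client-facing message; the cause is kept for logs. */
    static ClusterTotals loadTotals(MetricsDAO dao) {
        try {
            return dao.fetchPlatformTotals();
        } catch (IOException | RuntimeException e) {
            throw new TransientStatsException("Platform statistics unavailable", e);
        }
    }
}
```

Only the view=stats path would call `loadTotals`, which is how other session endpoints stay unaffected by a Metrics outage.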
shinybrar added 7 commits May 19, 2026 22:01
Avoid NumberFormatException when Metrics returns non-numeric CPU values.
Defer HttpMetricsDAO until view=stats so unset SKAHA_METRICS_BACKEND_URL does not
break other session GETs. Use memory-count formatting for session RAM ceilings.
Centralize the kubernetes client-java version for the upcoming Metrics API refactor.
Platform test fixtures now live on PlatformMetricsFixtures.
Rename HttpMetricsDAO to PlatformMetricsDAO, add pod metrics DAO/mappers,
and replace kubectl top with the metrics.k8s.io client for session usage.
Route session pod usage through MetricsDAO, document label selector and
primary-container policy, and add mapper, soft-fail, and delegation tests.
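
Deferring the Metrics DAO until view=stats can be illustrated with a memoizing supplier. The sketch below is an assumption about the general technique rather than the PR's implementation: `SessionGet`, `lazy`, and `requireBackendUrl` are hypothetical names, and the returned strings only mark which path ran.

```java
import java.util.function.Supplier;

/** Hypothetical sketch: the metrics URL is validated only when stats are requested. */
final class LazyMetricsSketch {

    /** Memoizing supplier: the factory runs on first get() and its result is reused. */
    static <T> Supplier<T> lazy(Supplier<T> factory) {
        return new Supplier<T>() {
            private T value;
            private boolean initialized;

            @Override
            public synchronized T get() {
                if (!initialized) {
                    value = factory.get(); // if this throws, the next call retries
                    initialized = true;
                }
                return value;
            }
        };
    }

    /** Rejects an unset or empty backend URL. */
    static String requireBackendUrl(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            throw new IllegalStateException("SKAHA_METRICS_BACKEND_URL is not set");
        }
        return configured;
    }

    /** Stand-in for the GET handler; only stats() touches the metrics configuration. */
    static final class SessionGet {
        private final Supplier<String> metricsUrl;

        SessionGet(String configuredUrl) {
            this.metricsUrl = lazy(() -> requireBackendUrl(configuredUrl));
        }

        String list() {
            return "sessions";
        }

        String stats() {
            return "stats from " + metricsUrl.get();
        }
    }
}
```

With this shape, constructing the handler with no URL succeeds, `list()` works, and only `stats()` reports the missing configuration.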
shinybrar added 2 commits May 20, 2026 10:21
Avoid NumberFormatException when the numeric prefix is not an integer.
Collapse platform and pod mappers into PlatformMetrics and PodMetrics,
move test fixtures to test sources, and keep eight production metrics types.
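
The lenient CPU parsing described in these commits might be sketched as below. The PR page does not show the actual parser, so this is an assumption: `parseCores` is a hypothetical helper, and the `m` (millicore) suffix handling follows Kubernetes quantity notation rather than anything stated here.

```java
/** Hypothetical sketch of CPU parsing that never throws NumberFormatException. */
final class CpuCoresSketch {

    /**
     * Returns CPU cores for values such as "2", "1.5", or "250m".
     * Anything unparseable, negative, or non-finite yields 0.0.
     */
    static double parseCores(String raw) {
        if (raw == null) {
            return 0.0;
        }
        String text = raw.trim();
        boolean milli = text.endsWith("m"); // assumed Kubernetes millicore suffix
        if (milli) {
            text = text.substring(0, text.length() - 1);
        }
        try {
            double value = Double.parseDouble(text);
            if (Double.isNaN(value) || Double.isInfinite(value) || value < 0) {
                return 0.0;
            }
            return milli ? value / 1000.0 : value;
        } catch (NumberFormatException e) {
            return 0.0; // soft-fail instead of propagating the parse error
        }
    }
}
```

Parsing as a double rather than an integer is what covers the non-integer numeric prefix case.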
shinybrar marked this pull request as ready for review May 20, 2026 17:54
codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (ff23d97) to head (46fbaff).
⚠️ Report is 14 commits behind head on main.

shinybrar requested a review from at88mph May 20, 2026 17:54