
observability: add metrics instrumentation to DataParallelInferenceCoordinator#4586

Open
DhineshPonnarasan wants to merge 4 commits into NVIDIA:main from DhineshPonnarasan:feature/coordinator-observability

Conversation

Contributor

@DhineshPonnarasan DhineshPonnarasan commented May 1, 2026

Summary

Implements PR 2 — Observability Enhancements for the inference coordinator, part of the follow-up work tracked in issue #4176.

This PR builds on protocol robustness and adds production-ready observability for routing quality, reliability, and latency — without modifying routing logic or protocol behavior.


Context

Issue #4176 tracks follow-up improvements after prefix-cache routing enhancements.

This PR is intentionally scoped to observability only to keep review focused and low-risk.


What’s Included

1. Metrics Abstraction

New file:

  • megatron/core/inference/coordinator_metrics.py

Provides:

  • `CoordinatorMetrics` (minimal interface: `inc`, `observe`, `gauge`)
  • `NoOpMetrics` (default zero-overhead implementation)

Design goals:

  • Backend-agnostic (Prometheus / StatsD / OpenTelemetry compatible)
  • No global state
  • Safe default when observability is disabled
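A minimal sketch of what this abstraction could look like. The method names `inc`, `observe`, and `gauge` come from the PR description; the exact signatures, docstrings, and any label handling in the real `coordinator_metrics.py` are assumptions:

```python
from abc import ABC, abstractmethod


class CoordinatorMetrics(ABC):
    """Backend-agnostic metrics interface (sketch, signatures assumed)."""

    @abstractmethod
    def inc(self, name: str, value: int = 1) -> None:
        """Increment a counter metric."""

    @abstractmethod
    def observe(self, name: str, value: float) -> None:
        """Record one observation, e.g. a latency sample in seconds."""

    @abstractmethod
    def gauge(self, name: str, value: float) -> None:
        """Set a gauge metric to an absolute value."""


class NoOpMetrics(CoordinatorMetrics):
    """Default backend: every call is a cheap no-op, so instrumentation
    adds essentially zero overhead when observability is disabled."""

    def inc(self, name: str, value: int = 1) -> None:
        pass

    def observe(self, name: str, value: float) -> None:
        pass

    def gauge(self, name: str, value: float) -> None:
        pass
```

Because there is no global registry, a concrete backend (Prometheus, StatsD, OpenTelemetry) only needs to subclass this interface and be passed in at construction time.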

2. Coordinator Instrumentation

Modified:

  • megatron/core/inference/data_parallel_inference_coordinator.py

Routing Quality

  • routing_cache_hit_total
  • routing_cache_miss_total
  • routing_stale_detected_total

Reliability

  • coordinator_invalid_message_total
  • coordinator_internal_error_total
  • coordinator_unknown_sender_total
  • coordinator_engine_unreachable_total
  • coordinator_all_engines_exhausted_total

Latency

  • coordinator_routing_latency_seconds
  • coordinator_message_processing_latency_seconds

System State

  • coordinator_active_engines (gauge)

3. Implementation Details

  • Metrics injected via `metrics: CoordinatorMetrics | None = None`
  • Defaults to `NoOpMetrics` (zero overhead when disabled)
  • `_log_protocol_error(...)` centralizes error classification for structured logging
  • Double-count prevention:

    • `record_metrics` guard ensures routing metrics are emitted once per request
  • Latency:

    • Uses `time.monotonic()`
    • Message latency recorded via try/finally (covers failures)
  • No changes to:

    • Routing decisions
    • Prefix cache logic
    • Protocol semantics
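The injection, latency, and guard patterns above can be sketched as follows. `Coordinator`, `process_message`, and `_route` are hypothetical stand-ins for the real coordinator code, which is far larger; only the `time.monotonic()` / try/finally / `record_metrics` structure reflects the PR description:

```python
import time


class NoOpMetrics:
    """Zero-overhead default backend (stub for this sketch)."""

    def inc(self, name, value=1):
        pass

    def observe(self, name, value):
        pass


class Coordinator:
    """Hypothetical stand-in for DataParallelInferenceCoordinator."""

    def __init__(self, metrics=None):
        # Injection point: metrics: CoordinatorMetrics | None = None.
        self.metrics = metrics if metrics is not None else NoOpMetrics()

    def process_message(self, message):
        start = time.monotonic()
        try:
            self._route(message)
        finally:
            # try/finally: latency is recorded even if routing raises.
            self.metrics.observe(
                "coordinator_message_processing_latency_seconds",
                time.monotonic() - start,
            )

    def _route(self, message, record_metrics=True):
        # record_metrics guard: retry paths would pass record_metrics=False
        # so hit/miss counters are emitted once per request.
        if record_metrics:
            self.metrics.inc("routing_cache_hit_total")
```

Using `time.monotonic()` rather than `time.time()` keeps the latency samples immune to wall-clock adjustments.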

4. Tests

New file:

  • tests/unit_tests/inference/test_coordinator_observability.py

Includes 32 unit tests covering:

  • Metric emission correctness
  • Routing scenarios (hit / miss / stale / fallback)
  • Unknown sender handling
  • Engine failure and exhaustion scenarios
  • Latency metrics (routing + message processing)
  • NoOpMetrics behavior
  • Metrics injection via entrypoint
  • Double-count prevention
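The commit message mentions an in-memory `TestMetrics` backend used by these tests. A rough sketch of such a backend (the actual class in the test file may differ):

```python
from collections import defaultdict


class TestMetrics:
    """In-memory CoordinatorMetrics backend for unit tests (sketch)."""

    def __init__(self):
        self.counters = defaultdict(int)       # name -> running total
        self.observations = defaultdict(list)  # name -> recorded samples
        self.gauges = {}                       # name -> last value set

    def inc(self, name, value=1):
        self.counters[name] += value

    def observe(self, name, value):
        self.observations[name].append(value)

    def gauge(self, name, value):
        self.gauges[name] = value
```

A test can then assert, for example, that `metrics.counters["routing_cache_hit_total"] == 1` after a simulated cache hit, or that exactly one latency sample was observed per message.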

Metric Naming

| Prefix | Meaning |
| --- | --- |
| `coordinator_*` | System-level metrics |
| `routing_*` | Routing quality metrics |
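Since the interface is backend-agnostic, an adapter can map these metric names onto a concrete wire protocol. A hypothetical StatsD-flavored adapter, reduced to payload formatting (the UDP socket send is omitted and `self.lines` stands in for it):

```python
class StatsdMetrics:
    """Formats CoordinatorMetrics calls as StatsD wire-format lines (sketch)."""

    def __init__(self):
        self.lines = []  # stand-in for a UDP socket send

    def inc(self, name, value=1):
        self.lines.append(f"{name}:{value}|c")          # counter

    def observe(self, name, value):
        self.lines.append(f"{name}:{value * 1000}|ms")  # timer, milliseconds

    def gauge(self, name, value):
        self.lines.append(f"{name}:{value}|g")          # gauge
```

A Prometheus or OpenTelemetry adapter would follow the same shape, delegating `inc`/`observe`/`gauge` to that backend's counter, histogram, and gauge types.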

Testing

pytest tests/unit_tests/inference/test_coordinator_observability.py -v

Result:

  • 32 passed

Note:
Full test suite may not run locally due to pre-existing environment constraints (e.g., missing fused_a2a_config, platform-specific signal handling). These are unrelated to this PR.


Scope

  • Observability only
  • No behavior changes
  • No modifications to routing or protocol logic

Follow-up Work

This PR completes PR 2 — Observability Enhancements as outlined in #4419.

Planned PR 3 — Metadata Policy Extensions

  • TTL-based expiration for prefix metadata
  • Integration with existing cap-based eviction

Additional validation and performance testing (as described in #4176) will be addressed separately.


Result

The coordinator now exposes:

  • Actionable routing quality signals
  • Reliability diagnostics
  • Latency visibility

while maintaining existing behavior and performance characteristics.


Closes #4176 (observability section)


@shanmugamr1992 @Phlip79 @YangFei1990 could you please take a look when you have time?

Happy to incorporate any feedback including additional tests, refinements or scope adjustments if needed.

…ordinator

Implements observability enhancements for DataParallelInferenceCoordinator
as described in issue NVIDIA#4176. Adds a backend-agnostic CoordinatorMetrics
abstraction and instruments the coordinator with 10 metrics covering routing
quality, reliability, and latency.

New file: megatron/core/inference/coordinator_metrics.py
- CoordinatorMetrics ABC with inc(), observe(), gauge()
- NoOpMetrics default (near-zero overhead when observability is disabled)
- Fully decoupled from any specific metrics backend (Prometheus, StatsD, etc.)

Modified: megatron/core/inference/data_parallel_inference_coordinator.py
- metrics: CoordinatorMetrics | None = None param in __init__ and entrypoint
- coordinator_active_engines gauge set at init, on engine removal, and re-registration
- _log_protocol_error() centralizes error classification for structured logging
- routing_cache_hit_total / routing_cache_miss_total / routing_stale_detected_total
  emitted from get_best_data_parallel_rank() with record_metrics guard to prevent
  double-counting on retry loops
- coordinator_engine_unreachable_total fired in _send_to_engine() before EHOSTUNREACH
- coordinator_unknown_sender_total covers SUBMIT_REQUEST, control signals, and SHUTDOWN
- coordinator_all_engines_exhausted_total in for-else when every engine is unreachable
- coordinator_routing_latency_seconds observed after successful engine send
- coordinator_message_processing_latency_seconds in try/finally to cover every message
- coordinator_invalid_message_total / coordinator_internal_error_total via _log_protocol_error

New file: tests/unit_tests/inference/test_coordinator_observability.py
- 32 unit tests with in-memory TestMetrics backend
- Covers all 10 metrics, NoOpMetrics default, double-count prevention,
  entrypoint forwarding, and all unknown-sender paths

No routing or protocol behavior changes. Compatible with PR NVIDIA#4419.

copy-pr-bot Bot commented May 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.




Development

Successfully merging this pull request may close these issues.

Prefix-Cache Coordinator Follow-ups: Protocol Hardening, Metrics Integration, TTL Policy and Churn Validation
