observability: add metrics instrumentation to DataParallelInferenceCoordinator#4586
Open
DhineshPonnarasan wants to merge 4 commits intoNVIDIA:mainfrom
Open
observability: add metrics instrumentation to DataParallelInferenceCoordinator#4586DhineshPonnarasan wants to merge 4 commits intoNVIDIA:mainfrom
DhineshPonnarasan wants to merge 4 commits intoNVIDIA:mainfrom
Conversation
…and Dockerfile.linting
…ordinator Implements observability enhancements for DataParallelInferenceCoordinator as described in issue NVIDIA#4176. Adds a backend-agnostic CoordinatorMetrics abstraction and instruments the coordinator with 10 metrics covering routing quality, reliability, and latency. New file: megatron/core/inference/coordinator_metrics.py - CoordinatorMetrics ABC with inc(), observe(), gauge() - NoOpMetrics default (near-zero overhead when observability is disabled) - Fully decoupled from any specific metrics backend (Prometheus, StatsD, etc.) Modified: megatron/core/inference/data_parallel_inference_coordinator.py - metrics: CoordinatorMetrics | None = None param in __init__ and entrypoint - coordinator_active_engines gauge set at init, on engine removal, and re-registration - _log_protocol_error() centralizes error classification for structured logging - routing_cache_hit_total / routing_cache_miss_total / routing_stale_detected_total emitted from get_best_data_parallel_rank() with record_metrics guard to prevent double-counting on retry loops - coordinator_engine_unreachable_total fired in _send_to_engine() before EHOSTUNREACH - coordinator_unknown_sender_total covers SUBMIT_REQUEST, control signals, and SHUTDOWN - coordinator_all_engines_exhausted_total in for-else when every engine is unreachable - coordinator_routing_latency_seconds observed after successful engine send - coordinator_message_processing_latency_seconds in try/finally to cover every message - coordinator_invalid_message_total / coordinator_internal_error_total via _log_protocol_error New file: tests/unit_tests/inference/test_coordinator_observability.py - 32 unit tests with in-memory TestMetrics backend - Covers all 10 metrics, NoOpMetrics default, double-count prevention, entrypoint forwarding, and all unknown-sender paths No routing or protocol behavior changes. Compatible with PR NVIDIA#4419.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements PR 2 — Observability Enhancements for the inference coordinator as part of:
This PR builds on protocol robustness and adds production-ready observability for routing quality, reliability, and latency — without modifying routing logic or protocol behavior.
Context
Issue #4176 tracks follow-up improvements after prefix-cache routing enhancements.
This PR is intentionally scoped to observability only to keep review focused and low-risk.
What’s Included
1. Metrics Abstraction
New file:
megatron/core/inference/coordinator_metrics.pyProvides:
CoordinatorMetrics(minimal interface:inc,observe,gauge)NoOpMetrics(default zero-overhead implementation)Design goals:
2. Coordinator Instrumentation
Modified:
megatron/core/inference/data_parallel_inference_coordinator.pyRouting Quality
routing_cache_hit_totalrouting_cache_miss_totalrouting_stale_detected_totalReliability
coordinator_invalid_message_totalcoordinator_internal_error_totalcoordinator_unknown_sender_totalcoordinator_engine_unreachable_totalcoordinator_all_engines_exhausted_totalLatency
coordinator_routing_latency_secondscoordinator_message_processing_latency_secondsSystem State
coordinator_active_engines(gauge)3. Implementation Details
Defaults to
NoOpMetrics(zero overhead when disabled)_log_protocol_error(...):client_errorvsinternal_error)Double-count prevention:
record_metricsguard ensures routing metrics are emitted once per requestLatency:
time.monotonic()try/finally(covers failures)No changes to:
4. Tests
New file:
tests/unit_tests/inference/test_coordinator_observability.pyIncludes 32 unit tests covering:
NoOpMetricsbehaviorMetric Naming
Testing
Result:
Note:
Full test suite may not run locally due to pre-existing environment constraints (e.g., missing
fused_a2a_config, platform-specific signal handling). These are unrelated to this PR.Scope
Follow-up Work
This PR completes PR 2 — Observability Enhancements as outlined in #4419.
Planned PR 3 — Metadata Policy Extensions
Additional validation and performance testing (as described in #4176) will be addressed separately.
Result
The coordinator now exposes:
while maintaining existing behavior and performance characteristics.
Closes #4176 (observability section)
@shanmugamr1992 @Phlip79 @YangFei1990 could you please take a look when you have time?
Happy to incorporate any feedback including additional tests, refinements or scope adjustments if needed.