Skip to content

feat(tonic-xds): add OutlierDetectionConfig types (gRFC A50)#2604

Open
LYZJU2019 wants to merge 2 commits intohyperium:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detection-config
Open

feat(tonic-xds): add OutlierDetectionConfig types (gRFC A50)#2604
LYZJU2019 wants to merge 2 commits intohyperium:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detection-config

Conversation

@LYZJU2019
Copy link
Copy Markdown

@LYZJU2019 LYZJU2019 commented Apr 24, 2026

Summary

First of a 4-PR series implementing gRFC A50: xDS Outlier Detection in tonic-xds. This PR is type definitions only — no behavior change.

Defines three validated config types that the algorithm will consume:

  • OutlierDetectionConfig — sweep interval, base/max ejection time, max ejection percent, plus the two optional sub-configs.
  • SuccessRateConfigstdev_factor, enforcement_percentage, minimum_hosts, request_volume.
  • FailurePercentageConfigthreshold, enforcement_percentage, minimum_hosts, request_volume.
  • OutlierDetectionConfig::is_enabled() — true iff at least one ejection algorithm is configured.

Sub-configs are Option<_> so the absence of either algorithm is part of the type, not a sentinel value.

Why split this out

The full A50 implementation pulls in (a) proto parsing, (b) the statistical algorithms + sweep engine, (c) per-RPC outcome interception, and (d) wiring into the load-balancing pipeline. Reviewing them in one PR is hard. This series:

  1. This PR — config types only.
  2. Outlier-detection algorithm + sweep engine (pure, unit-tested in isolation).
  3. Per-endpoint RPC outcome interception (a Service wrapper).
  4. Proto parsing into ClusterResource, wire-up in XdsClusterDiscovery (preserving connections on ejection), end-to-end tests, and the lib.rs feature-table update.

Test plan

  • cargo test -p tonic-xds --lib outlier — 3 tests for is_enabled().
  • cargo fmt -p tonic-xds --check clean.
  • cargo clippy -p tonic-xds --lib --all-features -- -D warnings clean.

Define the validated config types consumed by the outlier-detection
algorithm: OutlierDetectionConfig with the global timing/percentage
parameters, plus SuccessRateConfig and FailurePercentageConfig for the
two ejection algorithms.

This PR contains only the type definitions. Proto parsing from
envoy.config.cluster.v3.OutlierDetection and the ClusterResource field
land in a follow-up PR alongside the load-balancing-pipeline wiring,
keeping the algorithm PR self-contained and easy to review.

Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md
@LYZJU2019 LYZJU2019 force-pushed the lyzju2019/a50-outlier-detection-config branch from e58a837 to 9229eaa Compare April 24, 2026 21:43
@LYZJU2019 LYZJU2019 changed the title feat(tonic-xds): parse Cluster.outlier_detection config (gRFC A50) feat(tonic-xds): add OutlierDetectionConfig types (gRFC A50) Apr 24, 2026
@LYZJU2019 LYZJU2019 marked this pull request as ready for review April 26, 2026 04:42
/// for ejection.
pub threshold: u32,
/// Probability (0-100) that a candidate is actually ejected.
pub enforcement_percentage: u32,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also add validation logic? also u16 should suffice here? as max is 100? same with other config values

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^, better enforce invariants through type system for all Resouce types since we have the validation and NACK workflow

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added a Percentage(u8) newtype with a fallible new(value: u32) -> Result<Self, PercentageError> constructor (returns the proto wire type u32 so the parser doesn't have to cast at every site, then narrows to u8 once validated). The four percentage fields now hold Percentage instead of u32: max_ejection_percent, enforcing_success_rate, threshold, enforcing_failure_percentage. Once constructed, the algorithm never has to re-validate.

pub stdev_factor: u32,
/// Probability (0-100) that a candidate is actually ejected.
pub enforcement_percentage: u32,
/// Minimum number of candidate endpoints required to run the algorithm.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about enforcing_success_rate as per the spec?

/// Failure-percentage ejection parameters. `None` if the algorithm is
/// disabled.
pub failure_percentage: Option<FailurePercentageConfig>,
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no child_policy ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline with @ankurmittal. For now we only support P2C balancer so omit child_policy in this PR. We can add this field later if needed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will still need child policy to wrap the balancer so that we are not picking failed endpoint again and again, but can be done in future PR

//!
//! These are the validated config inputs consumed by the outlier-detection
//! algorithm. Parsing them from `envoy.config.cluster.v3.OutlierDetection`
//! and exposing them on `ClusterResource` lands in a follow-up PR alongside
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For "progress tracking" comments like these better put them in TODO comments so they can be cleaned up later, or just leave them as PR comments instead.

- Trim module docstring; drop "lands in a follow-up PR" framing.
- Note in the docstring why there is no `child_policy` yet (tonic-xds
  has only one balancer; the field will land alongside more balancers).
- Rename `enforcement_percentage` → `enforcing_success_rate` /
  `enforcing_failure_percentage` to match the Envoy proto field names.
- Introduce a local `Percentage(u8)` newtype with a fallible constructor
  and use it for `max_ejection_percent`, `enforcing_success_rate`,
  `threshold`, and `enforcing_failure_percentage` so the 0..=100
  invariant is enforced through the type system. Add tests covering
  the constructor's range checks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants