
Add service-pool / load-balancing layer for multiple identical service instances #974

@chrono-kw

Summary

Add a service-pool / load-balancing layer so a user or org can register several interchangeable instances of the same logical service and have the NyxID proxy spread traffic across them, with health-aware failover. Today a UserService binds to exactly one endpoint (plus, optionally, one node), so running N identical backends (e.g. several GPU/compute workers, several upstream API instances, mirrored regions) means manually picking one slug per call. There is no way to say "balance across these."

This is the generic counterpart to what PR #969 (integrations/compute-pool-service) deliberately keeps out of core. That PR's own framing:

"This is not a NyxID service-pool framework. Cross-service counting, quotas, metering, and load balancing should be handled by a future generic NyxID service-pool design rather than by a compute-specific core API. The compute service exposes /v1/status as a capacity signal that such a layer could use later."

So #969 ships the data-plane queue, and this issue tracks the NyxID-side control-plane feature it points at.

Motivation / use case

  • Operator stands up 3 instances of the same backend (same auth, same API shape) for capacity or redundancy.
  • Agents / org members keep calling one stable slug; NyxID decides which instance serves each request.
  • When one instance is unhealthy or at capacity, traffic shifts to the others automatically instead of erroring.

Existing precedent to build on (not from scratch)

NyxID already does node-level failover, just not service-level load balancing:

  • services/node_routing_service.rs: resolve_node_route() returns NodeRoute { fallback_node_ids: Vec<String> }, and the proxy already walks primary → fallback when a node is offline (test: resolve_node_route_fails_over_from_offline_node_to_online_fallback).
  • models/user_service.rs binds a single endpoint_id + optional node_id.
  • services/proxy_service.rs: resolve_proxy_target_from_user_service() resolves one target.

The ask is to generalize that failover into capacity-aware balancing across multiple service instances, not just node fallback within one service.
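
For orientation, the primary → fallback walk described above can be sketched as follows. This is an illustration under stated assumptions, not the code in node_routing_service.rs: only fallback_node_ids is named in this issue, while primary_node_id, pick_online_node, and the online set are hypothetical stand-ins.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the real `NodeRoute`. Only
/// `fallback_node_ids` is named in this issue; `primary_node_id` is assumed.
pub struct NodeRoute {
    pub primary_node_id: String,
    pub fallback_node_ids: Vec<String>,
}

/// Return the first node that is online, trying the primary first and then
/// each fallback in declared order. `online` is an assumed liveness view.
pub fn pick_online_node<'a>(route: &'a NodeRoute, online: &HashSet<String>) -> Option<&'a str> {
    std::iter::once(&route.primary_node_id)
        .chain(route.fallback_node_ids.iter())
        .find(|id| online.contains(id.as_str()))
        .map(|id| id.as_str())
}
```

A pool layer would apply the same "first usable candidate" idea one level up, to service members rather than nodes, and would replace the fixed ordering with a balancing strategy.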

Proposed scope (for architect review — not final)

  1. Pool model: a ServicePool grouping N member UserServices (or N endpoints under one service) that share a stable slug, owned by the same person/org user (reuse org_service::resolve_owner_access for ACL, consistent with Node / UserService).
  2. Balancing strategies: at minimum round-robin and least-in-flight; ideally also weighted and capacity-aware, using a pluggable health/status signal (a member can expose something like #969's GET /v1/status, which reports {queued, dispatched, active_workers}).
  3. Health checks + failover: periodic or proxy-time health probe; skip/deprioritize unhealthy members; generalize fallback_node_ids to fallback members.
  4. Proxy integration: pool resolution slots into proxy_service target resolution; a slug can resolve to a pool, then to a concrete member per request.
  5. Sticky routing (optional): affinity by session/client_ref for multi-turn/stateful backends.
  6. Observability: per-member request counts / errors for audit + the metering story below.
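
To make items 2 and 3 concrete, here is a minimal sketch of round-robin and least-in-flight selection that skips unhealthy members. Every name in it (Member, Strategy, ServicePool, pick) is hypothetical and chosen for illustration; none of it is existing NyxID API, and a real implementation would need shared, concurrency-safe state rather than &mut self.

```rust
/// Hypothetical pool member; none of these names come from the NyxID codebase.
#[derive(Debug, Clone)]
pub struct Member {
    pub id: String,
    pub healthy: bool,
    pub in_flight: u32,
}

#[derive(Debug, Clone, Copy)]
pub enum Strategy {
    RoundRobin,
    LeastInFlight,
}

pub struct ServicePool {
    pub members: Vec<Member>,
    pub strategy: Strategy,
    cursor: usize, // next round-robin starting position
}

impl ServicePool {
    pub fn new(members: Vec<Member>, strategy: Strategy) -> Self {
        Self { members, strategy, cursor: 0 }
    }

    /// Pick the index of a healthy member, or None if no member is healthy.
    pub fn pick(&mut self) -> Option<usize> {
        let n = self.members.len();
        match self.strategy {
            Strategy::RoundRobin => {
                // Scan at most n slots from the cursor, skipping unhealthy members.
                for offset in 0..n {
                    let idx = (self.cursor + offset) % n;
                    if self.members[idx].healthy {
                        self.cursor = (idx + 1) % n;
                        return Some(idx);
                    }
                }
                None
            }
            // Among healthy members, take the one with the fewest in-flight
            // requests; ties go to the earliest member in the list.
            Strategy::LeastInFlight => self
                .members
                .iter()
                .enumerate()
                .filter(|(_, m)| m.healthy)
                .min_by_key(|(_, m)| m.in_flight)
                .map(|(idx, _)| idx),
        }
    }
}
```

Returning an index rather than a member keeps the sketch independent of how members are stored; the proxy would map the index back to a concrete target and update in_flight around the request.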

Explicitly out of scope here (track separately if wanted)

Open questions for architecture

  • Pool as a new model vs. extending UserService with member_service_ids / a pool_id?
  • Health signal: standardized contract (a /status-style endpoint NyxID polls) vs. passive (infer health from proxy error rates)? #969's /v1/status is a candidate shape.
  • Does balancing run only over node-routed members, direct-HTTP members, or both?
  • Interaction with existing fallback_node_ids — does the node fallback collapse into the pool layer, or stay underneath it?
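
As one possible reading of the standardized-contract option in the health-signal question, the sketch below turns a polled status snapshot into a load score and picks the least-loaded member. The three field names come from this issue's description of #969's /v1/status; the integer types, the scoring formula (outstanding jobs per active worker), and the function names are assumptions made for illustration only.

```rust
/// Capacity snapshot with the three fields this issue lists for #969's
/// `/v1/status`; the integer types are an assumption.
#[derive(Debug, Clone, Copy)]
pub struct Status {
    pub queued: u32,
    pub dispatched: u32,
    pub active_workers: u32,
}

/// Hypothetical load score: outstanding jobs per active worker (lower is
/// better). Returns None when there are no workers, so that the member is
/// treated as unavailable.
pub fn load_score(s: &Status) -> Option<f64> {
    if s.active_workers == 0 {
        return None;
    }
    let outstanding = f64::from(s.queued) + f64::from(s.dispatched);
    Some(outstanding / f64::from(s.active_workers))
}

/// Index of the member with the lowest load score. A `None` entry models a
/// member whose status probe failed; it is skipped, as are members with no
/// active workers.
pub fn pick_least_loaded(statuses: &[Option<Status>]) -> Option<usize> {
    statuses
        .iter()
        .enumerate()
        .filter_map(|(i, s)| s.as_ref().and_then(load_score).map(|score| (i, score)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

A passive variant would feed the same selection step with scores derived from proxy error rates instead of polled snapshots, so the two options in the question differ mainly in where the score comes from.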
