Add a service-pool / load-balancing layer so a user or org can register several interchangeable instances of the same logical service and have the NyxID proxy spread traffic across them, with health-aware failover. Today a UserService binds to exactly one endpoint (+ optional one node), so running N identical backends (e.g. several GPU/compute workers, several upstream API instances, mirrored regions) means manually picking one slug per call — there is no way to say "balance across these."
This is the generic counterpart to what PR #969 (integrations/compute-pool-service) deliberately keeps out of core. That PR's own framing:
This is not a NyxID service-pool framework. Cross-service counting, quotas, metering, and load balancing should be handled by a future generic NyxID service-pool design rather than by a compute-specific core API. The compute service exposes /v1/status as a capacity signal that such a layer could use later.
So #969 ships the data-plane queue, and this issue tracks the NyxID-side control-plane feature it points at.
Motivation / use case
An operator stands up 3 instances of the same backend (same auth, same API shape) for capacity or redundancy.
Agents / org members keep calling one stable slug; NyxID decides which instance serves each request.
When one instance is unhealthy or at capacity, traffic shifts to the others automatically instead of erroring.
Existing precedent to build on (not from scratch)
NyxID already does node-level failover, just not service-level load balancing:
services/node_routing_service.rs → resolve_node_route() returns NodeRoute { fallback_node_ids: Vec<String> }, and the proxy already walks primary → fallback when a node is offline (test: resolve_node_route_fails_over_from_offline_node_to_online_fallback).
models/user_service.rs binds a single endpoint_id + optional node_id.
services/proxy_service.rs → resolve_proxy_target_from_user_service() resolves one target.
The ask is to generalize that failover into capacity-aware balancing across multiple service instances, not just node fallback within one service.
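As a point of reference, the existing primary-then-fallback walk can be sketched roughly as below. This is only an illustration: `RouteSketch` and `first_online` are hypothetical names, the set of online IDs is assumed to come from wherever NyxID tracks node liveness, and the real `NodeRoute` may carry other fields.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a route with ordered fallbacks.
/// The real `NodeRoute` in NyxID may look different.
pub struct RouteSketch {
    pub primary_id: String,
    pub fallback_ids: Vec<String>,
}

/// Walk the primary first, then the fallbacks in order,
/// and return the first ID that is currently online.
pub fn first_online<'a>(route: &'a RouteSketch, online: &HashSet<String>) -> Option<&'a str> {
    std::iter::once(&route.primary_id)
        .chain(route.fallback_ids.iter())
        .find(|id| online.contains(id.as_str()))
        .map(|id| id.as_str())
}
```

A pool layer would replace the fixed ordering here with a per-request choice among members, which is the part that does not exist today.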
Proposed scope (for architect review — not final)
Pool model: a ServicePool grouping N member UserServices (or N endpoints under one service) that share a stable slug, owned by the same person/org user (reuse org_service::resolve_owner_access for ACL, consistent with Node / UserService).
Balancing strategies: at minimum round-robin and least-in-flight; ideally weighted and capacity-aware using a pluggable health/status signal (a member can expose something like #969's GET /v1/status → {queued, dispatched, active_workers}).
Health checks + failover: periodic or proxy-time health probes; skip or deprioritize unhealthy members; generalize fallback_node_ids to fallback members.
Proxy integration: pool resolution slots into proxy_service target resolution; a slug can resolve to a pool, then to a concrete member per request.
Sticky routing (optional): affinity by session/client_ref for multi-turn/stateful backends.
Observability: per-member request counts and errors, for audit and for the metering work listed as out of scope below.
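To make the pool model and the two minimum strategies concrete, here is a minimal sketch, not a proposed design. Every type and field name (`Member`, `Strategy`, `ServicePool`, `pick`, `in_flight`, `healthy`) is an assumption for illustration; in particular, how `healthy` and `in_flight` get updated is left open.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical pool member; field names are illustrative only.
pub struct Member {
    pub service_id: String,
    pub healthy: bool,
    pub in_flight: usize,
}

pub enum Strategy {
    RoundRobin,
    LeastInFlight,
}

/// Hypothetical pool: one stable slug in front of N members.
pub struct ServicePool {
    pub slug: String,
    pub members: Vec<Member>,
    pub strategy: Strategy,
    cursor: AtomicUsize, // round-robin position, shared across requests
}

impl ServicePool {
    pub fn new(slug: &str, members: Vec<Member>, strategy: Strategy) -> Self {
        Self { slug: slug.to_string(), members, strategy, cursor: AtomicUsize::new(0) }
    }

    /// Pick a healthy member for one request, or None if no member is healthy.
    pub fn pick(&self) -> Option<&Member> {
        let healthy: Vec<&Member> = self.members.iter().filter(|m| m.healthy).collect();
        if healthy.is_empty() {
            return None;
        }
        match self.strategy {
            Strategy::RoundRobin => {
                // Rotate over the healthy subset only.
                let n = self.cursor.fetch_add(1, Ordering::Relaxed);
                Some(healthy[n % healthy.len()])
            }
            // Ties go to the earliest member in the list.
            Strategy::LeastInFlight => healthy.into_iter().min_by_key(|m| m.in_flight),
        }
    }
}
```

Filtering to healthy members before applying the strategy means an unhealthy member is skipped rather than deprioritized; a weighted variant would be one way to express deprioritization instead.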
Explicitly out of scope here (track separately if wanted)
Org/agent quotas, usage counting, and metering across pool members. #969 notes these belong to the same future layer; they're a distinct, larger workstream and shouldn't block basic balancing.
Open questions for architecture
Pool as a new model vs. extending UserService with member_service_ids / a pool_id?
Health signal: standardized contract (a /status-style endpoint NyxID polls) vs. passive (infer health from proxy error rates)? #969's /v1/status is a candidate shape.
Does balancing run only over node-routed members, direct-HTTP members, or both?
Interaction with existing fallback_node_ids — does the node fallback collapse into the pool layer, or stay underneath it?
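One way to keep the health-signal question open is to put both styles behind a single interface, sketched below under stated assumptions. The trait and struct names are hypothetical, the `PolledStatus` fields mirror the `{queued, dispatched, active_workers}` shape quoted from #969, and both thresholds are invented placeholders rather than recommended values.

```rust
/// Hypothetical contract: either signal style answers the same question.
pub trait HealthSignal {
    fn is_available(&self) -> bool;
}

/// Active signal: the last polled status of a member.
/// Field names follow the /v1/status shape quoted from #969.
pub struct PolledStatus {
    pub queued: u32,
    pub dispatched: u32,
    pub active_workers: u32,
}

/// Assumed cap on queued jobs per worker before a member counts as saturated.
const MAX_QUEUED_PER_WORKER: u32 = 4;

impl HealthSignal for PolledStatus {
    fn is_available(&self) -> bool {
        // No workers means no capacity; a deep queue means saturated.
        self.active_workers > 0
            && self.queued <= self.active_workers.saturating_mul(MAX_QUEUED_PER_WORKER)
    }
}

/// Passive signal: counts the proxy observed over some recent window.
pub struct PassiveErrors {
    pub requests: u32,
    pub errors: u32,
}

impl HealthSignal for PassiveErrors {
    fn is_available(&self) -> bool {
        // Assumed rule: unavailable once more than half of recent requests failed.
        // A member with no recent traffic is given the benefit of the doubt.
        self.requests == 0 || self.errors.saturating_mul(2) <= self.requests
    }
}
```

With an interface like this, the balancer would not need to know which style a given member uses, so the choice could be made per member or changed later.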
References
integrations/compute-pool-service(the data-plane service that defers to this)