Is your feature request related to a problem? Please describe.
Today, the runners/ directory is Slurm-centric for multi-node setups — e.g., launch_b200-dgxc-slurm.sh, launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh. Slurm is great for HPC-style clusters, but it's limiting for reproducing these benchmarks on the cloud-native stacks that most production LLM serving actually runs on: Kubernetes on EKS/GKE/AKS/OpenShift and on-prem K8s GPU fleets.
The repo already demonstrates disaggregated serving via NVIDIA Dynamo on Slurm (e.g., launch_gb200-nv.sh + PR #1008 for Kimi K2.5 NVFP4 GB200 disaggregated vLLM), so disaggregation itself is supported — the gap is K8s-native orchestration of the same patterns (disaggregated P/D, KV-cache-aware routing, wide-EP, autoscaling). Without that, community users can't easily reproduce InferenceX results in their own K8s environments, and newer serving patterns that are first-class in K8s-native stacks are harder to cover.
Describe the solution you'd like
Add a first-class Kubernetes-native runner, analogous to the existing Slurm runners, using llm-d as the reference orchestration stack. Concretely:
- New runner(s) under runners/ (e.g., launch_b200-k8s-llmd.sh, launch_mi355x-k8s-llmd.sh) that stand up llm-d on a K8s cluster and drive benchmarks through the existing harness.
- Reuse llm-d's upstream Helm charts and reproducible benchmark workflows (shipped in llm-d v0.5, Feb 2026), which already include validated B200 numbers (~3.1k tok/s per decode GPU on wide-EP; up to 50k output tok/s on a 16×16 B200 P/D topology). This minimizes new orchestration code on the InferenceX side.
- Integration with benchmarks/ so K8s-native results are directly comparable to Slurm-based runs on the same metrics (TTFT, ITL, throughput, goodput, per-GPU utilization).
- Support the serving patterns llm-d exposes natively: disaggregated prefill/decode via NIXL, KV-cache-aware inference scheduling via the Gateway API, wide-EP for MoE models (DeepSeek, Qwen3.5, gpt-oss), and tiered KV offload.
- Docs for running InferenceX benchmarks on a K8s cluster (GB200 NVL72 / B200 / H100 / MI355X) using llm-d as the orchestration layer.
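To make the proposal concrete, here is a minimal sketch of what such a runner could look like. Everything in it is an assumption for illustration: the chart name (llm-d/llm-d), the values keys (model.name, prefill.replicas, decode.replicas), and the label selector are not llm-d's actual Helm schema and would be replaced with the real ones from llm-d's upstream charts.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of runners/launch_b200-k8s-llmd.sh.
# Chart name and --set keys below are illustrative placeholders, not
# llm-d's real Helm values schema.
set -euo pipefail

NAMESPACE="${NAMESPACE:-llm-d-bench}"
RELEASE="${RELEASE:-llmd-b200}"
MODEL="${MODEL:-deepseek-ai/DeepSeek-R1}"
PREFILL_REPLICAS="${PREFILL_REPLICAS:-4}"   # disaggregated prefill pool
DECODE_REPLICAS="${DECODE_REPLICAS:-8}"     # disaggregated decode pool
DRY_RUN="${DRY_RUN:-1}"                     # default: print, don't install

# Assemble the helm invocation as an array so it can be inspected or run.
HELM_CMD=(helm upgrade --install "$RELEASE" llm-d/llm-d
  --namespace "$NAMESPACE" --create-namespace
  --set "model.name=$MODEL"
  --set "prefill.replicas=$PREFILL_REPLICAS"
  --set "decode.replicas=$DECODE_REPLICAS")

if [ "$DRY_RUN" = "1" ]; then
  printf '%s\n' "${HELM_CMD[*]}"
else
  "${HELM_CMD[@]}"
  kubectl -n "$NAMESPACE" rollout status deployment \
    -l "app.kubernetes.io/instance=$RELEASE"
  # Then point the existing benchmark harness at the Gateway endpoint.
fi
```

The point of the sketch is the shape: the runner only parameterizes and applies llm-d's upstream charts, then hands off to the existing harness, so InferenceX carries almost no new orchestration code.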
Describe alternatives you've considered
- Slurm-only (status quo): works for the current set of supported clusters, but limits reproducibility for the broader K8s-based community and makes it harder to benchmark K8s-native patterns (Gateway-API-based smart routing, HPA/VPA autoscaling, workload-variant autoscaler).
- Raw Kubernetes Deployments/StatefulSets without llm-d: workable, but reinvents disaggregated serving, KV-cache-aware routing, and autoscaling that llm-d already provides on top of vLLM/SGLang.
- Ray Serve / KServe / NVIDIA Dynamo on K8s: viable alternatives — could be added as additional K8s runners later. llm-d seems like a strong first target because it's purpose-built for distributed LLM inference, aligns with the vLLM/SGLang stack already used here, is Apache-2.0, and has an existing reproducible benchmark workflow that can be leveraged directly.
Additional context
- llm-d: https://github.com/llm-d/llm-d — Kubernetes-native distributed inference stack with disaggregated P/D, KV-cache-aware scheduling, wide-EP, and native vLLM/SGLang support. Supported accelerators per their docs include NVIDIA A100+, AMD MI250+, Intel GPU Max, and Google TPU v5e+ — overlapping well with InferenceX's hardware coverage.
- A K8s-native runner would also make it easier to onboard new accelerators/clouds without waiting for Slurm integration on each provider.
- Happy to help prototype a runner if maintainers are interested and can point at a preferred starting cluster (B200 or MI355X).