infra: remove control-plane SPOFs — HA replicas + PodDisruptionBudgets by UTKARSH698 · Pull Request #124 · imran31415/kube-coder

UTKARSH698 · 2026-06-20T05:23:53Z

Summary

Removes the avoidable single points of failure called out in #106: the controller and both oauth2-proxies ran replicas: 1 with no PodDisruptionBudget, so a single pod crash or node drain locked users out and left the admin control plane with no redundancy.

Changes

Controller (charts/workspace-controller/templates/deployment.yaml): replicas: 2 + topologySpreadConstraints; new templates/pdb.yaml (minAvailable: 1). PDB lives in its own file to match this chart's one-resource-per-file layout (keeps deployment.yaml single-doc).
Controller oauth2-proxy (charts/workspace-controller/templates/oauth2-proxy.yaml): replicas: 2 + spread + appended PDB.
Per-user oauth2-proxy (charts/workspace/templates/oauth2-proxy.yaml): replicas: 2 + spread + PDB, all inside the existing ingress.auth.type=oauth2 guard.
All three are stateless, so scaling is safe. topologySpreadConstraints use whenUnsatisfiable: ScheduleAnyway (soft) so single-node clusters still schedule both replicas.
The workspace pod stays replicas: 1 (RWO PVC), as the issue specifies.
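
For readers who do not have the diff open, the pattern described above can be sketched roughly as follows. This is an illustrative rendering, not the chart's actual output: the resource names, the `app: workspace-controller` label, the image, `maxSkew: 1`, and the `kubernetes.io/hostname` topology key are all assumptions. Only `replicas: 2`, `minAvailable: 1`, and `whenUnsatisfiable: ScheduleAnyway` come from the description.

```yaml
# Sketch only: names, labels, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workspace-controller
spec:
  replicas: 2                      # was 1; the controller is stateless
  selector:
    matchLabels:
      app: workspace-controller
  template:
    metadata:
      labels:
        app: workspace-controller
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                            # assumed value
          topologyKey: kubernetes.io/hostname   # assumed: spread across nodes
          whenUnsatisfiable: ScheduleAnyway     # soft: single-node clusters still schedule both pods
          labelSelector:
            matchLabels:
              app: workspace-controller
      containers:
        - name: controller
          image: example.invalid/workspace-controller:dev   # placeholder image
---
# Separate document, mirroring the one-resource-per-file layout (templates/pdb.yaml).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: workspace-controller
spec:
  minAvailable: 1                  # a voluntary eviction may never take the last ready pod
  selector:
    matchLabels:
      app: workspace-controller    # must match the Deployment's pod labels
```

With two replicas and `minAvailable: 1`, a node drain can evict one pod at a time, and the eviction of the second is refused until a replacement is ready. A PDB only governs voluntary disruptions; surviving a crash depends on the second replica, not on the budget.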

Tests

helm-unittest coverage added for the new replicas / PDB / spread resources (controller pdb_test.yaml, controller oauth2-proxy_test.yaml, per-user proxy cases in ingress_public_test.yaml); the controller's old "single replica" assertion is updated to HA.
helm unittest charts/workspace/ charts/workspace-controller/ → 69 passed (was 63). Both charts helm lint clean. Manifests render correctly via helm template.
Change is Helm-only, so the SPA/server.py units in make test-all-units are unaffected.
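
As a rough idea of what the added coverage might look like, here is a minimal helm-unittest suite for the controller PDB. The suite name, test descriptions, and the exact assertions are assumptions and are not copied from the PR's pdb_test.yaml; `isKind` and `equal` are standard helm-unittest assertion types.

```yaml
# Hypothetical suite, e.g. charts/workspace-controller/tests/pdb_test.yaml
suite: controller PodDisruptionBudget
templates:
  - templates/pdb.yaml
tests:
  - it: renders a PodDisruptionBudget
    asserts:
      - isKind:
          of: PodDisruptionBudget
  - it: keeps at least one pod available during voluntary disruptions
    asserts:
      - equal:
          path: spec.minAvailable
          value: 1
```

A suite like this would be picked up by the `helm unittest charts/workspace-controller/` invocation quoted above.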

Acceptance criteria

oauth2-proxy + controller run 2 replicas with a PDB; a single-pod failure/drain doesn't lock users out.
Replicas spread across nodes where possible.
helm-unittest coverage for the new resources.

Closes #106

The controller and both oauth2-proxies ran replicas:1 with no PDBs, so a single pod crash or node drain locked users out and left the admin control plane with no redundancy.

- controller, controller oauth2-proxy, and per-user oauth2-proxy now run replicas:2 (all stateless) with a PodDisruptionBudget (minAvailable:1)
- topologySpreadConstraints (ScheduleAnyway) keep the replicas off one node while still scheduling on single-node clusters
- workspace pod stays replicas:1 (RWO PVC), as called out in the issue
- helm-unittest coverage for the new replicas/PDB/spread resources

Closes imran31415#106

imran31415 merged commit 947494a into imran31415:main Jun 20, 2026
7 checks passed

imran31415 merged 1 commit into imran31415:main from UTKARSH698:ha-replicas-pdb-spof
