infra: remove control-plane SPOFs — HA replicas + PodDisruptionBudgets#124

Merged
imran31415 merged 1 commit into imran31415:main from UTKARSH698:ha-replicas-pdb-spof on Jun 20, 2026

@UTKARSH698

Contributor

Summary

Removes the avoidable single points of failure called out in #106: the controller and both oauth2-proxies ran replicas: 1 with no PodDisruptionBudget, so a single pod crash or node drain locked users out and left the admin control plane with no redundancy.

Changes

  • Controller (charts/workspace-controller/templates/deployment.yaml): replicas: 2 + topologySpreadConstraints; new templates/pdb.yaml (minAvailable: 1). PDB lives in its own file to match this chart's one-resource-per-file layout (keeps deployment.yaml single-doc).
  • Controller oauth2-proxy (charts/workspace-controller/templates/oauth2-proxy.yaml): replicas: 2 + spread + appended PDB.
  • Per-user oauth2-proxy (charts/workspace/templates/oauth2-proxy.yaml): replicas: 2 + spread + PDB, all inside the existing ingress.auth.type=oauth2 guard.
  • All three are stateless, so scaling is safe. topologySpreadConstraints use whenUnsatisfiable: ScheduleAnyway (soft) so single-node clusters still schedule both replicas.
  • The workspace pod stays replicas: 1 (RWO PVC), as the issue specifies.
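
As a rough illustration of the shape these changes take once rendered, here is a minimal sketch of a two-replica Deployment with a soft spread constraint and its PodDisruptionBudget. This is not the chart's actual output: the resource name, the `app.kubernetes.io/name` label, the image, `maxSkew: 1`, and `topologyKey: kubernetes.io/hostname` are all assumptions made for the example. Only `replicas: 2`, `whenUnsatisfiable: ScheduleAnyway`, and `minAvailable: 1` come from the description above.

```yaml
# Illustrative rendered manifests, not copied from the chart.
# Names, labels, image, maxSkew, and topologyKey are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workspace-controller
spec:
  replicas: 2                      # was 1; the workload is stateless
  selector:
    matchLabels:
      app.kubernetes.io/name: workspace-controller
  template:
    metadata:
      labels:
        app.kubernetes.io/name: workspace-controller
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                            # assumed value
          topologyKey: kubernetes.io/hostname   # assumed: spread per node
          whenUnsatisfiable: ScheduleAnyway     # soft: still schedules on one node
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: workspace-controller
      containers:
        - name: controller
          image: example.invalid/workspace-controller:placeholder
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: workspace-controller
spec:
  minAvailable: 1                  # a voluntary drain may evict only one of the two pods
  selector:
    matchLabels:
      app.kubernetes.io/name: workspace-controller
```

With two replicas and `minAvailable: 1`, a node drain can evict one pod at a time while the other keeps serving. With `replicas: 1` the same budget would block the drain entirely, which is presumably why the workspace pod gets no PDB here.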

Tests

  • helm-unittest coverage added for the new replicas / PDB / spread resources (controller pdb_test.yaml, controller oauth2-proxy_test.yaml, per-user proxy cases in ingress_public_test.yaml); the controller's old "single replica" assertion is updated to HA.
  • helm unittest charts/workspace/ charts/workspace-controller/ → 69 passed (was 63). Both charts pass helm lint, and the manifests render correctly via helm template.
  • Change is Helm-only, so the SPA/server.py units in make test-all-units are unaffected.
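
For readers unfamiliar with helm-unittest, a suite covering these resources might look roughly like the sketch below. It is a hypothetical example, not the contents of pdb_test.yaml: the suite name, test names, and the JSON path into `topologySpreadConstraints` are assumptions. It also assumes both templates render with the chart's default values.

```yaml
# Hypothetical helm-unittest suite; not the test file added in this PR.
suite: controller HA resources
templates:
  - templates/deployment.yaml
  - templates/pdb.yaml
tests:
  - it: runs two replicas with a soft spread constraint
    template: templates/deployment.yaml
    asserts:
      - equal:
          path: spec.replicas
          value: 2
      - equal:
          path: spec.template.spec.topologySpreadConstraints[0].whenUnsatisfiable
          value: ScheduleAnyway
  - it: renders a PDB that keeps one pod available
    template: templates/pdb.yaml
    asserts:
      - isKind:
          of: PodDisruptionBudget
      - equal:
          path: spec.minAvailable
          value: 1
```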

Acceptance criteria

  • oauth2-proxy + controller run 2 replicas with a PDB; a single-pod failure/drain doesn't lock users out.
  • Replicas spread across nodes where possible.
  • helm-unittest coverage for the new resources.

Closes #106

The controller and both oauth2-proxies ran replicas:1 with no PDBs, so a
single pod crash or node drain locked users out and left the admin control
plane with no redundancy.

- controller, controller oauth2-proxy, and per-user oauth2-proxy now run
  replicas:2 (all stateless) with a PodDisruptionBudget (minAvailable:1)
- topologySpreadConstraints (ScheduleAnyway) keep the replicas off one node
  while still scheduling on single-node clusters
- workspace pod stays replicas:1 (RWO PVC), as called out in the issue
- helm-unittest coverage for the new replicas/PDB/spread resources

Closes imran31415#106
@imran31415 merged commit 947494a into imran31415:main on Jun 20, 2026
7 checks passed
Development

Successfully merging this pull request may close these issues.

infra: remove SPOFs — HA replicas + PodDisruptionBudgets for controller & oauth2-proxy
