AWS EKS layer (Terraform):
- VPC
- EKS cluster and EKS addons
- Karpenter
- AWS Load Balancer Controller IAM
- ArgoCD
- GitOps Bridge — cluster Secret + root Application
Core layer (ArgoCD at argocd/applications/core/):
- Traefik ingress controller
- AWS Load Balancer Controller (Helm release; IAM stays in TF)
- NVIDIA GPU Operator
- Grafana Mimir + Alloy — Monitoring
- Piraeus Operator for Linstor tests
MLOps layer (ArgoCD at argocd/applications/mlops/):
- JupyterLab (CUDA/LLM) — image built in-cluster via BuildKit Job, deploy via manual-sync Argo Application
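The GitOps Bridge hand-off above is the only seam between the two layers: Terraform writes a cluster Secret plus a root Application that points back at this repo. A minimal sketch of that root Application, assuming the repo URL used below and the default branch (the real manifest is generated by Terraform, so names and sync policy here are illustrative):

```bash
# Sketch only: the actual object is created by Terraform via the GitOps Bridge.
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/silazare/argocd-infra-example.git
    targetRevision: main          # assumed default branch
    path: argocd/applications
    directory:
      recurse: true               # discovers the core/ and apps/ subtrees
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
```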
- Terraform — creates VPC, EKS, Karpenter, ArgoCD, cluster Secret, root Application.
- ArgoCD picks up the root Application → recursively discovers
argocd/applications/core/ and argocd/applications/apps/.
- ApplicationSets materialise child Applications that install Traefik, ALB controller, etc. (see the sketch after this list).
- Traefik comes up, the ALB controller provisions an NLB, and you map the NLB IP in /etc/hosts.
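A minimal sketch of such an ApplicationSet, assuming a git directory generator over argocd/applications/core/ (name, branch, and sync policy are illustrative, not the repo's actual definitions):

```bash
# Illustrative only; the real ApplicationSets live under argocd/applications/.
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: core
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/silazare/argocd-infra-example.git
        revision: main
        directories:
          - path: argocd/applications/core/*
  template:
    metadata:
      name: '{{path.basename}}'     # one child Application per directory
    spec:
      project: default
      source:
        repoURL: https://github.com/silazare/argocd-infra-example.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
EOF
```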
```bash
cd terraform
terraform init -upgrade
terraform apply
```
During the first minutes the ArgoCD UI is not yet reachable via argocd.local.
Access the UI via port-forward:
```bash
k -n argocd port-forward svc/argocd-server 8080:80
# open http://localhost:8080
```
Resolve the Traefik NLB hostname to an IP:
```bash
k -n traefik get svc traefik \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' \
  | xargs dig +short
```
Pick any one of the returned IPs and add to /etc/hosts:
```
<IP> argocd.local grafana.local
```
Get the initial admin password:
```bash
k -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d; echo
```
Log in to the CLI and add the GitOps repo (if not public):
```bash
argocd login argocd.local:443
argocd repo add https://github.com/silazare/argocd-infra-example.git \
  --username silazare --password github_pat_xxxxx
argocd repo add ghcr.io --type helm --name stable --enable-oci
```
Teardown:
```bash
# delete ArgoCD root application
# remove stuck application sets
for kind in applications applicationsets; do
  for name in $(kubectl -n argocd get $kind -o name); do
    kubectl -n argocd patch $name --type=json \
      -p='[{"op":"remove","path":"/metadata/finalizers"}]' 2>/dev/null
  done
done
terraform destroy
```
References:
- https://medium.com/@sinan.ozel_23433/iac-for-generative-ai-llm-jupyterlab-on-kubernetes-a33d31841a27
- https://www.jimangel.io/posts/nvidia-rtx-gpu-kubernetes-setup/
The build runs as a Job in the buildkit namespace and pushes the image, with the layer cache stored in the jupyterlab-llm-cache repo. Update the branch/tag inside build-job.yaml if needed.
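For orientation, the push roughly corresponds to a buildctl invocation like the following (a sketch, not the actual Job command; <ECR_REGISTRY> and the image repo name are placeholders, the real flags live in build-job.yaml):

```bash
# Rough shape of the BuildKit build: push the image and keep the layer cache
# in the jupyterlab-llm-cache repo. Registry host and image name are assumed.
buildctl-daemonless.sh build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=<ECR_REGISTRY>/jupyterlab-llm:<NEW_TAG>,push=true \
  --export-cache type=registry,ref=<ECR_REGISTRY>/jupyterlab-llm-cache,mode=max \
  --import-cache type=registry,ref=<ECR_REGISTRY>/jupyterlab-llm-cache
```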
Edit the tag in these files (keep them in sync), then build + push + sync:
- mlops/jupyterlab-llm/build-job.yaml — --output=...:<NEW_TAG>
- argocd/manifests/jupyterlab-llm/jupyterlab-llm-pod.yaml — image: ...:<NEW_TAG>
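If you prefer not to edit by hand, a hypothetical one-liner to bump both files at once; the "jupyterlab-llm:" match is an assumption about the image reference string, so adjust it to whatever actually appears in your files:

```bash
# Hypothetical helper: rewrite the tag after "jupyterlab-llm:" in both files.
NEW_TAG=v2   # example value
sed -i "s|\(jupyterlab-llm:\)[^\"[:space:]]*|\1${NEW_TAG}|g" \
  mlops/jupyterlab-llm/build-job.yaml \
  argocd/manifests/jupyterlab-llm/jupyterlab-llm-pod.yaml
```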
```bash
k replace --force -f mlops/jupyterlab-llm/build-job.yaml
```
The Application is not auto-synced — the image must exist in ECR before the first sync. Trigger the sync manually:
```bash
argocd app sync jupyterlab-llm
```
Sandbox for a Piraeus / LINSTOR / DRBD persistent-storage stack. Settings live at argocd/helm-values/linstor-cluster/values.yaml.
Three placement modes, one StorageClass per replica count:
| Manifest | StorageClass | Placement | What it proves |
|---|---|---|---|
| mlops/hdd1-test-sts.yaml | linstor-hdd-1r (autoPlace=1) | 1 diskful replica on a storage node | Provisioning + ext4 + Retain reclaim works; PV survives Pod recreate on the same node |
| mlops/hdd2-test-sts.yaml | linstor-hdd-2r (autoPlace=2) | 2 diskful replicas across storage nodes | Synchronous DRBD replication; Pod can come back on either replica node |
| mlops/diskless-test-sts.yaml | linstor-hdd-2r | 2 diskful on storage NG + 1 diskless DRBD client on karpenter ubuntu node | Compute / storage separation pattern — the bare-metal target shape where GPU nodes mount data over the network from CPU storage nodes |
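For reference, a sketch of what a 2-replica class like linstor-hdd-2r plausibly looks like; the authoritative definitions come from the Helm values file above, and the storage-pool name here is a placeholder:

```bash
# Plausible shape of the 2-replica StorageClass; "hdd-pool" is an assumed name.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-hdd-2r
provisioner: linstor.csi.linbit.com
reclaimPolicy: Retain                    # PV survives Pod/PVC recreate
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  autoPlace: "2"                         # 2 diskful DRBD replicas
  storagePool: hdd-pool                  # placeholder pool name
  csi.storage.k8s.io/fstype: ext4
EOF
```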
```bash
# Satellites + storage pools
k -n piraeus-datastore exec deploy/linstor-controller -- linstor node list
k -n piraeus-datastore exec deploy/linstor-controller -- linstor storage-pool list

# Apply any of the test STS and watch the resource list
k apply -f mlops/hdd2-test-sts.yaml
k -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list

# Live DRBD state on a specific satellite
k -n piraeus-datastore get pod -l app.kubernetes.io/component=linstor-satellite -o wide
k -n piraeus-datastore exec <satellite-pod> -- drbdadm status
```