silazare/mlops-infra-example

Infra Components

AWS EKS layer (Terraform):

  • VPC
  • EKS cluster and EKS addons
  • Karpenter
  • AWS Load Balancer Controller IAM
  • ArgoCD
  • GitOps Bridge — cluster Secret + root Application
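
After terraform apply, the bridge objects can be sanity-checked directly; ArgoCD stores registered clusters as labelled Secrets, so (names vary per deployment):

k -n argocd get secrets -l argocd.argoproj.io/secret-type=cluster
k -n argocd get applications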

Core layer (ArgoCD at argocd/applications/core/):

  • Traefik ingress controller
  • AWS Load Balancer Controller (Helm release; IAM stays in TF)
  • NVIDIA GPU Operator
  • Grafana Mimir + Alloy — Monitoring
  • Piraeus Operator for LINSTOR tests

MLOps layer (ArgoCD at argocd/applications/mlops/):

  • JupyterLab (CUDA/LLM) — image built in-cluster via a BuildKit Job, deployed via a manual-sync Argo Application

Deployment

  1. Terraform — creates VPC, EKS, Karpenter, ArgoCD, cluster Secret, root Application.
  2. ArgoCD picks up the root Application → recursively discovers argocd/applications/core/ and argocd/applications/mlops/.
  3. ApplicationSets materialise child Applications that install Traefik, the ALB controller, etc. (see the sketch below).
  4. Traefik comes up, the ALB controller provisions an NLB, and you map the NLB IP in /etc/hosts.
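
A minimal sketch of what one of these ApplicationSets could look like (a git directory generator; the repoURL, names, and sync options are assumptions, and it is shown as a kubectl apply only for illustration; in practice the manifest sits in git and is discovered by the root Application):

kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: core                     # assumed name
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/silazare/mlops-infra-example.git   # assumed repo
        revision: HEAD
        directories:
          - path: argocd/applications/core/*   # one child Application per directory
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/silazare/mlops-infra-example.git   # assumed repo
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated: {}
        syncOptions:
          - CreateNamespace=true
EOF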

1. Terraform

cd terraform
terraform init -upgrade
terraform apply
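
Once the apply finishes, point kubectl at the new cluster (cluster name and region come from your Terraform variables; the placeholders below are not defined in this README):

aws eks update-kubeconfig --name <cluster-name> --region <region>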

2. Wait for ArgoCD to sync core platform

During the first minutes, the ArgoCD UI is not yet reachable via argocd.local. Access the UI via port-forward:

k -n argocd port-forward svc/argocd-server 8080:80
# open http://localhost:8080
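
To see when the core platform has converged, watch the Applications until they report Synced/Healthy:

k -n argocd get applications -w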

3. Map the NLB IP into /etc/hosts

k -n traefik get svc traefik \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' \
  | xargs dig +short

Pick any one of the returned IPs and add:

<IP>  argocd.local grafana.local
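
The same mapping can be scripted; a sketch, assuming sudo rights and that the NLB has finished provisioning:

NLB=$(k -n traefik get svc traefik -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
IP=$(dig +short "$NLB" | head -n1)
echo "$IP argocd.local grafana.local" | sudo tee -a /etc/hosts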

4. Retrieve ArgoCD admin password

k -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d; echo

Log in to the CLI and add the GitOps repo (if it is not public):

argocd login argocd.local:443

argocd repo add https://github.com/silazare/argocd-infra-example.git \
  --username silazare --password github_pat_xxxxx

argocd repo add ghcr.io --type helm --name stable --enable-oci
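
Verify that both repositories registered correctly:

argocd repo list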

Delete infrastructure

# delete ArgoCD root application
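# the root Application name is deployment-specific; <root-app> is a placeholder
kubectl -n argocd delete application <root-app> --wait=false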

# remove stuck application sets
for kind in applications applicationsets; do
  for name in $(kubectl -n argocd get "$kind" -o name); do
    kubectl -n argocd patch "$name" --type=json \
      -p='[{"op":"remove","path":"/metadata/finalizers"}]' 2>/dev/null
  done
done

terraform destroy

JupyterLab example with GPU

References:

  • https://medium.com/@sinan.ozel_23433/iac-for-generative-ai-llm-jupyterlab-on-kubernetes-a33d31841a27
  • https://www.jimangel.io/posts/nvidia-rtx-gpu-kubernetes-setup/

1. Build & push image — in-cluster with BuildKit rootless

The build runs as a Job in the buildkit namespace and pushes the image and layer cache to the jupyterlab-llm-cache repo. Update the branch/tag inside build-job.yaml if needed.

Edit the tag in the files (keeping them in sync), then build, push, and sync:

k replace --force -f mlops/jupyterlab-llm/build-job.yaml
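
To follow the build, tail the Job logs (the Job name is defined in build-job.yaml; <build-job> is a placeholder):

k -n buildkit logs -f job/<build-job>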

2. Deploy via Argo manual sync

The Application is not auto-synced, because the image must exist in ECR before the first sync. Trigger the sync manually:

argocd app sync jupyterlab-llm
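
Optionally block until the app reports healthy:

argocd app wait jupyterlab-llm --health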

Piraeus Operator tests for LINSTOR

Sandbox for a Piraeus / LINSTOR / DRBD persistent-storage stack. Settings live at argocd/helm-values/linstor-cluster/values.yaml.

Three placement modes, one StorageClass per replica count:

| Manifest | StorageClass | Placement | What it proves |
|---|---|---|---|
| mlops/hdd1-test-sts.yaml | linstor-hdd-1r (autoPlace=1) | 1 diskful replica on a storage node | Provisioning + ext4 + Retain reclaim work; PV survives Pod recreate on the same node |
| mlops/hdd2-test-sts.yaml | linstor-hdd-2r (autoPlace=2) | 2 diskful replicas across storage nodes | Synchronous DRBD replication; Pod can come back on either replica node |
| mlops/diskless-test-sts.yaml | linstor-hdd-2r | 2 diskful on storage NG + 1 diskless DRBD client on a Karpenter Ubuntu node | Compute/storage separation pattern: the bare-metal target shape where GPU nodes mount data over the network from CPU storage nodes |

Quick check

# Satellites + storage pools
k -n piraeus-datastore exec deploy/linstor-controller -- linstor node list
k -n piraeus-datastore exec deploy/linstor-controller -- linstor storage-pool list

# Apply any of the test STS and watch the resource list
k apply -f mlops/hdd2-test-sts.yaml
k -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list

# Live DRBD state on a specific satellite
k -n piraeus-datastore get pod -l app.kubernetes.io/component=linstor-satellite -o wide
k -n piraeus-datastore exec <satellite-pod> -- drbdadm status
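
To exercise the Retain / replication claims, delete the consumer Pod and watch it come back with the same PV (the Pod name depends on the StatefulSet name in the manifest; hdd2-test-0 is assumed):

k delete pod hdd2-test-0
k get pod -w
k -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list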
