KRATOS

Framework for Application-Aware GPU Scheduling in Kubernetes

KRATOS is a Kubernetes operator for studying application-aware GPU scheduling of CUDA workloads on heterogeneous clusters.

The framework does not replace Kubernetes or Volcano. It adds an intermediate decision layer that learns from previous executions, scores eligible nodes, and generates scheduling hints.

The current design goal is to let users describe CUDA workloads together with their scheduling requirements, such as GPU memory, compute capability, priority, replica count, and distributed constraints.

After an initial execution, the controller is expected to collect profiling information from NVIDIA Nsight Compute (for example, whether a kernel is compute-bound or memory-bound) and reuse that profile to score nodes for later runs, which makes the scheduling policy application-aware.
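
The scoring step described above is not implemented yet, so the following is only a minimal sketch of one way it could work. It assumes a stored profile records whether the workload is memory-bound, filters nodes by minimum compute capability and GPU memory, and then ranks the remaining nodes by memory bandwidth or compute throughput. Every type, field, and function name here (Node, Profile, Requirements, Rank) is hypothetical, and the node figures are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Node is a hypothetical view of a GPU node; the fields are illustrative
// and do not come from the KRATOS API.
type Node struct {
	Name              string
	ComputeCapability string // "major.minor", e.g. "8.0"
	FreeMemoryGiB     float64
	PeakTFLOPS        float64
	MemBandwidthGBs   float64
}

// Profile is a hypothetical summary of a previous profiled run.
type Profile struct {
	MemoryBound bool
}

// Requirements holds the hard constraints a node must satisfy.
type Requirements struct {
	MinComputeCapability string
	MinMemoryGiB         float64
}

// parseCC splits a "major.minor" compute capability into integers.
func parseCC(s string) (major, minor int, err error) {
	parts := strings.SplitN(s, ".", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("invalid compute capability %q", s)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, err
	}
	minor, err = strconv.Atoi(parts[1])
	return major, minor, err
}

// ccAtLeast reports whether have >= want. The comparison is numeric because
// a string comparison would order "10.0" before "7.0".
func ccAtLeast(have, want string) bool {
	hMajor, hMinor, err := parseCC(have)
	if err != nil {
		return false
	}
	wMajor, wMinor, err := parseCC(want)
	if err != nil {
		return false
	}
	return hMajor > wMajor || (hMajor == wMajor && hMinor >= wMinor)
}

// Rank drops ineligible nodes and sorts the rest, best candidate first.
func Rank(nodes []Node, req Requirements, p Profile) []Node {
	var eligible []Node
	for _, n := range nodes {
		if ccAtLeast(n.ComputeCapability, req.MinComputeCapability) &&
			n.FreeMemoryGiB >= req.MinMemoryGiB {
			eligible = append(eligible, n)
		}
	}
	// Memory-bound workloads prefer bandwidth; compute-bound prefer throughput.
	score := func(n Node) float64 {
		if p.MemoryBound {
			return n.MemBandwidthGBs
		}
		return n.PeakTFLOPS
	}
	sort.SliceStable(eligible, func(i, j int) bool {
		return score(eligible[i]) > score(eligible[j])
	})
	return eligible
}

func main() {
	nodes := []Node{
		{Name: "node-a", ComputeCapability: "7.5", FreeMemoryGiB: 16, PeakTFLOPS: 8, MemBandwidthGBs: 300},
		{Name: "node-b", ComputeCapability: "8.0", FreeMemoryGiB: 40, PeakTFLOPS: 19, MemBandwidthGBs: 1500},
		{Name: "node-c", ComputeCapability: "6.1", FreeMemoryGiB: 8, PeakTFLOPS: 11, MemBandwidthGBs: 480},
		{Name: "node-d", ComputeCapability: "8.6", FreeMemoryGiB: 24, PeakTFLOPS: 35, MemBandwidthGBs: 900},
	}
	req := Requirements{MinComputeCapability: "7.0", MinMemoryGiB: 4}
	for _, p := range []Profile{{MemoryBound: true}, {MemoryBound: false}} {
		var names []string
		for _, n := range Rank(nodes, req, p) {
			names = append(names, n.Name)
		}
		fmt.Printf("memoryBound=%t order=%s\n", p.MemoryBound, strings.Join(names, ", "))
	}
}
```

In this sketch node-c never appears in the ranking because its compute capability is below the requested minimum. A real policy would likely blend several metrics rather than switch on a single flag.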

Status

Planned integrations include:

Kubernetes for resource lifecycle management.
NVIDIA Nsight Compute and DCGM for CUDA profiling and GPU metrics.
Prometheus and Grafana for runtime observability.

Architecture

Getting Started

Clone the repository and run the local test suite:

git clone git@github.com:chirichexe/kratos.git
cd kratos
make test

The make test target generates Kubernetes manifests, regenerates deepcopy code, runs formatting and vet checks, downloads envtest binaries, and then runs the Go tests.

Install the CRD into the Kubernetes cluster selected by your current kubectl context:

make install

Run the controller locally against that cluster:

make run

In another terminal, create a sample CUDA workload:

kubectl apply -f config/samples/gpu_v1alpha1_cudaexperiment.yaml
kubectl get cudaexperiments.gpu.scheduler.io
kubectl get jobs,pods -l gpu.scheduler.io/experiment=cuda-vector-add
kubectl logs job/cuda-vector-add-execution

For a local GPU-enabled Kubernetes lab, see Local GPU Lab.

CUDAExperiment

Users describe a CUDA workload with one Kubernetes custom resource:

apiVersion: gpu.scheduler.io/v1alpha1
kind: CUDAExperiment
metadata:
  name: cuda-vector-add
spec:
  image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
  runtimeClassName: nvidia
  replicas: 1
  gpuRequired: 1
  minimumComputeCapability: "7.0"
  minimumMemory: 4Gi
  priority: normal
  profilingEnabled: true

The current controller creates one Kubernetes Job named <experiment-name>-execution. It sets the Job's NVIDIA GPU limit from gpuRequired, uses the configured runtime class, and records the Job name in the CUDAExperiment status.
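
The mapping from spec fields to Job settings can be sketched in plain Go without the Kubernetes client libraries. This is an illustration only: ExperimentSpec, JobPlan, and planJob are hypothetical names, not the types in api/v1alpha1, and the sketch assumes the GPU limit is expressed through the nvidia.com/gpu extended resource exposed by the NVIDIA device plugin.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuResourceName is the extended resource advertised by the NVIDIA device
// plugin. Using it here is an assumption about how the limit is expressed.
const gpuResourceName = "nvidia.com/gpu"

// ExperimentSpec is a hypothetical subset of the CUDAExperiment spec.
type ExperimentSpec struct {
	Image            string
	RuntimeClassName string
	GPURequired      int
}

// JobPlan holds the values the controller is described as deriving for the Job.
type JobPlan struct {
	Name             string
	Image            string
	RuntimeClassName string
	Limits           map[string]string
}

// planJob derives the Job name and GPU limit from the experiment.
func planJob(experimentName string, spec ExperimentSpec) JobPlan {
	return JobPlan{
		Name:             experimentName + "-execution",
		Image:            spec.Image,
		RuntimeClassName: spec.RuntimeClassName,
		Limits: map[string]string{
			gpuResourceName: strconv.Itoa(spec.GPURequired),
		},
	}
}

func main() {
	plan := planJob("cuda-vector-add", ExperimentSpec{
		Image:            "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0",
		RuntimeClassName: "nvidia",
		GPURequired:      1,
	})
	fmt.Printf("job=%s runtimeClass=%s limits=%v\n",
		plan.Name, plan.RuntimeClassName, plan.Limits)
}
```

For the sample resource, the derived name is the cuda-vector-add-execution Job that the kubectl logs command in Getting Started reads from.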

When profilingEnabled is true, the controller creates a profiled Job with two containers:

stage-workload is an initContainer that uses the experiment image and stages the CUDA executable into a shared volume.
profiling-runner is the controller-owned Nsight Compute container. It uses the Nsight Compute image, requests the GPU, launches the staged workload once under ncu, imports the generated .ncu-rep, and prints raw metrics in its logs.

The default profiling runner image is kratos-nsight-compute-poc:latest. Set KRATOS_NSIGHT_COMPUTE_IMAGE on the controller manager to use a registry image. For custom workload images, set spec.command[0] to the executable path inside the image. If command is omitted, the controller uses the NVIDIA sample path /cuda-samples/vectorAdd.

The longer-term roadmap for the operator covers profile lookup, cluster scoring, node-selection hints, Volcano submission, and profile updates after execution.

Development

Run the package tests:

make test

Regenerate Kubernetes assets after API or RBAC changes:

make manifests
make generate

Documentation

License

KRATOS is licensed under the Apache License 2.0.
