HelixServe

HelixServe is a mini LLM serving engine focused on runtime internals.

Language Split

HelixServe uses a hybrid implementation:

Python for server, scheduler, allocator, orchestration, and benchmarking
Triton for custom GPU kernel work (kernels/rmsnorm_triton.py)
CUDA C++ for a focused low-level op (cuda_ext/csrc/cu_seqlens*)
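For readers new to the term, cu_seqlens conventionally means the cumulative sequence-length offsets that variable-length attention kernels use to locate each sequence inside a packed batch. The pure-Python sketch below shows that prefix sum under this assumption. The name build_cu_seqlens is hypothetical and is not taken from the extension's source, whose signature and dtype handling may differ.

```python
from itertools import accumulate
from typing import List, Sequence


def build_cu_seqlens(seq_lens: Sequence[int]) -> List[int]:
    """Return cumulative offsets [0, l0, l0 + l1, ...] for packed sequences.

    Hypothetical helper for illustration only.
    """
    if any(n < 0 for n in seq_lens):
        raise ValueError("sequence lengths must be non-negative")
    # Sequence i occupies positions cu[i]:cu[i + 1] of the packed token axis.
    return [0, *accumulate(seq_lens)]
```

With per-sequence lengths 3, 1, and 4, the offsets are 0, 3, 4, 8, so the third sequence occupies packed positions 4 through 7.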

Scope (v1)

This project implements the six locked features:

Decoder-only model backend on one GPU (ToyDecoderBackend by default, optional HF backend).
Paged KV-cache allocator with fixed-size blocks.
Continuous batching scheduler.
Split prefill/decode with chunked prefill.
CUDA Graph replay on the steady-state decode path (Toy backend on CUDA).
One custom Triton kernel (kernels/rmsnorm_triton.py).
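As a rough illustration of the paged KV-cache idea in the list above, the sketch below hands out fixed-size blocks from a free list and tracks a per-sequence block table. It is a toy model under stated assumptions, not HelixServe's allocator: the class name PagedBlockAllocator, its methods, and the use of MemoryError are all hypothetical.

```python
from typing import Dict, List


class PagedBlockAllocator:
    """Toy fixed-size block allocator (illustrative, not the project's class)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> block ids
        self.num_tokens: Dict[str, int] = {}          # seq_id -> tokens stored

    def append_tokens(self, seq_id: str, n: int) -> List[int]:
        """Reserve room for n more tokens and return the block table."""
        held = len(self.block_tables.get(seq_id, []))
        total = self.num_tokens.get(seq_id, 0) + n
        needed = -(-total // self.block_size) - held  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks")
        table = self.block_tables.setdefault(seq_id, [])
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.num_tokens[seq_id] = total
        return table

    def free(self, seq_id: str) -> None:
        """Return every block owned by a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)
```

Because a sequence only claims a new block when its token count crosses a block boundary, unused capacity is limited to the tail of each sequence's last block.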

Repository Layout

server/: HTTP API and streaming.
engine/: runtime config, scheduler, request lifecycle, CUDA graph helper.
cache/: paged allocator and prefix cache.
model/: model backends and tokenizer.
kernels/: Triton kernel and kernel benchmark.
cuda_ext/: optional CUDA C++ extension for decode-time cu_seqlens building.
bench/: load generation and benchmark runner.
metrics/: Prometheus metrics registry.
deploy/: Dockerfile and GCP deployment scripts.
profiling/: Nsight helper scripts.
docs/: architecture and execution plan.
tests/: unit and async integration tests.
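To make the chunked prefill handled by the scheduler in engine/ more concrete, the sketch below splits a prompt into fixed-size token ranges. It assumes the common meaning of the term: a long prompt is prefilled in bounded chunks so that decode steps for other requests can run in between. The function name plan_prefill_chunks and the chunk size are hypothetical and may not match how the scheduler is written.

```python
from typing import List, Tuple


def plan_prefill_chunks(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt of prompt_len tokens into half-open [start, end) chunks.

    Hypothetical helper for illustration only.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Every chunk is full-sized except possibly the last one.
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]
```

A 10-token prompt with a chunk size of 4 yields the ranges (0, 4), (4, 8), and (8, 10), so no single step processes more than 4 prompt tokens.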

Quickstart

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Optional CUDA C++ extension build (Linux + CUDA toolkit):

HELIX_BUILD_CUDA_EXT=1 pip install -e ".[dev]"

Run server:

HELIX_USE_TOY_BACKEND=1 HELIX_DEVICE=cuda python -m uvicorn server.main:app --host 0.0.0.0 --port 8000

Send request:

curl -s http://127.0.0.1:8000/v1/completions \
  -H "content-type: application/json" \
  -d '{"prompt":"Explain paged KV-cache", "max_tokens":32}' | jq

Streaming request:

curl -N http://127.0.0.1:8000/v1/completions \
  -H "content-type: application/json" \
  -d '{"prompt":"Explain chunked prefill", "max_tokens":32, "stream":true}'
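The same requests can be issued from Python using only the standard library. The sketch below mirrors the curl calls above: same path, header, and JSON fields. The helper names build_completion_request and send are hypothetical, and because the response schema and streaming wire format are not documented here, the response body is returned unparsed.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    max_tokens: int = 32,
    stream: bool = False,
    base_url: str = "http://127.0.0.1:8000",
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl examples."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    if stream:
        payload["stream"] = True
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request to a running server and return the raw body text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `print(send(build_completion_request("Explain paged KV-cache")))` should print the same body that the first curl command pipes into jq.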

Benchmark

python -m bench.run_benchmark --url http://127.0.0.1:8000 --requests 200 --concurrency 16 --mode mixed --max-tokens 64 --stream

Suggested workloads:

--mode short
--mode long
--mode mixed
--mode repeated_prefix

Run the full live suite (short/long/mixed/repeated-prefix/burst) and save artifacts:

python -m bench.run_live_suite --url http://127.0.0.1:8000 --stream

Demo Video And Screenshots

Generated demo artifacts:

LinkedIn final cut (with voiceover): docs/assets/demo/final/helixserve_linkedin_final.mp4
Final cut timeline: docs/assets/demo/final/timeline.json
Final cut voiceover script: docs/assets/demo/final/voiceover_script.txt
Voiceover validation: docs/assets/demo/final/voiceover_validation.json
Video verification: docs/assets/demo/final/video_verification.json
MP4 demo video: docs/assets/demo/helixserve_demo.mp4
GIF preview: docs/assets/demo/helixserve_demo.gif
Asset manifest: docs/assets/demo/manifest.json

Regenerate all demo assets from a live endpoint:

.venv/bin/python scripts/generate_demo_assets.py --url http://34.136.218.176:8000

Build the final LinkedIn-ready narrated cut:

.venv/bin/python scripts/build_final_demo_video.py

Validate narration style (no first-person singular words):

.venv/bin/python scripts/check_product_voiceover.py --path docs/assets/demo/final/voiceover_script.txt

Validate final video technical specs:

.venv/bin/python scripts/verify_final_video.py --video docs/assets/demo/final/helixserve_linkedin_final.mp4

Triton Kernel Benchmark

python -m kernels.benchmark_rmsnorm --rows 4096 --cols 4096
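For reference, the operation being benchmarked can be written in a few lines of plain Python. The sketch below uses the standard RMSNorm definition: each element is scaled by the reciprocal root mean square of its row, then multiplied by a learned weight. The function name rmsnorm_reference and the eps default are assumptions, and whether the Triton kernel uses the same eps or accumulation precision is not stated here.

```python
import math
from typing import List, Sequence


def rmsnorm_reference(
    x: Sequence[float], weight: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """RMSNorm over one row: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

    Hypothetical reference for checking a kernel's output on small inputs.
    """
    if len(x) == 0 or len(x) != len(weight):
        raise ValueError("x and weight must be non-empty and equal in length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

A slow reference like this is mainly useful for comparing a kernel's output on a handful of rows within a floating-point tolerance.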

Profiling

Nsight Systems decode capture:

bash profiling/nsys_decode_capture.sh helixserve_decode

Nsight Compute kernel capture:

bash profiling/ncu_rmsnorm.sh helixserve_rmsnorm

Nsight Compute CUDA C++ extension capture:

bash profiling/ncu_cu_seqlens.sh helixserve_cu_seqlens

Metrics

Prometheus endpoint: GET /metrics
Engine stats endpoint: GET /stats

Key runtime metrics:

TTFT, ITL, E2E latency histograms
Throughput counters
KV utilization and fragmentation
Queue depth, active decode batch size, and active decode batched tokens
Prefix cache hit rate
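The latency metrics above can be derived from per-token timestamps. The sketch below assumes the usual expansions: TTFT is time to first token, ITL is inter-token latency, and E2E is end-to-end latency. The names LatencySummary and latency_summary are hypothetical, and the way HelixServe records these values internally may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LatencySummary:
    ttft: float       # request start -> first token
    itl: List[float]  # gaps between consecutive tokens
    e2e: float        # request start -> last token


def latency_summary(request_start: float, token_times: Sequence[float]) -> LatencySummary:
    """Compute TTFT, ITL, and E2E from monotonically increasing timestamps."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return LatencySummary(
        ttft=token_times[0] - request_start,
        itl=itl,
        e2e=token_times[-1] - request_start,
    )
```

For a request started at 0.0 s with tokens arriving at 0.5, 0.75, and 1.0 s, TTFT is 0.5 s, both ITL samples are 0.25 s, and E2E is 1.0 s.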

GCP Deployment

Create a G2/L4 VM (Deep Learning VM image family):

PROJECT_ID=<your-project> ZONE=us-central1-c bash deploy/gcp_create_g2_l4.sh

Deploy the container to the VM:

PROJECT_ID=<your-project> INSTANCE_NAME=<vm-name> bash deploy/deploy_to_gcp_vm.sh

Configure VM idle auto-shutdown (default: 30 minutes idle):

PROJECT_ID=<your-project> INSTANCE_NAME=<vm-name> bash deploy/setup_idle_shutdown_on_vm.sh

Disable auto-shutdown temporarily on the VM:

sudo touch /var/lib/helixserve/disable_idle_shutdown

Re-enable auto-shutdown:

sudo rm -f /var/lib/helixserve/disable_idle_shutdown

Tests

pytest -q
