HelixServe is a mini LLM serving engine focused on runtime internals.
HelixServe uses a hybrid implementation:
- Python for server, scheduler, allocator, orchestration, and benchmarking
- Triton for custom GPU kernel work (kernels/rmsnorm_triton.py)
- CUDA C++ for a focused low-level op (cuda_ext/csrc/cu_seqlens*)
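As a rough illustration of what a cu_seqlens op computes, the sketch below builds cumulative sequence offsets in plain Python. It assumes the common variable-length attention convention (a leading zero followed by running totals of per-request lengths); the function name `build_cu_seqlens` is hypothetical, and the actual layout used by the extension is not stated here.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative offsets [0, l0, l0+l1, ...] for a packed batch.

    Assumption: the leading-zero cumulative layout; not confirmed by this README.
    """
    if any(n < 0 for n in seq_lens):
        raise ValueError("sequence lengths must be non-negative")
    return [0, *accumulate(seq_lens)]


# Three requests holding 3, 1, and 4 tokens in one packed batch.
cu = build_cu_seqlens([3, 1, 4])
print(cu)  # [0, 3, 4, 8]

# Request i occupies positions cu[i]:cu[i + 1] in the packed token buffer.
print([(cu[i], cu[i + 1]) for i in range(len(cu) - 1)])
```

Because the array changes on every decode step as sequence lengths grow, building it on the GPU is a plausible target for a small dedicated kernel.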
This project implements the six locked features:
- Decoder-only model backend on one GPU (ToyDecoderBackend by default, optional HF backend).
- Paged KV-cache allocator with fixed-size blocks.
- Continuous batching scheduler.
- Split prefill/decode with chunked prefill.
- CUDA Graph replay on steady decode path (Toy backend on CUDA).
- One custom Triton kernel (kernels/rmsnorm_triton.py).
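To make the paged KV-cache idea from the list above concrete, here is a minimal sketch of a fixed-size block allocator. It is an illustration under stated assumptions, not HelixServe's allocator: the class name `PagedBlockAllocator`, its methods, and the free-list policy are all hypothetical.

```python
class PagedBlockAllocator:
    """Toy paged allocator: each request maps to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        # Reversed so that pop() hands out the lowest block ids first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))
        self.block_tables = {}  # request id -> physical block ids, in order
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_tokens(self, request_id, n):
        """Reserve room for n more tokens, allocating new blocks only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        total = self.num_tokens.get(request_id, 0) + n
        blocks_required = -(-total // self.block_size)  # ceiling division
        needed = blocks_required - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV blocks")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = total
        return table

    def free(self, request_id):
        """Return all blocks of a finished request to the free list."""
        self.free_blocks.extend(reversed(self.block_tables.pop(request_id, [])))
        self.num_tokens.pop(request_id, None)


alloc = PagedBlockAllocator(num_blocks=4, block_size=16)
print(alloc.append_tokens("a", 20))  # [0, 1]: 20 tokens need two 16-token blocks
print(alloc.append_tokens("a", 12))  # [0, 1]: 32 tokens still fit in two blocks
print(alloc.append_tokens("b", 1))   # [2]
alloc.free("a")
print(len(alloc.free_blocks))        # 3
```

Fixed-size blocks keep waste bounded to at most one partially filled block per request, which is the usual motivation for paging the cache rather than reserving a contiguous region per request.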
- server/: HTTP API and streaming.
- engine/: runtime config, scheduler, request lifecycle, CUDA graph helper.
- cache/: paged allocator and prefix cache.
- model/: model backends and tokenizer.
- kernels/: Triton kernel and kernel benchmark.
- cuda_ext/: optional CUDA C++ extension for decode-time cu_seqlens building.
- bench/: load generation and benchmark runner.
- metrics/: Prometheus metrics registry.
- deploy/: Dockerfile and GCP deployment scripts.
- profiling/: Nsight helper scripts.
- docs/: architecture and execution plan.
- tests/: unit and async integration tests.
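The scheduler's split prefill/decode with chunked prefill can be pictured as a per-step token budget: running decodes take one token each, and whatever budget remains is spent on prompt chunks. The sketch below is one possible policy under that assumption; `Request`, `plan_step`, `token_budget`, and `chunk_size` are hypothetical names and do not describe the engine's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_decode(self):
        return self.prefilled >= self.prompt_len


def plan_step(requests, token_budget, chunk_size):
    """Return [(request_id, phase, num_tokens)] for one engine step."""
    plan = []
    budget = token_budget
    # Decode first: each running sequence generates exactly one token.
    for r in requests:
        if r.in_decode and budget > 0:
            plan.append((r.request_id, "decode", 1))
            budget -= 1
    # Spend the remaining budget on prompt chunks of at most chunk_size tokens.
    for r in requests:
        if budget <= 0:
            break
        if not r.in_decode:
            n = min(chunk_size, r.prompt_len - r.prefilled, budget)
            plan.append((r.request_id, "prefill", n))
            r.prefilled += n
            budget -= n
    return plan


reqs = [Request("a", prompt_len=10), Request("b", prompt_len=3, prefilled=3)]
print(plan_step(reqs, token_budget=8, chunk_size=4))
# [('b', 'decode', 1), ('a', 'prefill', 4)]
```

Chunking a long prompt this way lets decode steps for other requests proceed every iteration instead of stalling behind one large prefill.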
Create a virtual environment and install:
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[dev]

Optional CUDA C++ extension build (Linux + CUDA toolkit):
HELIX_BUILD_CUDA_EXT=1 pip install -e .[dev]

Run server:
HELIX_USE_TOY_BACKEND=1 HELIX_DEVICE=cuda python -m uvicorn server.main:app --host 0.0.0.0 --port 8000

Send request:
curl -s http://127.0.0.1:8000/v1/completions \
  -H "content-type: application/json" \
  -d '{"prompt":"Explain paged KV-cache", "max_tokens":32}' | jq

Streaming request:
curl -N http://127.0.0.1:8000/v1/completions \
  -H "content-type: application/json" \
  -d '{"prompt":"Explain chunked prefill", "max_tokens":32, "stream":true}'

Run the benchmark:
python -m bench.run_benchmark --url http://127.0.0.1:8000 --requests 200 --concurrency 16 --mode mixed --max-tokens 64 --stream

Suggested workloads:
- --mode short
- --mode long
- --mode mixed
- --mode repeated_prefix
Run the full live suite (short/long/mixed/repeated-prefix/burst) and save artifacts:
python -m bench.run_live_suite --url http://127.0.0.1:8000 --stream

Generated demo artifacts:
- LinkedIn final cut (with voiceover): docs/assets/demo/final/helixserve_linkedin_final.mp4
- Final cut timeline: docs/assets/demo/final/timeline.json
- Final cut voiceover script: docs/assets/demo/final/voiceover_script.txt
- Voiceover validation: docs/assets/demo/final/voiceover_validation.json
- Video verification: docs/assets/demo/final/video_verification.json
- MP4 demo video: docs/assets/demo/helixserve_demo.mp4
- GIF preview: docs/assets/demo/helixserve_demo.gif
- Asset manifest: docs/assets/demo/manifest.json
Regenerate all demo assets from a live endpoint:
.venv/bin/python scripts/generate_demo_assets.py --url http://34.136.218.176:8000

Build the final LinkedIn-ready narrated cut:
.venv/bin/python scripts/build_final_demo_video.py

Validate narration style (no first-person singular words):
.venv/bin/python scripts/check_product_voiceover.py --path docs/assets/demo/final/voiceover_script.txt

Validate final video technical specs:
.venv/bin/python scripts/verify_final_video.py --video docs/assets/demo/final/helixserve_linkedin_final.mp4

Benchmark the RMSNorm kernel:
python -m kernels.benchmark_rmsnorm --rows 4096 --cols 4096

Nsight Systems decode capture:
bash profiling/nsys_decode_capture.sh helixserve_decode

Nsight Compute kernel capture:
bash profiling/ncu_rmsnorm.sh helixserve_rmsnorm

Nsight Compute CUDA C++ extension capture:
bash profiling/ncu_cu_seqlens.sh helixserve_cu_seqlens

- Prometheus endpoint: GET /metrics
- Engine stats endpoint: GET /stats
Key runtime metrics:
- TTFT, ITL, E2E latency histograms
- Throughput counters
- KV utilization and fragmentation
- Queue depth, active decode batch size, and active decode batched tokens
- Prefix cache hit rate
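For readers unfamiliar with the latency terms above, the sketch below derives them from per-token timestamps, using the usual definitions: TTFT is time to first token, ITL is the gap between consecutive tokens, and E2E is arrival to last token. The function `latency_metrics` is hypothetical; how HelixServe records these values internally is not described here.

```python
def latency_metrics(arrival, token_times):
    """Compute TTFT, per-gap ITL values, and E2E latency for one request.

    arrival: timestamp (seconds) when the request was received.
    token_times: timestamps (seconds) at which each output token was emitted.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - arrival
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    e2e = token_times[-1] - arrival
    return {"ttft": ttft, "itl": itl, "e2e": e2e}


m = latency_metrics(arrival=0.0, token_times=[0.5, 0.75, 1.0])
print(m)  # {'ttft': 0.5, 'itl': [0.25, 0.25], 'e2e': 1.0}
```

Each value would typically be observed into a histogram so that percentiles can be read from the metrics endpoint.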
Create G2/L4 VM (Deep Learning VM image family):
PROJECT_ID=<your-project> ZONE=us-central1-c bash deploy/gcp_create_g2_l4.sh

Deploy container to VM:
PROJECT_ID=<your-project> INSTANCE_NAME=<vm-name> bash deploy/deploy_to_gcp_vm.sh

Configure VM idle auto-shutdown (default: 30 minutes idle):
PROJECT_ID=<your-project> INSTANCE_NAME=<vm-name> bash deploy/setup_idle_shutdown_on_vm.sh

Disable auto-shutdown temporarily on the VM:
sudo touch /var/lib/helixserve/disable_idle_shutdown

Re-enable auto-shutdown:
sudo rm -f /var/lib/helixserve/disable_idle_shutdown

Run tests:
pytest -q





