Local LLM inference server built for speed. Custom kernels, speculative prefill & decoding.
Each optimization in our engine is for specific model family and hardware target.
Each one is self-contained with setup instructions and benchmark notes.
All speedups measured vs vendored llama.cpp (-fa 1, matching KV quant). Combined = geometric mean √(TTFT × decode) where both phases benched; otherwise the single-phase speedup. Drafters published on huggingface.co/Lucebox.
|
|
Reference target: RTX 3090 (Ampere sm_86) — all headline numbers. Other NVIDIA archs auto-detected by CMake / setup.py; AMD HIP backend separate (Strix Halo section).
| Arch | GPU | Min CUDA / ROCm | Status | Bench | |
|---|---|---|---|---|---|
![]() |
Ampere sm_86 |
RTX 3090, A-series | CUDA 12.0 | ✅ reference | megakernel · dflash |
![]() |
Blackwell sm_120 |
RTX 5090 | CUDA 12.8 | ✅ 205 tok/s, 4.84× | ↗ |
![]() |
Blackwell sm_121 |
DGX Spark / GB10 | CUDA 12.9 | ✅ megakernel NVFP4 | ↗ |
![]() |
Turing sm_75 |
RTX 2080 Ti | CUDA 12.0 | ✅ 53 tok/s DFlash | ↗ |
![]() |
Ada sm_89 |
RTX 40xx | CUDA 12.0 | 🟡 community WSL2 bench | ↗ |
| — | Blackwell sm_110 |
Jetson AGX Thor | CUDA 13.0 | 🟡 builds, unbenched | — |
![]() |
Volta sm_70 / Pascal sm_61 |
V100, P40 | CUDA 12.0 | 🟡 fallback paths, unbenched | — |
![]() |
RDNA3.5 gfx1151 |
Ryzen AI MAX+ 395 / Strix Halo | ROCm 6+ | ✅ 37 tok/s HIP | ↗ |
![]() |
RDNA3 gfx1100 |
Radeon RX 7900 XTX | ROCm 6+ | ✅ 50 tok/s HIP | ↗ |
server/ (DFlash) builds with CMake 3.18+ and --recurse-submodules for Luce-Org/llama.cpp@luce-dflash — no PyTorch needed. optimizations/megakernel/ is the only component requiring PyTorch 2.0+ (CUDAExtension links against torch C++ libs). Power-tune: sudo nvidia-smi -pl 220 (3090 sweet spot, re-sweep for other cards).
harness/ contains RTX 3090 client launchers and regression tests
for Lucebox server compatibility. Run Lucebox inside Claude Code, Codex,
OpenCode, Hermes, Pi, OpenClaw, or Open WebUI, or check if a server change
still works with those clients.
|
All launchers spawn the native C++ HTTP server (dflash_server). Override defaults via env vars:
DFLASH_SERVER_BIN=server/build/dflash_server \
MAX_CTX=32768 BUDGET=22 VERIFY_MODE=ddtree \
harness/clients/run_codex.shDefault: Qwen 3.6-27B Q4_K_M target + Lucebox Q8_0 DFlash drafter on RTX 3090. DDTree budget=22, TQ3_0 KV cache, sliding FA window 2048. OpenAI-compatible HTTP on :8000.
# build (CUDA 12+, CMake 3.18+)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub
cmake -B server/build -S server -DCMAKE_BUILD_TYPE=Release
cmake --build server/build --target dflash_server -j
# default weights (~18 GB)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir server/models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir server/models/draft/
# run (TQ3_0 KV auto-enabled; set =0 to disable)
DFLASH27B_KV_TQ3=1 \
./server/build/dflash_server server/models/Qwen3.6-27B-Q4_K_M.gguf \
--draft server/models/draft/dflash-draft-3.6-q8_0.gguf \
--ddtree --ddtree-budget 22 --fa-window 2048 --port 8000Core
| Flag | Default | Effect |
|---|---|---|
--draft <path> |
— | DFlash draft GGUF, required for speculative decode |
--port N |
8000 |
HTTP port |
--host H |
127.0.0.1 |
Bind address |
--max-ctx N |
auto-fit | KV cache size; oversizing slows prefill (FA stride over unused KV) |
--max-tokens N |
model-card | Generation cap |
--model-name S |
filename | OpenAI model field |
--chat-template-file <path> |
autodetect | Override Jinja template |
Decode (DFlash + DDTree)
| Flag | Default | Effect |
|---|---|---|
--ddtree |
off (chain) | Enable tree verify |
--ddtree-budget N |
22 |
Tree size. 22 on 3090 (default), 40 on 5090, re-sweep on GB10 |
--fa-window N |
2048 |
Sliding FA window; 0 = full attention |
--lazy-draft |
off | Defer draft load until first request |
Prefill compression (PFlash)
| Flag / env | Default | Effect |
|---|---|---|
--prefill-compression {off,auto,always} |
off |
When to score+compress the prompt |
--prefill-threshold N |
32000 |
Token threshold for auto |
--prefill-keep-ratio F |
0.05 |
Fraction of source tokens kept (0.02 @128K, 0.10 @32K) |
--prefill-drafter <gguf> |
required if on | Drafter weights (Qwen3-0.6B BF16 GGUF) |
--prefill-skip-park |
off | Keep drafter resident across requests (more VRAM, faster) |
DFLASH_FP_USE_BSA=1 |
0 |
Dispatch sparse FA through BSA (sm_80+); required for headline 10.4× |
DFLASH_FP_ALPHA=0.85 |
0.12 |
Block-selection threshold; higher = stricter = fewer K-blocks |
DFLASH_FP_PROFILE=1 |
0 |
Per-stage timing log |
KV cache
| Flag / env | Default | Effect |
|---|---|---|
--cache-type-k <t> / --cache-type-v <t> |
env-driven | Per-side quant override: f16,bf16,q4_0,q4_1,q5_0,q5_1,q8_0,tq3_0 |
DFLASH27B_KV_TQ3=1 |
(default) | Preset TQ3_0 K+V (3.5 bpv, fits 256K @ 24 GB) |
DFLASH27B_KV_Q4=1 |
off | Q4_0 K+V (4.5 bpv, legacy, ~128K ceiling) |
--prefix-cache-slots N |
— | Live prefix-cache slot count |
--kv-cache-dir <path> |
— | Persist prefix cache to disk |
--kv-cache-budget N |
— | On-disk cache size cap |
Thinking budget
| Flag | Default | Effect |
|---|---|---|
--think-max-tokens N |
model-card | Max tokens inside <think>…</think> |
--default-max-tokens N |
model-card | Default response cap |
--hard-limit-reply-budget N |
4096 |
Hard ceiling; injects </think> close near limit |
--reasoning-effort-{low,medium,high,x-high,max} N |
model-card | OpenAI-style effort tiers |
Multi-GPU / IPC
| Flag | Default | Effect |
|---|---|---|
--target-device <dev> |
cuda:0 |
Target backend (e.g. cuda:0, hip:0) |
--draft-device <dev> |
same as target | Draft backend; mixed backend needs --draft-ipc-bin |
--target-devices <list> / --target-layer-split |
single GPU | Layer-split target across GPUs |
--draft-ipc-bin <path> |
— | Out-of-process draft binary (mixed CUDA/HIP) |
--peer-access |
off | Enable P2P between target GPUs |
--chunk N |
backend default | Prefill ubatch size |
--no-cors |
CORS on | Disable CORS headers |
DFlash benchmarks → · DFlash blog → · PFlash benchmarks → · PFlash blog → · Per-machine quick starts (DGX Spark, Jetson Thor, HIP) →
Separate Python bench; 24 layers fused into one persistent CUDA dispatch. 413 tok/s decode, 21,347 prefill, 1.87 tok/J @220W vs llama.cpp BF16.
uv sync --extra megakernel
uv run --directory megakernel python final_bench.py| Method | Prefill pp520 | Decode tg128 | tok/J |
|---|---|---|---|
Megakernel @220W |
21,347 | 413 | 1.87 |
llama.cpp BF16 @350W |
11,247 | 267 | 0.76 |
| PyTorch HF | 7,578 | 108 | n/a |
Blackwell (RTX 5090, DGX Spark / GB10): auto-detected by setup; NVFP4 decode path lands ~194 tok/s on GB10. See optimizations/megakernel/README.md#blackwell-sm_120--sm_121a.
Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to extract real throughput from those chips doesn't.
General-purpose frameworks dominated the last decade because hand-tuning kernels per chip was too expensive to justify. One stack, decent on everything, great on nothing. Speculative decoding, speculative prefill, fused megakernels are the methods that turn idle silicon into 3-10× speedups, but they stay locked to BF16 weights on data-center GPUs.
AI-assisted development flips that calculus. Rewrites that took a quarter now fit in a release cycle. Lucebox ports those speculative methods down to quantized GGUF on consumer cards, one chip and one model family at a time. Apache 2.0 source, full writeup, reproducible benchmarks.
▮▮▮▮▮▮▮▮▮▮ HIP/CUDA kernel optimizations
▮▮▮▮▮▮▮▮▮▯ Speculative inference optimizations
▮▮▮▮▮▮▮▯▯▯ Support to new GPU/APU consumer cards
▮▮▮▮▮▮▮▯▯▯ Inference engine debugging
▮▮▮▮▮▮▯▯▯▯ Add new performance benchmarks
▮▮▮▮▮▯▯▯▯▯ Improvements for harnesses integration
@software{lucebox_2026,
title = {Fast LLM speculative inference server for specific consumer hardware.},
author = {Lucebox},
url = {https://github.com/Luce-Org/lucebox-hub},
year = {2026}
}- Discord: discord.gg/yHfswqZmJQ
- Website: lucebox.com
- Issues: github.com/Luce-Org/lucebox-hub/issues
- Blog: lucebox.com/blog













