Skip to content

Luce-Org/lucebox-hub

Repository files navigation

Lucebox

lucebox.com HuggingFace Discord Blog

Apache 2.0 CUDA 12+ HIP 7+ C++17

Local LLM inference server built for speed. Custom kernels, speculative prefill & decoding.
Each optimization in our engine is for specific model family and hardware target.


Inference Engine Optimizations

Each one is self-contained with setup instructions and benchmark notes.

Megakernel    DFlash 27B

PFlash speculative prefill


Supported Models & Drafters

All speedups measured vs vendored llama.cpp (-fa 1, matching KV quant). Combined = geometric mean √(TTFT × decode) where both phases benched; otherwise the single-phase speedup. Drafters published on huggingface.co/Lucebox.

Model Speedup
Qwen 3.5-0.8B (Megakernel) ~2×
Qwen 3.5-27B + DDTree 3.43×
Qwen 3.6-27B + PFlash ~5.6×
Qwen 3.6-27B + DDTree 4.84×
Laguna-XS.2 33B + PFlash 5.4× @128K
Qwen 3.5-27B HIP ~2.6×
Gemma-4-26B-A4B 1.31×
Drafter Phase
Qwen3.6-27B decode
gemma-4-26B-A4B decode
gemma-4-31B decode
Qwen3-0.6B prefill

Tested Machines (GPU/APU)

Reference target: RTX 3090 (Ampere sm_86) — all headline numbers. Other NVIDIA archs auto-detected by CMake / setup.py; AMD HIP backend separate (Strix Halo section).

Arch GPU Min CUDA / ROCm Status Bench
Ampere sm_86 RTX 3090, A-series CUDA 12.0 ✅ reference megakernel · dflash
Blackwell sm_120 RTX 5090 CUDA 12.8 ✅ 205 tok/s, 4.84×
Blackwell sm_121 DGX Spark / GB10 CUDA 12.9 ✅ megakernel NVFP4
Turing sm_75 RTX 2080 Ti CUDA 12.0 ✅ 53 tok/s DFlash
Ada sm_89 RTX 40xx CUDA 12.0 🟡 community WSL2 bench
Blackwell sm_110 Jetson AGX Thor CUDA 13.0 🟡 builds, unbenched
Volta sm_70 / Pascal sm_61 V100, P40 CUDA 12.0 🟡 fallback paths, unbenched
RDNA3.5 gfx1151 Ryzen AI MAX+ 395 / Strix Halo ROCm 6+ ✅ 37 tok/s HIP
RDNA3 gfx1100 Radeon RX 7900 XTX ROCm 6+ ✅ 50 tok/s HIP

server/ (DFlash) builds with CMake 3.18+ and --recurse-submodules for Luce-Org/llama.cpp@luce-dflash — no PyTorch needed. optimizations/megakernel/ is the only component requiring PyTorch 2.0+ (CUDAExtension links against torch C++ libs). Power-tune: sudo nvidia-smi -pl 220 (3090 sweet spot, re-sweep for other cards).

Quick Start On Harnesses

harness/ contains RTX 3090 client launchers and regression tests for Lucebox server compatibility. Run Lucebox inside Claude Code, Codex, OpenCode, Hermes, Pi, OpenClaw, or Open WebUI, or check if a server change still works with those clients.

Lucebox client harness experiments on RTX 3090

Client Launcher
Claude Code run_claude_code.sh
Codex run_codex.sh
OpenCode run_opencode.sh
Hermes run_hermes.sh
Pi run_pi.sh
OpenClaw run_openclaw.sh
Open WebUI run_openwebui.sh

All launchers spawn the native C++ HTTP server (dflash_server). Override defaults via env vars:

DFLASH_SERVER_BIN=server/build/dflash_server \
MAX_CTX=32768 BUDGET=22 VERIFY_MODE=ddtree \
harness/clients/run_codex.sh

Run the Server

Default: Qwen 3.6-27B Q4_K_M target + Lucebox Q8_0 DFlash drafter on RTX 3090. DDTree budget=22, TQ3_0 KV cache, sliding FA window 2048. OpenAI-compatible HTTP on :8000.

# build (CUDA 12+, CMake 3.18+)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub
cmake -B server/build -S server -DCMAKE_BUILD_TYPE=Release
cmake --build server/build --target dflash_server -j

# default weights (~18 GB)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir server/models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir server/models/draft/

# run (TQ3_0 KV auto-enabled; set =0 to disable)
DFLASH27B_KV_TQ3=1 \
./server/build/dflash_server server/models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft server/models/draft/dflash-draft-3.6-q8_0.gguf \
  --ddtree --ddtree-budget 22 --fa-window 2048 --port 8000

Server flags

Core

Flag Default Effect
--draft <path> DFlash draft GGUF, required for speculative decode
--port N 8000 HTTP port
--host H 127.0.0.1 Bind address
--max-ctx N auto-fit KV cache size; oversizing slows prefill (FA stride over unused KV)
--max-tokens N model-card Generation cap
--model-name S filename OpenAI model field
--chat-template-file <path> autodetect Override Jinja template

Decode (DFlash + DDTree)

Flag Default Effect
--ddtree off (chain) Enable tree verify
--ddtree-budget N 22 Tree size. 22 on 3090 (default), 40 on 5090, re-sweep on GB10
--fa-window N 2048 Sliding FA window; 0 = full attention
--lazy-draft off Defer draft load until first request

Prefill compression (PFlash)

Flag / env Default Effect
--prefill-compression {off,auto,always} off When to score+compress the prompt
--prefill-threshold N 32000 Token threshold for auto
--prefill-keep-ratio F 0.05 Fraction of source tokens kept (0.02 @128K, 0.10 @32K)
--prefill-drafter <gguf> required if on Drafter weights (Qwen3-0.6B BF16 GGUF)
--prefill-skip-park off Keep drafter resident across requests (more VRAM, faster)
DFLASH_FP_USE_BSA=1 0 Dispatch sparse FA through BSA (sm_80+); required for headline 10.4×
DFLASH_FP_ALPHA=0.85 0.12 Block-selection threshold; higher = stricter = fewer K-blocks
DFLASH_FP_PROFILE=1 0 Per-stage timing log

KV cache

Flag / env Default Effect
--cache-type-k <t> / --cache-type-v <t> env-driven Per-side quant override: f16,bf16,q4_0,q4_1,q5_0,q5_1,q8_0,tq3_0
DFLASH27B_KV_TQ3=1 (default) Preset TQ3_0 K+V (3.5 bpv, fits 256K @ 24 GB)
DFLASH27B_KV_Q4=1 off Q4_0 K+V (4.5 bpv, legacy, ~128K ceiling)
--prefix-cache-slots N Live prefix-cache slot count
--kv-cache-dir <path> Persist prefix cache to disk
--kv-cache-budget N On-disk cache size cap

Thinking budget

Flag Default Effect
--think-max-tokens N model-card Max tokens inside <think>…</think>
--default-max-tokens N model-card Default response cap
--hard-limit-reply-budget N 4096 Hard ceiling; injects </think> close near limit
--reasoning-effort-{low,medium,high,x-high,max} N model-card OpenAI-style effort tiers

Multi-GPU / IPC

Flag Default Effect
--target-device <dev> cuda:0 Target backend (e.g. cuda:0, hip:0)
--draft-device <dev> same as target Draft backend; mixed backend needs --draft-ipc-bin
--target-devices <list> / --target-layer-split single GPU Layer-split target across GPUs
--draft-ipc-bin <path> Out-of-process draft binary (mixed CUDA/HIP)
--peer-access off Enable P2P between target GPUs
--chunk N backend default Prefill ubatch size
--no-cors CORS on Disable CORS headers

DFlash benchmarks → · DFlash blog → · PFlash benchmarks → · PFlash blog → · Per-machine quick starts (DGX Spark, Jetson Thor, HIP) →


Run Megakernel Bench (Qwen 3.5-0.8B)

Separate Python bench; 24 layers fused into one persistent CUDA dispatch. 413 tok/s decode, 21,347 prefill, 1.87 tok/J @220W vs llama.cpp BF16.

uv sync --extra megakernel
uv run --directory megakernel python final_bench.py
Method Prefill pp520 Decode tg128 tok/J
Megakernel @220W 21,347 413 1.87
llama.cpp BF16 @350W 11,247 267 0.76
PyTorch HF 7,578 108 n/a

Setup → · Bench → · Blog →

Blackwell (RTX 5090, DGX Spark / GB10): auto-detected by setup; NVFP4 decode path lands ~194 tok/s on GB10. See optimizations/megakernel/README.md#blackwell-sm_120--sm_121a.


Why this exists

Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to extract real throughput from those chips doesn't.

General-purpose frameworks dominated the last decade because hand-tuning kernels per chip was too expensive to justify. One stack, decent on everything, great on nothing. Speculative decoding, speculative prefill, fused megakernels are the methods that turn idle silicon into 3-10× speedups, but they stay locked to BF16 weights on data-center GPUs.

AI-assisted development flips that calculus. Rewrites that took a quarter now fit in a release cycle. Lucebox ports those speculative methods down to quantized GGUF on consumer cards, one chip and one model family at a time. Apache 2.0 source, full writeup, reproducible benchmarks.

Lucebox local AI PC


Request for Contributions

  ▮▮▮▮▮▮▮▮▮▮    HIP/CUDA kernel optimizations
  ▮▮▮▮▮▮▮▮▮▯    Speculative inference optimizations
  ▮▮▮▮▮▮▮▯▯▯    Support to new GPU/APU consumer cards
  ▮▮▮▮▮▮▮▯▯▯    Inference engine debugging
  ▮▮▮▮▮▮▯▯▯▯    Add new performance benchmarks
  ▮▮▮▮▮▯▯▯▯▯    Improvements for harnesses integration

Citation

@software{lucebox_2026,
  title  = {Fast LLM speculative inference server for specific consumer hardware.},
  author = {Lucebox},
  url    = {https://github.com/Luce-Org/lucebox-hub},
  year   = {2026}
}

Community


Apache 2.0 · Lucebox.com