GitHub - Luce-Org/lucebox-hub: Fast LLM speculative inference server for consumer hardware.

Local LLM inference server built for speed. Custom kernels, speculative prefill & decoding.
Each optimization in our engine is for specific model family and hardware target.

Inference Engine Optimizations

Each one is self-contained with setup instructions and benchmark notes.

Supported Models & Drafters

All speedups measured vs vendored llama.cpp (-fa 1, matching KV quant). Combined = geometric mean √(TTFT × decode) where both phases benched; otherwise the single-phase speedup. Drafters published on huggingface.co/Lucebox.

Model	Speedup
Qwen 3.5-0.8B (Megakernel)	~2×
Qwen 3.5-27B + DDTree	3.43×
Qwen 3.6-27B + PFlash	~5.6×
Qwen 3.6-27B + DDTree	4.84×
Laguna-XS.2 33B + PFlash	5.4× @128K
Qwen 3.5-27B HIP	~2.6×
Gemma-4-26B-A4B	1.31×

Drafter	Phase
`Qwen3.6-27B`	decode
`gemma-4-26B-A4B`	decode
`gemma-4-31B`	decode
`Qwen3-0.6B`	prefill

Tested Machines (GPU/APU)

Reference target: RTX 3090 (Ampere sm_86) — all headline numbers. Other NVIDIA archs auto-detected by CMake / setup.py; AMD HIP backend separate (Strix Halo section).

	Arch	GPU	Min CUDA / ROCm	Status	Bench
	Ampere `sm_86`	RTX 3090, A-series	CUDA 12.0	✅ reference	megakernel · dflash
	Blackwell `sm_120`	RTX 5090	CUDA 12.8	✅ 205 tok/s, 4.84×	↗
	Blackwell `sm_121`	DGX Spark / GB10	CUDA 12.9	✅ megakernel NVFP4	↗
	Turing `sm_75`	RTX 2080 Ti	CUDA 12.0	✅ 53 tok/s DFlash	↗
	Ada `sm_89`	RTX 40xx	CUDA 12.0	🟡 community WSL2 bench	↗
—	Blackwell `sm_110`	Jetson AGX Thor	CUDA 13.0	🟡 builds, unbenched	—
	Volta `sm_70` / Pascal `sm_61`	V100, P40	CUDA 12.0	🟡 fallback paths, unbenched	—
	RDNA3.5 `gfx1151`	Ryzen AI MAX+ 395 / Strix Halo	ROCm 6+	✅ 37 tok/s HIP	↗
	RDNA3 `gfx1100`	Radeon RX 7900 XTX	ROCm 6+	✅ 50 tok/s HIP	↗

server/ (DFlash) builds with CMake 3.18+ and --recurse-submodules for Luce-Org/llama.cpp@luce-dflash — no PyTorch needed. optimizations/megakernel/ is the only component requiring PyTorch 2.0+ (CUDAExtension links against torch C++ libs). Power-tune: sudo nvidia-smi -pl 220 (3090 sweet spot, re-sweep for other cards).

Quick Start On Harnesses

harness/ contains RTX 3090 client launchers and regression tests for Lucebox server compatibility. Run Lucebox inside Claude Code, Codex, OpenCode, Hermes, Pi, OpenClaw, or Open WebUI, or check if a server change still works with those clients.

Client	Launcher
Claude Code	`run_claude_code.sh`
Codex	`run_codex.sh`
OpenCode	`run_opencode.sh`
Hermes	`run_hermes.sh`
Pi	`run_pi.sh`
OpenClaw	`run_openclaw.sh`
Open WebUI	`run_openwebui.sh`

All launchers spawn the native C++ HTTP server (dflash_server). Override defaults via env vars:

DFLASH_SERVER_BIN=server/build/dflash_server \
MAX_CTX=32768 BUDGET=22 VERIFY_MODE=ddtree \
harness/clients/run_codex.sh

Run the Server

Default: Qwen 3.6-27B Q4_K_M target + Lucebox Q8_0 DFlash drafter on RTX 3090. DDTree budget=22, TQ3_0 KV cache, sliding FA window 2048. OpenAI-compatible HTTP on :8000.

# build (CUDA 12+, CMake 3.18+)
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub
cmake -B server/build -S server -DCMAKE_BUILD_TYPE=Release
cmake --build server/build --target dflash_server -j

# default weights (~18 GB)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir server/models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir server/models/draft/

# run (TQ3_0 KV auto-enabled; set =0 to disable)
DFLASH27B_KV_TQ3=1 \
./server/build/dflash_server server/models/Qwen3.6-27B-Q4_K_M.gguf \
  --draft server/models/draft/dflash-draft-3.6-q8_0.gguf \
  --ddtree --ddtree-budget 22 --fa-window 2048 --port 8000

Server flags

Core

Flag	Default	Effect
`--draft <path>`	—	DFlash draft GGUF, required for speculative decode
`--port N`	`8000`	HTTP port
`--host H`	`127.0.0.1`	Bind address
`--max-ctx N`	auto-fit	KV cache size; oversizing slows prefill (FA stride over unused KV)
`--max-tokens N`	model-card	Generation cap
`--model-name S`	filename	OpenAI `model` field
`--chat-template-file <path>`	autodetect	Override Jinja template

Decode (DFlash + DDTree)

Flag	Default	Effect
`--ddtree`	off (chain)	Enable tree verify
`--ddtree-budget N`	`22`	Tree size. 22 on 3090 (default), 40 on 5090, re-sweep on GB10
`--fa-window N`	`2048`	Sliding FA window; `0` = full attention
`--lazy-draft`	off	Defer draft load until first request

Prefill compression (PFlash)

Flag / env	Default	Effect
`--prefill-compression {off,auto,always}`	`off`	When to score+compress the prompt
`--prefill-threshold N`	`32000`	Token threshold for `auto`
`--prefill-keep-ratio F`	`0.05`	Fraction of source tokens kept (0.02 @128K, 0.10 @32K)
`--prefill-drafter <gguf>`	required if on	Drafter weights (Qwen3-0.6B BF16 GGUF)
`--prefill-skip-park`	off	Keep drafter resident across requests (more VRAM, faster)
`DFLASH_FP_USE_BSA=1`	`0`	Dispatch sparse FA through BSA (sm_80+); required for headline 10.4×
`DFLASH_FP_ALPHA=0.85`	`0.12`	Block-selection threshold; higher = stricter = fewer K-blocks
`DFLASH_FP_PROFILE=1`	`0`	Per-stage timing log

KV cache

Flag / env	Default	Effect
`--cache-type-k <t>` / `--cache-type-v <t>`	env-driven	Per-side quant override: `f16,bf16,q4_0,q4_1,q5_0,q5_1,q8_0,tq3_0`
`DFLASH27B_KV_TQ3=1`	(default)	Preset TQ3_0 K+V (3.5 bpv, fits 256K @ 24 GB)
`DFLASH27B_KV_Q4=1`	off	Q4_0 K+V (4.5 bpv, legacy, ~128K ceiling)
`--prefix-cache-slots N`	—	Live prefix-cache slot count
`--kv-cache-dir <path>`	—	Persist prefix cache to disk
`--kv-cache-budget N`	—	On-disk cache size cap

Thinking budget

Flag	Default	Effect
`--think-max-tokens N`	model-card	Max tokens inside `<think>…</think>`
`--default-max-tokens N`	model-card	Default response cap
`--hard-limit-reply-budget N`	`4096`	Hard ceiling; injects `</think>` close near limit
`--reasoning-effort-{low,medium,high,x-high,max} N`	model-card	OpenAI-style effort tiers

Multi-GPU / IPC

Flag	Default	Effect
`--target-device <dev>`	`cuda:0`	Target backend (e.g. `cuda:0`, `hip:0`)
`--draft-device <dev>`	same as target	Draft backend; mixed backend needs `--draft-ipc-bin`
`--target-devices <list>` / `--target-layer-split`	single GPU	Layer-split target across GPUs
`--draft-ipc-bin <path>`	—	Out-of-process draft binary (mixed CUDA/HIP)
`--peer-access`	off	Enable P2P between target GPUs
`--chunk N`	backend default	Prefill ubatch size
`--no-cors`	CORS on	Disable CORS headers

DFlash benchmarks → · DFlash blog → · PFlash benchmarks → · PFlash blog → · Per-machine quick starts (DGX Spark, Jetson Thor, HIP) →

Run Megakernel Bench (Qwen 3.5-0.8B)

Separate Python bench; 24 layers fused into one persistent CUDA dispatch. 413 tok/s decode, 21,347 prefill, 1.87 tok/J @220W vs llama.cpp BF16.

uv sync --extra megakernel
uv run --directory megakernel python final_bench.py

Method	Prefill pp520	Decode tg128	tok/J
Megakernel `@220W`	21,347	413	1.87
llama.cpp BF16 `@350W`	11,247	267	0.76
PyTorch HF	7,578	108	n/a

Setup → · Bench → · Blog →

Blackwell (RTX 5090, DGX Spark / GB10): auto-detected by setup; NVFP4 decode path lands ~194 tok/s on GB10. See optimizations/megakernel/README.md#blackwell-sm_120--sm_121a.

Why this exists

Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to extract real throughput from those chips doesn't.

General-purpose frameworks dominated the last decade because hand-tuning kernels per chip was too expensive to justify. One stack, decent on everything, great on nothing. Speculative decoding, speculative prefill, fused megakernels are the methods that turn idle silicon into 3-10× speedups, but they stay locked to BF16 weights on data-center GPUs.

AI-assisted development flips that calculus. Rewrites that took a quarter now fit in a release cycle. Lucebox ports those speculative methods down to quantized GGUF on consumer cards, one chip and one model family at a time. Apache 2.0 source, full writeup, reproducible benchmarks.

Request for Contributions

  ▮▮▮▮▮▮▮▮▮▮    HIP/CUDA kernel optimizations
  ▮▮▮▮▮▮▮▮▮▯    Speculative inference optimizations
  ▮▮▮▮▮▮▮▯▯▯    Support to new GPU/APU consumer cards
  ▮▮▮▮▮▮▮▯▯▯    Inference engine debugging
  ▮▮▮▮▮▮▯▯▯▯    Add new performance benchmarks
  ▮▮▮▮▮▯▯▯▯▯    Improvements for harnesses integration

Citation

@software{lucebox_2026,
  title  = {Fast LLM speculative inference server for specific consumer hardware.},
  author = {Lucebox},
  url    = {https://github.com/Luce-Org/lucebox-hub},
  year   = {2026}
}

Community

Discord: discord.gg/yHfswqZmJQ
Website: lucebox.com
Issues: github.com/Luce-Org/lucebox-hub/issues
Blog: lucebox.com/blog

_{Apache 2.0 · Lucebox.com}

Name		Name	Last commit message	Last commit date
Latest commit History 805 Commits
.github/workflows		.github/workflows
assets		assets
docs/specs		docs/specs
harness		harness
optimizations		optimizations
scripts		scripts
server		server
share/model_cards		share/model_cards
thoughts		thoughts
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.python-version		.python-version
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Inference Engine Optimizations

Supported Models & Drafters

Tested Machines (GPU/APU)

Quick Start On Harnesses

Run the Server

Server flags

Run Megakernel Bench (Qwen 3.5-0.8B)

Why this exists

Request for Contributions

Citation

Community

About

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Inference Engine Optimizations

Supported Models & Drafters

Tested Machines (GPU/APU)

Quick Start On Harnesses

Run the Server

Server flags

Run Megakernel Bench (Qwen 3.5-0.8B)

Why this exists

Request for Contributions

Citation

Community

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages