Add calibration-free engine_kv module for inference-engine integration (C15a) by jagmarques · Pull Request #32 · jagmarques/nexusquant

jagmarques · 2026-06-15T17:51:59Z

Adds nexusquant/integrations/engine_kv.py: a clean KV-quant/dequant API for vLLM-style engines, plus a CPU unit test and GPU bench scaffold.

What it does

- quantize_kv(k, v, k_bits=3, v_bits=2) returns a PackedKV (int8 codes + fp16 per-head scales). It applies a Hadamard rotation first, then E8 nearest-point rounding, then packs the result as (lattice_point * 2).int8. No calibration, no fit step, no offline data.
- dequantize_kv(packed) reconstructs fp16 K/V via (code * 0.5) * scale: a scalar multiply only, with no codebook lookup.
- bpe_accounting(packed) reports logical bpe (K3V2 = 2.625 bpe, 6.1x vs fp16) and int8 storage bpe (8.125 bpe, ~2x).
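The rotate / snap / pack pipeline described above can be sketched in plain Python on a single head vector. This is an illustration under stated assumptions, not the code in engine_kv.py: it uses lists instead of tensors, the `half_range` parameter (how many lattice units the abs-max maps to) is hypothetical because the PR does not state how k_bits/v_bits set the range, and undoing the rotation on dequant is assumed here so the round trip closes. The E8 search is the standard "best of D8 and D8 + 1/2" decoder.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in y]  # with this scaling the transform is its own inverse


def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = [round(v) for v in x]
    if sum(r) % 2 != 0:
        # Parity is wrong: re-round the worst coordinate the other way.
        i = max(range(len(x)), key=lambda k: abs(x[k] - r[k]))
        r[i] += 1 if x[i] > r[i] else -1
    return r


def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2): try both cosets, keep the closer."""
    a = [float(v) for v in nearest_d8(x)]
    b = [v + 0.5 for v in nearest_d8([u - 0.5 for u in x])]
    da = sum((u - v) ** 2 for u, v in zip(x, a))
    db = sum((u - v) ** 2 for u, v in zip(x, b))
    return a if da <= db else b


def quantize_head(x, half_range=4.0):
    """Rotate, scale by abs-max, snap 8-dim blocks to E8, store 2 * point as ints.

    half_range is a hypothetical knob, not a parameter of the real module.
    """
    rot = fwht(x)
    absmax = max(abs(v) for v in rot)
    scale = absmax / half_range if absmax > 0 else 1.0
    codes = []
    for i in range(0, len(rot), 8):
        point = nearest_e8([v / scale for v in rot[i:i + 8]])
        codes.extend(int(round(2 * p)) for p in point)  # doubling makes half-integers integral
    return codes, scale


def dequantize_head(codes, scale):
    """(code * 0.5) * scale, then undo the rotation (assumed, see lead-in)."""
    return fwht([c * 0.5 * scale for c in codes])


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
codes, scale = quantize_head(x)
x_hat = dequantize_head(codes, scale)
rel_err = math.sqrt(
    sum((a - b) ** 2 for a, b in zip(x, x_hat)) / sum(a * a for a in x)
)
print(f"relative round-trip error: {rel_err:.4f}")
```

Doubling the lattice point is what lets both cosets share one integer container: integer points become even codes and half-integer points become odd codes, so dequantization needs only the multiply by 0.5 and the scale.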

CPU unit test (experiments/kaggle/nq_engine_kv/test_engine_kv.py, exit 0 verified):

(a) K3 round-trip relative error 0.1918 (bounds [0.01, 0.35]); V2 error 0.3864 (bounds [0.01, 0.60]).
(b) bpe_logical = 2.6250 (expected 2.6250, tolerance 0.05).
(c) 87.2% of k_codes differ from trivial round(k_fp32) — non-vacuous.
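The bpe figures in check (b) can be reproduced with a few lines of arithmetic. The sketch below is a guess at the accounting, not the body of bpe_accounting: it assumes K and V hold the same number of elements and that one fp16 scale is amortised over 128 elements (a hypothetical `head_dim`), which is the overhead that makes the reported numbers come out exactly.

```python
def bpe_accounting_sketch(k_bits=3, v_bits=2, head_dim=128, scale_bits=16, fp_bits=16):
    """Bits per element for a K/V pair, counting one scale per head_dim elements.

    head_dim=128 is an assumption chosen to match the reported 2.625 / 8.125 bpe.
    """
    scale_overhead = scale_bits / head_dim            # 16 / 128 = 0.125 bpe
    bpe_logical = (k_bits + v_bits) / 2 + scale_overhead
    bpe_int8 = 8 + scale_overhead                     # codes are stored one per int8
    return {
        "bpe_logical": bpe_logical,
        "ratio_logical": fp_bits / bpe_logical,
        "bpe_int8": bpe_int8,
        "ratio_int8": fp_bits / bpe_int8,
    }


acct = bpe_accounting_sketch()
print(acct)
```

The gap between the two ratios is the point of reporting both: the logical rate is what a bit-packed layout could reach, while the int8 container used here only roughly halves fp16 storage.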

The GPU bench scaffold is queued in experiments/kaggle/nq_engine_kv/ (kernel-metadata.json pinned to the same T4 docker SHA). It has not been pushed to Kaggle (quota exhausted).

Proof of work:

python3.11 experiments/kaggle/nq_engine_kv/test_engine_kv.py
[PASS] (a) K3 rel error: 0.1918 (expected ~0.19, bounds [0.01, 0.35])
[PASS] (a) V2 rel error: 0.3864 (expected ~0.39, bounds [0.01, 0.6])
[PASS] (b) bpe_logical: 2.6250 (expected 2.6250)
       ratio_logical: 6.10x  bpe_int8: 8.1250
[PASS] (c) non-vacuous: 87.2% codes differ from trivial identity
       max element diff after round-trip: 0.9512 (>1e-3 required)
All 3 assertions passed. exit 0.
EXIT_CODE: 0

secret-scan: SCANNER: gitleaks - clean (exit 0)
diff stat: 4 files changed, 526 insertions(+)

- nexusquant/integrations/engine_kv.py: quantize_kv/dequantize_kv/bpe_accounting API for inference-engine (vLLM-style) KV dtype integration. E8 lattice VQ with Hadamard rotation, per-head fp16 abs-max scale, integer-scale scalar dequant (no codebook lookup, no calibration step).
- experiments/kaggle/nq_engine_kv/test_engine_kv.py: CPU unit test, exits 0. Asserts round-trip error in expected range for K3V2, bpe_accounting returns 2.625 bpe, and quantization is non-vacuous (87% codes differ from identity).
- experiments/kaggle/nq_engine_kv/nq_engine_kv_bench.py + kernel-metadata.json: GPU bench scaffold for T4 quant/dequant latency sweep (push deferred, quota exhausted).

Bug 1 (hard): test_engine_kv.py had four "../" in the _ROOT path, landing one directory above the repo root. Reduced to three "../" so _ROOT resolves to the repo root; `python3.11 experiments/kaggle/nq_engine_kv/test_engine_kv.py` now exits 0 with no manual PYTHONPATH.

Bug 2 (soft): nq_engine_kv_bench.py used sys.path.insert(0, "/kaggle/working"), which fails on Kaggle because nothing installs the package there. Replaced with a subprocess pip-install of git+https://github.com/jagmarques/nexusquant.git@main, matching the pattern used by other Kaggle kernels in this repo.
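The off-by-one in Bug 1 is easy to check by counting directory levels. The sketch below is illustrative only: the checkout location /home/user/nexusquant and the helper `climb` are hypothetical, and the actual _ROOT expression in test_engine_kv.py is not shown in this PR text.

```python
import posixpath

# Hypothetical checkout location; only the in-repo part of the path comes from the PR.
TEST_FILE = "/home/user/nexusquant/experiments/kaggle/nq_engine_kv/test_engine_kv.py"


def climb(file_path, levels):
    """Go up `levels` directories from the directory that contains file_path."""
    start = posixpath.dirname(file_path)
    return posixpath.normpath(posixpath.join(start, *[".."] * levels))


print(climb(TEST_FILE, 3))  # nq_engine_kv -> kaggle -> experiments -> repo root
print(climb(TEST_FILE, 4))  # one level too far: the parent of the repo root
```

The test file sits three directories below the repo root, so three "../" from its own directory is the correct count; a fourth leaves the checkout, and the nexusquant import then fails unless PYTHONPATH is set by hand.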

jagmarques added 2 commits June 15, 2026 19:51

jagmarques wants to merge 2 commits into main from company/nexusquant-kv-engine-module
