Add calibration-free engine_kv module for inference-engine integration (C15a) #32

Draft
jagmarques wants to merge 2 commits into main from company/nexusquant-kv-engine-module

Adds nexusquant/integrations/engine_kv.py: a clean KV-quant/dequant API for vLLM-style engines, plus a CPU unit test and GPU bench scaffold.

What it does

  • quantize_kv(k, v, k_bits=3, v_bits=2) returns a PackedKV (int8 codes + fp16 per-head scales). It applies a Hadamard rotation, snaps to the nearest E8 lattice point, then packs the result as (lattice_point * 2).int8; doubling makes the half-integer E8 coordinates integral. No calibration, no fit step, no offline data.
  • dequantize_kv(packed) reconstructs fp16 K/V via (code * 0.5) * scale: a scalar multiply only, no codebook lookup.
  • bpe_accounting(packed) reports logical bpe (K3V2 = 2.625 bpe, 6.1x smaller than fp16) and int8 storage bpe (8.125 bpe, ~2x smaller).
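
The rotate / snap / pack steps listed above can be illustrated with a minimal NumPy sketch. It is not the code in engine_kv.py. The function names, the `half_range` parameter that sets the scale, the contiguous 8-element blocking, and the choice to undo the rotation on dequant are all assumptions made for illustration. The E8 search uses the standard decomposition of E8 into D8 and its half-integer coset D8 + 1/2.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]], dtype=np.float32)
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return (h / np.sqrt(n)).astype(np.float32)

def nearest_d8(x):
    """Nearest D8 point (integer vector with even coordinate sum) per row."""
    r = np.rint(x)
    err = x - r
    odd_rows = np.nonzero((r.sum(axis=-1) % 2) != 0)[0]
    # Fix parity by re-rounding the worst coordinate in the other direction.
    worst = np.argmax(np.abs(err), axis=-1)[odd_rows]
    r[odd_rows, worst] += np.where(err[odd_rows, worst] >= 0, 1.0, -1.0)
    return r

def nearest_e8(x):
    """Nearest E8 point: the closer of the D8 and D8 + 1/2 candidates."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    da = ((x - a) ** 2).sum(axis=-1, keepdims=True)
    db = ((x - b) ** 2).sum(axis=-1, keepdims=True)
    return np.where(da <= db, a, b)

def quantize(x, half_range=4.0):
    """x: (heads, dim) array. half_range is a hypothetical knob, not a real API."""
    h = hadamard(x.shape[-1])
    rot = x.astype(np.float32) @ h
    # Per-head abs-max scale, stored as fp16.
    scale = np.abs(rot).max(axis=-1, keepdims=True) / half_range
    scale = np.maximum(scale, 1e-4).astype(np.float16)
    blocks = (rot / scale.astype(np.float32)).reshape(-1, 8)
    lattice = nearest_e8(blocks).reshape(rot.shape)
    # E8 coordinates are integers or half-integers, so 2x is integral.
    return (lattice * 2).astype(np.int8), scale

def dequantize(codes, scale):
    """(code * 0.5) * scale, then undo the rotation (assumed here)."""
    rot = (codes.astype(np.float32) * 0.5) * scale.astype(np.float32)
    return (rot @ hadamard(codes.shape[-1]).T).astype(np.float16)
```

A round trip such as `dequantize(*quantize(x))` on a `(heads, dim)` array with `dim` a power of two and at least 8 returns an fp16 approximation of `x`. The error depends on `half_range`, which stands in for whatever mapping the module uses from `k_bits`/`v_bits` to lattice density.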

CPU unit test (experiments/kaggle/nq_engine_kv/test_engine_kv.py, exit 0 verified):

  • (a) K3 round-trip relative error 0.1918 (bounds [0.01, 0.35]); V2 error 0.3864 (bounds [0.01, 0.60]).
  • (b) bpe_logical = 2.6250 (expected 2.6250, tolerance 0.05).
  • (c) 87.2% of k_codes differ from trivial round(k_fp32) — non-vacuous.
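
The bpe figures checked in (b) can be reproduced with simple arithmetic if one assumes that each fp16 scale is shared by 128 elements (for example, a head dimension of 128). That group size is an assumption inferred from the 0.125 overhead, not something stated in this PR, and the helper name below is hypothetical.

```python
def bits_per_element(k_bits, v_bits, scale_bits=16, group=128):
    """Average bits per K/V element, plus the amortized cost of one scale.

    group=128 is an assumed number of elements sharing one fp16 scale.
    """
    return (k_bits + v_bits) / 2 + scale_bits / group

logical = bits_per_element(3, 2)   # K3V2 payload plus scale overhead
stored = bits_per_element(8, 8)    # the same codes held in int8 containers

print(f"logical: {logical} bpe, {16 / logical:.2f}x vs fp16")
print(f"int8:    {stored} bpe, {16 / stored:.2f}x vs fp16")
```

Under that assumption the payload averages 2.5 bits and the scale adds 16/128 = 0.125 bits, matching the 2.625 and 8.125 values reported by the test.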

The GPU bench scaffold is queued in experiments/kaggle/nq_engine_kv/ (kernel-metadata.json pinned to the same T4 docker SHA). It has not been pushed to Kaggle because the quota is exhausted.

Proof of work:

python3.11 experiments/kaggle/nq_engine_kv/test_engine_kv.py
[PASS] (a) K3 rel error: 0.1918 (expected ~0.19, bounds [0.01, 0.35])
[PASS] (a) V2 rel error: 0.3864 (expected ~0.39, bounds [0.01, 0.6])
[PASS] (b) bpe_logical: 2.6250 (expected 2.6250)
       ratio_logical: 6.10x  bpe_int8: 8.1250
[PASS] (c) non-vacuous: 87.2% codes differ from trivial identity
       max element diff after round-trip: 0.9512 (>1e-3 required)
All 3 assertions passed. exit 0.
EXIT_CODE: 0

secret-scan: SCANNER: gitleaks - clean (exit 0)
diff stat: 4 files changed, 526 insertions(+)

- nexusquant/integrations/engine_kv.py: quantize_kv/dequantize_kv/bpe_accounting
  API for inference-engine (vLLM-style) KV dtype integration. E8 lattice VQ
  with Hadamard rotation, per-head fp16 abs-max scale, integer-scale scalar
  dequant (no codebook lookup, no calibration step).
- experiments/kaggle/nq_engine_kv/test_engine_kv.py: CPU unit test, exits 0.
  Asserts round-trip error in expected range for K3V2, bpe_accounting returns
  2.625 bpe, and quantization is non-vacuous (87% codes differ from identity).
- experiments/kaggle/nq_engine_kv/nq_engine_kv_bench.py + kernel-metadata.json:
  GPU bench scaffold for T4 quant/dequant latency sweep (push deferred, quota
  exhausted).

Bug 1 (hard): test_engine_kv.py had four "../" in the _ROOT path, landing
one directory above the repo root. Reduced to three "../" so _ROOT resolves
to the repo root; `python3.11 experiments/kaggle/nq_engine_kv/test_engine_kv.py`
now exits 0 with no manual PYTHONPATH.
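
As an illustration of why three levels is the right count for a file at experiments/kaggle/nq_engine_kv/test_engine_kv.py, here is a small sketch. The helper name `repo_root` and the `/repo` prefix are hypothetical; the actual test file may compute _ROOT differently.

```python
import os

def repo_root(test_file, ups=3):
    """Walk `ups` directories up from the directory containing test_file.

    nq_engine_kv -> kaggle -> experiments -> repo root is three steps.
    """
    here = os.path.dirname(os.path.abspath(test_file))
    return os.path.normpath(os.path.join(here, *[".."] * ups))

example = "/repo/experiments/kaggle/nq_engine_kv/test_engine_kv.py"
print(repo_root(example, ups=3))  # ends at the repo directory
print(repo_root(example, ups=4))  # one level too far, as in the bug

# In a test script this would typically be followed by:
#   sys.path.insert(0, repo_root(__file__))
```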

Bug 2 (soft): nq_engine_kv_bench.py used sys.path.insert(0, "/kaggle/working")
which fails on Kaggle because nothing installs the package there. Replaced with
a subprocess pip-install of git+https://github.com/jagmarques/nexusquant.git@main,
matching the pattern used by other Kaggle kernels in this repo.
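
A minimal sketch of the subprocess pip-install pattern described above, assuming the kernel has network access. The helper names are hypothetical and the bench script may structure this differently.

```python
import subprocess
import sys

REPO_SPEC = "git+https://github.com/jagmarques/nexusquant.git@main"

def pip_install_cmd(spec=REPO_SPEC):
    """Build the pip command for the interpreter that is running the script."""
    return [sys.executable, "-m", "pip", "install", "--quiet", spec]

def ensure_installed(spec=REPO_SPEC):
    """Install the package; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_cmd(spec))

# In the bench script, call ensure_installed() before importing nexusquant.
```

Using `sys.executable -m pip` rather than a bare `pip` keeps the install tied to the interpreter that will later import the package.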