Training-free fix for KV cache INT4 failures. Norm separation + per-channel quantization. Qwen2-7B: 744× improvement (ΔPPL +238 → +0.32). 12 models, 124M–40B. 4 lines of PyTorch.
-
Updated
Apr 18, 2026 - Python
Training-free fix for KV cache INT4 failures. Norm separation + per-channel quantization. Qwen2-7B: 744× improvement (ΔPPL +238 → +0.32). 12 models, 124M–40B. 4 lines of PyTorch.
Training-free INT3 KV cache quantization: 5.09× compression, ~10 lines of Python, <5% WikiText-2 ΔPPL on 8 of 8 open-weight Transformers (GPT-J 2021 → Gemma-4 2026). No calibration, no codebook, no rotation, no adapter. +2.4% decode overhead with torch.compile (no custom CUDA).
Add a description, image, and links to the norm-separation topic page so that developers can more easily learn about it.
To associate your repository with the norm-separation topic, visit your repo's landing page and select "manage topics."