Entropy Coding Meets Quantization
📦 Extreme compression - Down to 2 bits/parameter, while retaining the full 8-bit dynamic range
🔬 Data-free - No calibration, no recovery training, works on any model out of the box
⚡ Fast - Compresses a 70B model on a single GPU in <10 minutes, as you load it from disk
🚀 Inference-ready - On-the-fly GPU decompression integrated into the inference pipeline
Standard post-training quantization (PTQ) couples compression rate to bit-width: 4-bit means exactly 16 unique values, 2-bit means only 4. Without great care, pushing to lower bit-widths directly destroys model quality because so few distinct weight values remain.
EntQuant breaks this coupling. It keeps weights in a high-precision 8-bit format (Float8 or Int8) but optimizes their distributions for low entropy, that is, it encourages weights to cluster around a small number of frequent values without discarding the ability to represent outliers. A GPU-accelerated ANS codec then losslessly compresses the resulting low-entropy distributions to far fewer bits per parameter than the original 8. This decouples numerical precision from storage cost: you get the expressiveness of 8-bit with the storage footprint of 2-4 bits.
At the extreme, EntQuant achieves effective bit rates down to ~2 bits per parameter while retaining far more unique weight values than a fixed 2-bit representation (see Table 1 in the paper). No calibration data or recovery training is required. The method works on any model out of the box, including instruction-tuned and reasoning models.
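To make the storage argument concrete, here is a small, self-contained illustration (independent of the EntQuant codebase): the Shannon entropy of a quantized weight tensor is roughly what a lossless entropy coder such as ANS can compress it down to, and 8-bit weights that cluster around a few frequent values carry far fewer than 8 bits of entropy.

```python
import torch

# Illustration only (not part of the library): Shannon entropy in bits/parameter
# of an 8-bit tensor. Low-entropy weights can be losslessly entropy-coded to far
# fewer bits than their 8-bit storage format.
def bits_per_parameter(q: torch.Tensor) -> float:
    _, counts = torch.unique(q, return_counts=True)
    p = counts.float() / q.numel()
    return float(-(p * p.log2()).sum())

uniform = torch.randint(-128, 128, (1_000_000,)).to(torch.int8)                     # uses all 256 values evenly
clustered = torch.randn(1_000_000).mul(8).round().clamp(-128, 127).to(torch.int8)   # peaked around zero
print(bits_per_parameter(uniform))    # ~8 bits: essentially incompressible
print(bits_per_parameter(clustered))  # noticeably lower, despite the full 8-bit range
```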
This work was developed at Merantix Momentum. If you are using it, please cite it.
📄 Float8@2bits: Entropy Coding Enables Data-Free Model Compression
Patrick Putzky*, Martin Genzel*, Mattes Mollenhauer, Sebastian Schulze, Thomas Wollmann, Stefan Dietzel
* equal contribution
Clone the repo and install dependencies with uv (Python >= 3.11, CUDA GPU required):
```bash
git clone https://github.com/merantix-momentum/entquant.git
cd entquant
uv sync
```

⚠️ EntQuant requires NVIDIA nvCOMP (tested with version 5.1.0) for the ANS compression backend. Set the `NVCOMP_ROOT` environment variable (or add it to a `.env` file) pointing to the extracted directory. A CUDA toolkit (`nvcc`) is also needed, since the backend is JIT-compiled on first use. You can pre-compile it with `uv run python scripts/compile_backend.py`.
This section walks you through `scripts/quickstart.py` step by step. You can run it directly with:

```bash
uv run python scripts/quickstart.py
```
Step 1 - Choose model and dtype:
```python
import torch

from entquant import EntQuantModel
from entquant.quantization.optimizer import SymmetricEntropyOptimizer, WrappedAbsmaxOptimizer

MODEL = "meta-llama/Llama-2-7b-hf"
DTYPE = torch.bfloat16
WEIGHT_QTYPE = "qfloat8"  # or "qint8"
```

`WEIGHT_QTYPE` selects the quantization format. `"qfloat8"` (FP8 E4M3) is the default; `"qint8"` is also supported.
Step 2 - Set the target bit rate:
```python
LAMBDA, LR = 3.9, 1.0     # ~ 4-bit
# LAMBDA, LR = 14.5, 1.0  # ~ 3-bit
# LAMBDA, LR = 58.0, 0.25 # ~ 2-bit
```

`LAMBDA` is the entropy regularization strength and `LR` is the optimizer learning rate. Together they control how aggressively the weight distribution is pushed toward low entropy (= higher compressibility). Higher `LAMBDA` = more compression. This is the key knob unique to EntQuant: unlike standard PTQ methods, the target bit rate is continuously tunable and not tied to a fixed integer bit-width. The above choices are robust across all models we tested, so the same `(LAMBDA, LR)` pair can be reused for different architectures and base model sizes.
Step 3 - Quantize + compress:
```python
model = EntQuantModel.from_pretrained(
    MODEL,
    quantize=True,
    compress=True,
    weight_qtype=WEIGHT_QTYPE,
    dtype=DTYPE,
    optimizer=SymmetricEntropyOptimizer(lr=LR, reg_param=LAMBDA),
    optimizer_fallback=WrappedAbsmaxOptimizer(),
)
```

`quantize=True` triggers block-streaming quantization & EntQuant optimization (the base model is never fully materialized in memory). `compress=True` additionally runs ANS compression. `optimizer_fallback` is used for fallback layers (e.g., layers with super weights that need simple 8-bit quantization).
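Since `EntQuantModel` wraps a standard HuggingFace model (see the design notes below), a quick smoke test might look like the following sketch. It assumes `generate()` is forwarded to the underlying causal LM and that the compressed model lives on a CUDA device:

```python
from transformers import AutoTokenizer

# Sketch only: assumes EntQuantModel forwards generate() to the wrapped
# HuggingFace causal LM and that the model sits on a CUDA device.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
inputs = tokenizer("Entropy coding enables", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```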
Step 4 - Save and reload:
```python
model.save_pretrained("artifacts/my-checkpoint")

# Later: load back with decompression
model = EntQuantModel.from_pretrained("artifacts/my-checkpoint", compress=True)
```

Checkpoints store the quantized (but not compressed) weights. ANS compression is performed on the fly when loading from disk; it is fast enough to add negligible overhead.
See `scripts/quickstart.py` for the full runnable script, which also supports multi-GPU execution, super weight detection, and inference benchmarks.
For batch experiments and reproducible configs, EntQuant provides a hydra-zen pipeline.
Basic command:
```bash
uv run python -m run.workflows.exec +experiment=entquant_fp8 cfg/model=llama2_7b
```

Useful Hydra CLI flags:

- `--help` to discover available config groups and overrides
- `--cfg job` to print the fully resolved config without running
- Override individual parameters on the CLI, e.g.: `cfg.entquant.optimizer.reg_param=14.5 cfg.entquant.optimizer.lr=1.0`
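Putting these together, a full run at the ~3-bit setting could look like `uv run python -m run.workflows.exec +experiment=entquant_fp8 cfg/model=llama2_7b cfg.entquant.optimizer.reg_param=14.5 cfg.entquant.optimizer.lr=1.0` (the experiment name, model config, and override keys are the ones shown above).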
See `run/conf/model.py` for the full list of available model configs. You can add new ones there if you like.
All commands to reproduce the paper results are collected in `scripts/commands.txt`. These use the hydra-zen pipeline described above.
```
entquant/
  model/          # EntQuantModel, block-streaming, checkpoint I/O
  quantization/   # Entropy optimizer to compute quantization scales (Float8/Int8)
  compression/    # ANS compression via nvCOMP, GPU decompression hooks
  super_weights/  # Super weight detection for fallback layer selection
  eval/           # Perplexity, lm-eval-harness, inference benchmarks
run/
  conf/           # Hydra-zen structured configs (model, entquant, eval, ...)
  workflows/      # Build, evaluation, experiment definitions, exec entry point
scripts/          # Standalone scripts (quickstart, commands, etc.)
tests/            # Basic integration tests (requires CUDA)
```
Design notes:
- `EntQuantModel` is the sole public entry point, following the PeftModel pattern (`nn.Module` + `PushToHubMixin`). It wraps a standard HuggingFace model and manages quantization, compression, and serialization.
- Block-streaming architecture: the full (uncompressed) model is never materialized in memory. Weights are quantized, compressed, and saved block-by-block.
- `optimum-quanto` provides the quantization primitives (`QLinear`, `WeightQBytesTensor`). EntQuant adds a custom low-entropy scale optimizer that minimizes the L1 norm as a differentiable proxy for Shannon entropy (see the sketch below).
- nvCOMP ANS compression with on-the-fly GPU decompression: compressed weights are stored as byte buffers and decompressed into a shared device buffer just before each block's forward pass.
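As a rough, self-contained sketch of the low-entropy scale optimization idea referenced above (not the actual `SymmetricEntropyOptimizer` implementation), one can trade reconstruction error against an L1 penalty on the quantized codes when choosing a scale:

```python
import torch

# Conceptual sketch only: choose a quantization scale that balances reconstruction
# error against an L1 penalty on the quantized codes, the L1 norm serving as a
# differentiable proxy for the codes' Shannon entropy.
def low_entropy_scale(w: torch.Tensor, reg_param: float, lr: float, steps: int = 200) -> torch.Tensor:
    log_scale = w.abs().max().log().clone().requires_grad_(True)
    opt = torch.optim.SGD([log_scale], lr=lr)
    for _ in range(steps):
        codes = w / log_scale.exp() * 127.0            # continuous int8-range codes
        hard = codes.round().clamp(-127, 127)
        codes = codes + (hard - codes).detach()        # straight-through rounding
        recon = ((codes / 127.0 * log_scale.exp() - w) ** 2).mean()
        proxy = codes.abs().mean()                     # L1 proxy for entropy
        loss = recon + reg_param * proxy
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_scale.exp().detach()

scale = low_entropy_scale(torch.randn(1024, 1024) * 0.02, reg_param=1e-4, lr=0.05)
```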
This repo now also contains an experimental, self-contained NVFP4 path under `entquant/quantization/nvfp4.py` and `entquant/quantization/nvfp4_optimizer.py`.
Scope of this prototype:
- Standard NVFP4 E2M1 codebook
- 16-weight block scaling with FP8 E4M3 encoded block scales
- Data-free EntQuant-style offline scale optimization
- Optional `entquant_soft` variant with a soft-code entropy regularizer
Minimal example:

```bash
uv run python scripts/quickstart_nvfp4.py
```

This path is currently tensor-level and experimental. It does not add runtime compression or checkpoint export integration yet.
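For orientation, a toy version of the codebook-plus-block-scale scheme described above (not the repo's implementation in `nvfp4.py`) could look like this:

```python
import torch

# Toy NVFP4-style block quantization: 16-weight blocks, E2M1 codebook,
# per-block scales rounded to FP8 E4M3. Illustrative only.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
CODEBOOK = torch.cat([-E2M1.flip(0), E2M1])  # symmetric set of representable values

def fake_nvfp4(w: torch.Tensor) -> torch.Tensor:
    blocks = w.reshape(-1, 16)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 6.0   # map block max onto the E2M1 max
    scales = scales.to(torch.float8_e4m3fn).to(w.dtype)                       # round each scale to FP8 E4M3
    idx = (blocks / scales).unsqueeze(-1).sub(CODEBOOK).abs().argmin(dim=-1)  # nearest codebook entry
    return (CODEBOOK[idx] * scales).reshape_as(w)

w = torch.randn(64, 128)
print((fake_nvfp4(w) - w).abs().mean())  # mean absolute quantization error
```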
Template-based final checkpoint export:
```bash
uv run python scripts/export_nvfp4_checkpoint.py \
    --full-precision-model-dir /path/to/Qwen3-4B \
    --template-nvfp4-dir /path/to/Qwen3-4B-NVFP4 \
    --output-dir /path/to/Qwen3-4B-NVFP4-entquant \
    --variant entquant_exact \
    --reg-param 0.05 \
    --device cpu
```

This exporter:
- reads full-precision weights from the source model
- copies config/tokenizer/layout from the template NVFP4 checkpoint
- re-quantizes selected NVFP4 weight tensors
- writes a final checkpoint directory plus `nvfp4_export_report.json`
Feel free to reach out to us via GitHub issues or email!
patrick.putzky at merantix-momentum dot com
martin.genzel at merantix-momentum dot com
This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.
When using or referring to this project, please cite our paper:
```bibtex
@article{putzky2026entquant,
  title   = {Float8@2bits: Entropy Coding Enables Data-Free Model Compression},
  author  = {Putzky, Patrick and Genzel, Martin and Mollenhauer, Mattes and Schulze, Sebastian and Wollmann, Thomas and Dietzel, Stefan},
  year    = {2026},
  journal = {Preprint arXiv:2601.22787}
}
```

We kindly acknowledge funding by the European Union - NextGenerationEU - and the German Federal Ministry for Economic Affairs and Energy in the project "Souveräne KI für Europa (SOOFI)" (grant no. 13IPC040H).

