Automated quantization benchmarking suite for GGUF, GPTQ, and TFLite models. Pulls a model from HuggingFace, generates multiple quantized variants, benchmarks them on your hardware, and outputs a Pareto frontier showing the best accuracy-to-speed tradeoff.
*(Screenshots: Auto-Quant-Tool UI)*
Supported quantization formats:

- GGUF (Q2 through Q8) — for llama.cpp / Ollama local inference (see the loading sketch below)
- GPTQ (INT4, INT8) — for GPU inference via gptqmodel
- TFLite (FP32, FP16, INT8) — for mobile deployment
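A GGUF variant produced by the tool can be loaded directly with llama-cpp-python. A minimal sketch, assuming a Q4_K_M file at an illustrative path (the filename below is hypothetical):

```python
# Assumes llama-cpp-python is installed (pip install llama-cpp-python).
from llama_cpp import Llama

# Load a Q4_K_M variant produced by the tool; the path is illustrative.
llm = Llama(model_path="outputs/gguf/Qwen2-0.5B-Q4_K_M.gguf", n_ctx=2048)

# Run a short completion to confirm the quantized model works.
out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```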
Quick start:

```bash
git clone --recurse-submodules https://github.com/YOUR_USERNAME/auto-quant-tool.git
cd auto-quant-tool
uv sync
python setup/install_backends.py
uv run python -m auto_quant_tool.cli ui
```

Then open http://localhost:7860 in your browser.

To run headless instead:

```bash
uv run python -m auto_quant_tool.cli run --config sample_llm.yaml
```

**Windows:**

```bash
uv sync
python setup/install_backends.py --backend cuda
```

Requires Visual C++ Build Tools for llama.cpp compilation. Download: https://visualstudio.microsoft.com/visual-cpp-build-tools/
GPTQ quantization requires a GPU with 16GB+ VRAM. For systems with less VRAM, use the Kaggle notebook: `notebooks/kaggle_gptq.ipynb`.
TFLite conversion is not supported on Windows. Use the Colab notebook instead: `notebooks/colab_tflite.ipynb`.
**macOS (Metal):**

```bash
uv sync
python setup/install_backends.py --backend metal
```

**Linux (CUDA):**

```bash
uv sync
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```

**CPU-only:**

```bash
uv sync
python setup/install_backends.py --backend cpu
```

Copy and edit a sample config:
```bash
cp sample_llm.yaml my_model.yaml
```

```yaml
model:
  source: huggingface              # or local
  id: Qwen/Qwen2-0.5B
  modality: llm                    # llm | vision | audio
quantize:
  formats: [gguf, gptq]
  gguf_levels: [Q2_K, Q4_K_M, Q5_0, Q8_0]
  gptq_levels: [int4]
benchmark:
  metrics: [perplexity, tok_s]
  full_mmlu: false
  soc_target: snapdragon_8_gen_3   # for TFLite sim benchmark
dataset:
  name: wikitext
  split: test
  source: hf_datasets
```
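The config is plain YAML, so it is easy to inspect or pre-validate before a long run. A minimal sketch using PyYAML; this is illustrative only, not the tool's actual loader, and assumes the field names shown above:

```python
# Read and sanity-check the config with PyYAML (pip install pyyaml).
import yaml

with open("my_model.yaml") as f:
    cfg = yaml.safe_load(f)

# Field names taken from the sample config shown above.
assert cfg["model"]["modality"] in {"llm", "vision", "audio"}
assert set(cfg["quantize"]["formats"]) <= {"gguf", "gptq", "tflite"}
print(cfg["quantize"]["gguf_levels"])  # ['Q2_K', 'Q4_K_M', 'Q5_0', 'Q8_0']
```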
Pipeline outputs land in `outputs/`:

```
outputs/
├── models/      # cached HF model weights
├── gguf/        # GGUF quantized files per model
├── gptq/        # GPTQ quantized files per model
├── tflite/      # TFLite converted files per model
├── results/     # benchmark CSVs, unified JSON, Pareto HTML/PNG
└── best_model/  # knee-point model files copied here
```
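`results/` includes the Pareto frontier plot. For reference, this is the underlying idea: a variant sits on the frontier if no other variant is both faster and at least as accurate. A minimal, hypothetical sketch (the function name and all numbers are invented for illustration, not taken from the tool):

```python
def pareto_frontier(points):
    """points: list of (name, perplexity, tok_s); lower ppl and higher tok/s are better."""
    frontier = []
    for name, ppl, tps in points:
        # A point is dominated if some other point is no worse on both
        # axes and strictly better on at least one.
        dominated = any(
            o_ppl <= ppl and o_tps >= tps and (o_ppl < ppl or o_tps > tps)
            for _, o_ppl, o_tps in points
        )
        if not dominated:
            frontier.append((name, ppl, tps))
    return sorted(frontier, key=lambda p: p[1])

# Hypothetical benchmark numbers for four GGUF levels.
print(pareto_frontier([
    ("Q2_K", 14.1, 95.0), ("Q4_K_M", 9.8, 80.0),
    ("Q5_0", 9.5, 60.0), ("Q8_0", 9.3, 40.0),
]))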
Notebooks:

- `notebooks/kaggle_gptq.ipynb` — GPTQ quantization on Kaggle T4 (16GB VRAM)
- `notebooks/colab_tflite.ipynb` — TFLite conversion on Google Colab
Hardware requirements:

| Task | Minimum | Recommended |
|---|---|---|
| GGUF conversion | 8GB RAM | 16GB RAM |
| GGUF inference (7B Q4) | 8GB RAM | 16GB RAM + any GPU |
| GPTQ quantization (7B) | 16GB VRAM | A100 40GB |
| TFLite conversion | CPU only | CPU only |
| Simulated benchmark | CPU only | CPU only |
Known limitations:

- TFLite conversion not supported on Windows (use Colab notebook)
- GPTQ requires 16GB+ VRAM (use Kaggle notebook for smaller GPUs)
- Perplexity measured on a short fixed corpus — use `--full-mmlu` for task-based accuracy (slower)
- TurboQuant (KV cache quantization) deferred to v2
License: Apache 2.0