Auto-Quant-Tool

Automated quantization benchmarking suite for GGUF, GPTQ, and TFLite models. It pulls a model from Hugging Face, generates multiple quantized variants, benchmarks them on your hardware, and outputs a Pareto frontier showing the best accuracy-to-speed tradeoffs.

Screenshots

Auto-Quant-Tool UI

Supported formats

  • GGUF (Q2 through Q8) — for llama.cpp / Ollama local inference
  • GPTQ (INT4, INT8) — for GPU inference via gptqmodel
  • TFLite (FP32, FP16, INT8) — for mobile deployment
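
For example, once a GGUF variant has been produced, it can be sanity-checked locally with llama-cpp-python. This is a minimal sketch: the model file name and prompt are illustrative, only the outputs/gguf/ directory comes from the tool's output layout.

from llama_cpp import Llama

# Load one of the quantized variants produced by the tool (file name illustrative).
llm = Llama(model_path="outputs/gguf/qwen2-0.5b-Q4_K_M.gguf", n_ctx=2048)

# Run a short completion to confirm the quantized model still behaves sensibly.
result = llm("Explain quantization in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])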

Quick start

1. Clone the repo

git clone --recurse-submodules https://github.com/YOUR_USERNAME/auto-quant-tool.git
cd auto-quant-tool

2. Base install (all platforms)

uv sync

3. Install the hardware backend (run once, auto-detects your system)

python setup/install_backends.py

4. Launch the web UI

uv run python -m auto_quant_tool.cli ui

Then open http://localhost:7860 in your browser.

5. Or run via CLI

uv run python -m auto_quant_tool.cli run --config sample_llm.yaml

Installation by platform

Windows + NVIDIA GPU

uv sync
python setup/install_backends.py --backend cuda

Requires Visual C++ Build Tools for llama.cpp compilation. Download: https://visualstudio.microsoft.com/visual-cpp-build-tools/

GPTQ quantization requires a GPU with 16GB+ VRAM. For systems with less VRAM, use the Kaggle notebook: notebooks/kaggle_gptq.ipynb

TFLite conversion is not supported on Windows. Use the Colab notebook instead: notebooks/colab_tflite.ipynb

macOS (Apple Silicon)

uv sync
python setup/install_backends.py --backend metal

Linux + NVIDIA GPU

uv sync
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

CPU only (any OS)

uv sync
python setup/install_backends.py --backend cpu

Configuration

Copy and edit a sample config:

cp sample_llm.yaml my_model.yaml

Then edit my_model.yaml:

model:
  source: huggingface       # or local
  id: Qwen/Qwen2-0.5B
  modality: llm             # llm | vision | audio

quantize:
  formats: [gguf, gptq]
  gguf_levels: [Q2_K, Q4_K_M, Q5_0, Q8_0]
  gptq_levels: [int4]

benchmark:
  metrics: [perplexity, tok_s]
  full_mmlu: false
  soc_target: snapdragon_8_gen_3    # for TFLite sim benchmark
  dataset:
    name: wikitext
    split: test
    source: hf_datasets
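
With the config saved, point the CLI at it using the same run command as in the quick start:

uv run python -m auto_quant_tool.cli run --config my_model.yaml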

Output structure

outputs/
├── models/          # cached HF model weights
├── gguf/            # GGUF quantized files per model
├── gptq/            # GPTQ quantized files per model
├── tflite/          # TFLite converted files per model
├── results/         # benchmark CSVs, unified JSON, Pareto HTML/PNG
└── best_model/      # knee-point model files copied here
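
The Pareto frontier and knee point can also be recomputed from the benchmark results by hand. The sketch below is illustrative rather than the tool's own selection code: the JSON file name and the variant/perplexity/tok_s field names are assumptions, and the knee point is picked with a simple distance-to-ideal heuristic.

import json

# Load the unified benchmark results (file and field names assumed for illustration).
with open("outputs/results/unified_results.json") as f:
    runs = json.load(f)  # e.g. [{"variant": "Q4_K_M", "perplexity": 8.2, "tok_s": 41.3}, ...]

def dominates(b, a):
    # b dominates a if it is no worse on both metrics and strictly better on one
    # (lower perplexity = more accurate, higher tok_s = faster).
    return (b["perplexity"] <= a["perplexity"] and b["tok_s"] >= a["tok_s"]
            and (b["perplexity"] < a["perplexity"] or b["tok_s"] > a["tok_s"]))

# A variant is on the Pareto frontier if no other variant dominates it.
frontier = [r for r in runs if not any(dominates(o, r) for o in runs)]

# Knee-point heuristic: the frontier variant closest to the ideal corner
# (lowest perplexity, highest tok/s) after min-max normalisation.
ppl = [r["perplexity"] for r in frontier]
tps = [r["tok_s"] for r in frontier]
def norm(x, lo, hi):
    return 0.0 if hi == lo else (x - lo) / (hi - lo)
knee = min(
    frontier,
    key=lambda r: norm(r["perplexity"], min(ppl), max(ppl)) ** 2
    + (1.0 - norm(r["tok_s"], min(tps), max(tps))) ** 2,
)
print("Knee-point variant:", knee["variant"])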

Notebooks

  • notebooks/kaggle_gptq.ipynb — GPTQ quantization on Kaggle T4 (16GB VRAM)
  • notebooks/colab_tflite.ipynb — TFLite conversion on Google Colab

Hardware requirements

Task                      Minimum      Recommended
GGUF conversion           8GB RAM      16GB RAM
GGUF inference (7B Q4)    8GB RAM      16GB RAM + any GPU
GPTQ quantization (7B)    16GB VRAM    A100 40GB
TFLite conversion         CPU only     CPU only
Simulated benchmark       CPU only     CPU only

Known limitations

  • TFLite conversion not supported on Windows (use Colab notebook)
  • GPTQ requires 16GB+ VRAM (use Kaggle notebook for smaller GPUs)
  • Perplexity is measured on a short fixed corpus (see the sketch after this list); use --full-mmlu for task-based accuracy (slower)
  • TurboQuant (KV cache quantization) deferred to v2
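
For reference, the perplexity metric is the exponentiated mean negative log-likelihood over the evaluation corpus. A minimal sketch of that calculation (the per-token log-probabilities would come from the model under test):

import math

def perplexity(token_logprobs):
    # token_logprobs: natural-log probability the model assigned to each corpus token.
    nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 100))  # -> 4.0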

License

Apache 2.0
