Currently, llama-quantize is tightly coupled with specific model architectures. This requires frequent updates to the codebase whenever a new model type (new UNets, TTS models, or even custom architectures) is introduced to the GGUF ecosystem.
I've experimented with a patch that allows llama-quantize to process any valid GGUF file. Instead of relying on hardcoded rules, it uses an external tensor-type file (which can be generated via a Python script) to define quantization strategies for specific tensors (e.g., keeping sensitive layers like noise_refiner at BF16).
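To make the idea concrete, here is a minimal sketch of what generating such a tensor-type file could look like. The file format (`name=TYPE` lines), the rule set, and the function names are all illustrative assumptions, not the actual format used in the PoC:

```python
# Hypothetical sketch: map tensor-name regexes to quantization types and
# emit a plain-text tensor-type file. The format and rules here are
# assumptions for illustration, not the PoC's real format.
import re

# First matching regex wins; anything unmatched falls back to the default.
RULES = [
    (r"noise_refiner", "BF16"),  # keep sensitive layers at full precision
    (r"\.bias$", "F32"),         # biases are tiny; leave them unquantized
]
DEFAULT = "Q4_K"

def classify(tensor_name, rules=RULES, default=DEFAULT):
    """Return the quantization type chosen for a tensor name."""
    for pattern, qtype in rules:
        if re.search(pattern, tensor_name):
            return qtype
    return default

def write_tensor_type_file(tensor_names, path):
    """Write one 'name=TYPE' line per tensor for llama-quantize to consume."""
    with open(path, "w") as f:
        for name in tensor_names:
            f.write(f"{name}={classify(name)}\n")
```

In practice the tensor names would be read from the GGUF header (e.g., via the gguf-py package) rather than listed by hand, so the same script works for any architecture.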
Key Benefits:
- Maintenance: Eliminates the need to frequently patch llama.cpp for new architectures.
- Portability: Allows shipping pre-compiled binaries that work with any GGUF model.
- Flexibility: Users can fine-tune quantization levels per tensor of any model without recompiling.
Here is a very minimal (and naive) PoC:
roj234/llama.cpp@aa9f7ed
Just a suggestion to see if this direction aligns with the project's goals.