Currently, llama-quantize is tightly coupled with specific model architectures. This requires frequent updates to the codebase whenever a new model type (new UNets, TTS models, or even custom architectures) is introduced to the GGUF ecosystem.
I've experimented with a patch that allows llama-quantize to process any valid GGUF file. Instead of relying on hardcoded rules, it uses an external tensor-type file (which can be generated via a Python script) to define quantization strategies for specific tensors (e.g., keeping sensitive layers like noise_refiner at BF16).
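To make the idea concrete, here is a minimal sketch of what generating such a tensor-type file could look like. The file format (`name=TYPE` lines), the rule set, and the function names are all illustrative assumptions, not the actual format used in the PoC:

```python
# Hypothetical sketch: map tensor-name regexes to quantization types and
# emit a plain-text tensor-type file. The format and rules here are
# assumptions for illustration, not the PoC's real format.
import re

# First matching regex wins; anything unmatched falls back to the default.
RULES = [
    (r"noise_refiner", "BF16"),  # keep sensitive layers at full precision
    (r"\.bias$", "F32"),         # biases are tiny; leave them unquantized
]
DEFAULT = "Q4_K"

def classify(tensor_name, rules=RULES, default=DEFAULT):
    """Return the quantization type chosen for a tensor name."""
    for pattern, qtype in rules:
        if re.search(pattern, tensor_name):
            return qtype
    return default

def write_tensor_type_file(tensor_names, path):
    """Write one 'name=TYPE' line per tensor for llama-quantize to consume."""
    with open(path, "w") as f:
        for name in tensor_names:
            f.write(f"{name}={classify(name)}\n")
```

In practice the tensor names would be read from the GGUF header (e.g., via the gguf-py package) rather than listed by hand, so the same script works for any architecture.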
Key Benefits:
- Maintenance: Eliminates the need to frequently patch llama.cpp for new architectures.
- Portability: Allows shipping pre-compiled binaries that work with any GGUF model.
- Flexibility: Users can fine-tune quantization levels per tensor of any model without recompiling.
Here is a very minimal (and naive) PoC:
roj234/llama.cpp@aa9f7ed
Just a suggestion to see if this direction aligns with the project's goals.