Generalize llama-quantize to support any GGUF without architecture-specific hardcoding #432

@roj234

Description

Currently, llama-quantize is tightly coupled to specific model architectures. This requires frequent updates to the codebase whenever a new model type (new UNet variants, TTS models, or even custom architectures...) is introduced to the GGUF ecosystem.

I've experimented with a patch that allows llama-quantize to process any valid GGUF file. Instead of relying on hardcoded rules, it uses an external tensor-type-file (which can be generated via a Python script) to define quantization strategies for specific tensors (e.g., keeping sensitive layers like noise_refiner at BF16).
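To make the idea concrete, here is a minimal sketch of the kind of Python script that could generate such a tensor-type file. The one-`name=TYPE`-rule-per-line format, the rule patterns, and the function names are all hypothetical illustrations, not the actual format used by the PoC:

```python
# Hypothetical generator for a tensor-type file consumed by a generalized
# llama-quantize. Format assumed here: one "tensor_name=QUANT_TYPE" line
# per tensor. This is an illustrative sketch, not the PoC's real format.
import re

# Ordered rules: the first matching regex wins. Sensitive layers (e.g. the
# noise_refiner blocks mentioned above) are pinned to BF16; biases stay at
# F32; everything else falls back to a default quantization type.
RULES = [
    (re.compile(r"\bnoise_refiner\b"), "BF16"),
    (re.compile(r"\.bias$"), "F32"),
]
DEFAULT_TYPE = "Q4_K"

def quant_type_for(tensor_name: str) -> str:
    """Return the quantization type for a single tensor name."""
    for pattern, qtype in RULES:
        if pattern.search(tensor_name):
            return qtype
    return DEFAULT_TYPE

def write_tensor_type_file(tensor_names, path):
    """Emit one 'name=TYPE' line per tensor."""
    with open(path, "w") as f:
        for name in tensor_names:
            f.write(f"{name}={quant_type_for(name)}\n")
```

In practice the tensor names would be enumerated from the source model, e.g. with the `gguf` Python package's `GGUFReader`, rather than hardcoded.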

Key Benefits:

  • Maintenance: Eliminates the need to frequently patch llama.cpp for new architectures.
  • Portability: Allows shipping pre-compiled binaries that work with any GGUF model.
  • Flexibility: Users can fine-tune quantization levels per tensor on any model without recompiling.

Here is a very minimal (and naive) PoC:
roj234/llama.cpp@aa9f7ed

Just a suggestion to see if this direction aligns with the project's goals.
