BitsAndBytes
https://github.com/are-we-gfx1100-yet/bitsandbytes-rocm
GPTQ for LLaMA
https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm
AutoGPTQ
https://github.com/are-we-gfx1100-yet/AutoGPTQ-rocm
Good performance: 43 it/s for 7B, 25 it/s for 13B, 15 it/s for 30B, 0.25 it/s for 40B 3-bit, 1 beam.
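The it/s figures above are iterations per second. As a rough illustration of how such a number can be measured, here is a minimal sketch in plain Python. The function name `iterations_per_second` and the `step` callable are hypothetical stand-ins for one unit of work (for example, one generated token); this is not the benchmark used to produce the numbers above.

```python
import time


def iterations_per_second(step, iterations=100, warmup=10):
    """Return how many times per second `step` can be called.

    `step` is a hypothetical zero-argument callable standing in for
    one unit of work, e.g. generating a single token.
    """
    for _ in range(warmup):  # warm-up calls are not timed
        step()
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


if __name__ == "__main__":
    # Stand-in CPU workload; replace with a real generation step.
    rate = iterations_per_second(lambda: sum(range(10_000)))
    print(f"{rate:.1f} it/s")
```

When timing GPU work, the step would also need to wait for the device to finish before the clock stops, since kernel launches are asynchronous.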
Triton
Navi 3x support is currently a work in progress. Stay tuned.
Reaches 13% of rocBLAS performance when running 03-matrix-multiplication with this branch, which was recently merged back.
There is still a lot of room for improvement.
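For context, a matrix-multiplication benchmark of this kind is usually reported in TFLOPS, and a relative figure such as the one above is the ratio of two throughputs for the same problem size. The sketch below shows that arithmetic. The function names and the timings in the demo are hypothetical and are not measurements from this project.

```python
def matmul_tflops(m, n, k, milliseconds):
    """TFLOPS of an (m x k) @ (k x n) matmul that took `milliseconds`."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops * 1e-12 / (milliseconds * 1e-3)


def relative_performance(candidate_ms, baseline_ms):
    """Candidate throughput as a fraction of the baseline's.

    Both timings must be for the same problem size, so the FLOP count
    cancels and only the ratio of the times remains.
    """
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    # Hypothetical timings for illustration only, not real measurements.
    size = 4096
    baseline_ms = 2.0    # assumed time for the vendor BLAS library
    candidate_ms = 16.0  # assumed time for the kernel under test
    print(f"baseline:  {matmul_tflops(size, size, size, baseline_ms):.1f} TFLOPS")
    print(f"candidate: {matmul_tflops(size, size, size, candidate_ms):.1f} TFLOPS")
    print(f"relative:  {relative_performance(candidate_ms, baseline_ms):.1%}")
```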
AITemplate
Navi 3x support is currently a work in progress. Stay tuned.
Reaches 25 it/s when generating a 512x512 image with Stable Diffusion, using this branch.
Somewhat disappointing. Is this really the limit of the RX 7900 XTX?
Flash Attention
To be ported to Navi 3x.
ROCm
ROCm 5.6.0 is available now, but we can't find Windows support anywhere.
I think it might be more appropriate to call it ROCm 5.5.2.