Mixed-precision Inference in vLLM

Mixed-precision inference accelerates the prefill step and improves the throughput of LLM inference.

Comparison with AWQ

The benchmark task is to compute the perplexity (PPL) on WikiText-2. The validation set contains 333088 samples.
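For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below illustrates only that definition. The function name perplexity and its input format (a flat list of natural-log token probabilities) are assumptions made for illustration and are not taken from evalppl.py.

```python
import math

def perplexity(token_logprobs):
    """Return exp of the mean negative log-likelihood over all tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token has a perplexity of about 4.
print(perplexity([math.log(0.25)] * 8))
```

Lower perplexity means the model assigns higher probability to the reference text, which is why it serves as the accuracy check in this comparison.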

For batch size = 32, the task is divided into 10409 parts.

AWQ finished the task in 10 minutes with 16.71 it/s.

MixQ finished the task in 4.50 minutes with 35.02 it/s.

For batch size = 512, the task is divided into 655 parts. The AWQ baseline is run with:

python evalppl.py --model_type awq --model_path /home/cyd/mixqdata/Llama-2-7b --quant_file /home/cyd/mixqdata/Llama-2-7b-AWQ --n_ctx 512 --n_batch 512 --eval_accuracy True

AWQ finished the task in 127 seconds with 5.2 it/s.

MixQ (W8A8O16) finished the task in 42 seconds with 15.36 it/s. It is run with:

python evalppl.py --model_type mix8 --model_path /home/cyd/mixqdata/quant8/Llama-2-7b --quant_file /home/cyd/mixqdata/quant8/Llama-2-7b --n_ctx 512 --n_batch 512 --eval_accuracy True
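Taking the reported it/s figures at face value, the throughput ratio of MixQ over AWQ is roughly 2.1x at batch size 32 and roughly 2.95x at batch size 512. The short sketch below reproduces that arithmetic. The helper name speedup and the dictionary layout are illustrative only, and all numbers are copied from the measurements above.

```python
import math

# Figures copied from the measurements reported above.
VALIDATION_ITEMS = 333088
runs = {
    32: {"awq_its": 16.71, "mixq_its": 35.02},
    512: {"awq_its": 5.2, "mixq_its": 15.36},
}

def speedup(baseline_its, candidate_its):
    """Throughput ratio of the candidate over the baseline."""
    return candidate_its / baseline_its

# Batch size 32 splits the validation data into 10409 parts.
print(math.ceil(VALIDATION_ITEMS / 32))

for batch_size, r in runs.items():
    print(batch_size, round(speedup(r["awq_its"], r["mixq_its"]), 2))
```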

Setup

Quantize the model with QComplier (https://github.com/Qcompiler/QComplier):

git clone git@github.com:Qcompiler/QComplier.git

Install vllm with:

pip install vllm==0.6.2

Clone the mixed-precision source code:

git clone git@github.com:Qcompiler/vllm-mixed-precision.git

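Then copy the compiled ".so" files from the installed vllm package into the cloned source tree. Here $PYTHON_PATH is the root of your Python environment; adjust python3.11 if your environment uses a different Python version.

cp -r $PYTHON_PATH/lib/python3.11/site-packages/vllm/*.so vllm-mixed-precision/vllm/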
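A quick way to confirm that the copy step worked is to list the ".so" files now present in the cloned package directory. This is a minimal sketch: the helper name shared_objects is hypothetical, and the path assumes the checkout sits in the current directory under the name created by git clone.

```python
from pathlib import Path

def shared_objects(package_dir):
    """Return the sorted names of the .so files directly inside package_dir."""
    directory = Path(package_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.so"))

if __name__ == "__main__":
    # Assumed location: the checkout created by the git clone step.
    found = shared_objects("vllm-mixed-precision/vllm")
    print(found if found else "no .so files found; re-check the cp step")
```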

Finally, uninstall the pip-installed vllm==0.6.2 so that Python picks up the cloned source tree instead:

pip uninstall vllm
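After the uninstall, and once the checkout is on PYTHONPATH as in the run step, it can be worth checking which copy of vllm Python would actually import. The sketch below uses the standard-library importlib.util.find_spec, which locates a top-level module without importing it. The helper name module_origin is hypothetical.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

if __name__ == "__main__":
    # Expect a path inside the vllm-mixed-precision checkout, not site-packages.
    print(module_origin("vllm"))
```

If this prints None, the checkout is not on PYTHONPATH; if it prints a site-packages path, the pip-installed package is still present.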

Running 8-bit mixed-precision inference in vLLM

export PYTHONPATH=$( pwd )
python test8bit.py --quant 8

The output is

Acknowledgement

[1] QUIK (https://arxiv.org/pdf/2310.09259)

[2] VLLM (https://github.com/vllm-project/vllm)
