AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on GPTQ algorithm (weight-only quantization).
English |
中文
News or Update
- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with Marlin int4*fp16 matrix multiplication kernel support, with the argument
use_marlin=True
when loading models. - 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated
auto-gptq
, so now running and training GPTQ models can be more available to everyone! See this blog and it's resources for more details!
For more histories please turn to here
Performance Comparison
Inference Speed
The result is generated using this script, batch size of input is 1, decode strategy is beam search and enforce the model to generate 512 tokens, speed metric is tokens/s (the larger, the better).
The quantized model is loaded using the setup that can gain the fastest inference speed.
model | GPU | num_beams | fp16 | gptq-int4 |
---|
llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
moss-moon 16b | 2xA100-40G | 1 | 06.83 | 06.78 |
moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |
Perplexity
For perplexity comparison, you can turn to here and here
Installation
AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:
CUDA/ROCm version | Installation | Built against PyTorch |
---|
CUDA 11.8 | pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ | 2.2.0+cu118 |
CUDA 12.1 | pip install auto-gptq | 2.2.0+cu121 |
ROCm 5.7 | pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ | 2.2.0+rocm5.7 |
AutoGPTQ can be installed with the Triton dependency with pip install auto-gptq[triton]
in order to be able to use the Triton backend (currently only supports linux, no 3-bits quantization).
For older AutoGPTQ, please refer to the previous releases installation table.
Install from source
Clone the source code:
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
A few packages are required in order to build from source: pip install numpy gekko pandas
.
Then, install locally from source:
pip install -vvv -e .
You can set BUILD_CUDA_EXT=0
to disable pytorch extension building, but this is strongly discouraged as AutoGPTQ then falls back on a slow python implementation.
On ROCm systems
To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION
environment variable. Example:
ROCM_VERSION=5.6 pip install -vvv -e .
The compilation can be speeded up by specifying the PYTORCH_ROCM_ARCH
variable (reference) in order to build for a single target device, for example gfx90a
for MI200 series devices.
For ROCm systems, the packages rocsparse-dev
, hipsparse-dev
, rocthrust-dev
, rocblas-dev
and hipblas-dev
are required to build.
Quick Tour
Quantization and Inference
warning: this is just a showcase of the usage of basic apis in AutoGPTQ, which uses only one sample to quantize a much small model, quality of quantized model using such little samples may not good.
Below is an example for the simplest use of auto_gptq
to quantize a model and inference after quantization:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
logging.basicConfig(
format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
tokenizer(
"auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
)
]
quantize_config = BaseQuantizeConfig(
bits=4,
group_size=128,
desc_act=False,
)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
model.save_quantized(quantized_model_dir, use_safetensors=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
For more advanced features of model quantization, please reference to this script
Customize Model
Below is an example to extend `auto_gptq` to support `OPT` model, as you will see, it's very easy:
from auto_gptq.modeling import BaseGPTQForCausalLM
class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
layers_block_name = "model.decoder.layers"
outside_layer_modules = [
"model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
"model.decoder.project_in", "model.decoder.final_layer_norm"
]
inside_layer_modules = [
["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
["self_attn.out_proj"],
["fc1"],
["fc2"]
]
After this, you can use OPTGPTQForCausalLM.from_pretrained
and other methods as shown in Basic.
Evaluation on Downstream Tasks
You can use tasks defined in auto_gptq.eval_tasks
to evaluate model's performance on specific down-stream task before and after quantization.
The predefined tasks support all causal-language-models implemented in 🤗 transformers and in this project.
Below is an example to evaluate `EleutherAI/gpt-j-6b` on sequence-classification task using `cardiffnlp/tweet_sentiment_multilingual` dataset:
from functools import partial
import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask
MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
0: "negative",
1: "neutral",
2: "positive"
}
LABELS = list(ID2LABEL.values())
def ds_refactor_fn(samples):
text_data = samples["text"]
label_data = samples["label"]
new_samples = {"prompt": [], "label": []}
for text, label in zip(text_data, label_data):
prompt = TEMPLATE.format(labels=LABELS, text=text)
new_samples["prompt"].append(prompt)
new_samples["label"].append(ID2LABEL[label])
return new_samples
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)
task = SequenceClassificationTask(
model=model,
tokenizer=tokenizer,
classes=LABELS,
data_name_or_path=DATASET,
prompt_col_name="prompt",
label_col_name="label",
**{
"num_samples": 1000,
"sample_max_len": 1024,
"block_max_len": 2048,
"load_fn": partial(datasets.load_dataset, name="english"),
"preprocess_fn": ds_refactor_fn,
"truncate_prompt": False
}
)
print(task.run())
print(
task.run(
generation_config=GenerationConfig(
num_beams=3,
num_return_sequences=3,
do_sample=True
)
)
)
Learn More
tutorials provide step-by-step guidance to integrate auto_gptq
with your own project and some best practice principles.
examples provide plenty of example scripts to use auto_gptq
in different ways.
Supported Models
you can use model.config.model_type
to compare with the table below to check whether the model you use is supported by auto_gptq
.
for example, model_type of WizardLM
, vicuna
and gpt4all
are all llama
, hence they are all supported by auto_gptq
.
Supported Evaluation Tasks
Currently, auto_gptq
supports: LanguageModelingTask
, SequenceClassificationTask
and TextSummarizationTask
; more Tasks will come soon!
Running tests
Tests can be run with:
pytest tests/ -s
FAQ
Which kernel is used by default?
AutoGPTQ defaults to using exllamav2 int4*fp16 kernel for matrix multiplication.
How to use Marlin kernel?
Marlin is an optimized int4 * fp16 kernel was recently proposed at https://github.com/IST-DASLab/marlin. This is integrated in AutoGPTQ when loading a model with use_marlin=True
. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
Acknowledgement
- Special thanks Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh for proposing GPTQ algorithm and open source the code, and for releasing Marlin kernel for mixed precision computation.
- Special thanks qwopqwop200, for code in this project that relevant to quantization are mainly referenced from GPTQ-for-LLaMa.
- Special thanks to turboderp, for releasing Exllama and Exllama v2 libraries with efficient mixed precision kernels.