<h1 align="center">AutoGPTQ</h1>
<p align="center">An easy-to-use LLM quantization package with user-friendly APIs, based on GPTQ algorithm (weight-only quantization).</p>
<p align="center">
<a href="https://github.com/PanQiWei/AutoGPTQ/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/PanQiWei/AutoGPTQ.svg">
</a>
<a href="https://pypi.org/project/auto-gptq/">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/auto-gptq">
</a>
</p>
<h4 align="center">
<p>
<b>English</b> |
<a href="https://github.com/PanQiWei/AutoGPTQ/blob/main/README_zh.md">δΈζ</a>
</p>
</h4>
## News or Update
- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, adding support for the [Marlin](https://github.com/IST-DASLab/marlin) int4*fp16 matrix multiplication kernel, enabled with the argument `use_marlin=True` when loading models.
- 2023-08-23 - (News) - 🤗 Transformers, Optimum and PEFT have integrated `auto-gptq`, so running and training GPTQ models is now more accessible to everyone! See [this blog](https://huggingface.co/blog/gptq-integration) and its resources for more details!
*For the full news history, please see [here](docs/NEWS_OR_UPDATE.md).*
## Performance Comparison
### Inference Speed
> The results were generated with [this script](examples/benchmark/generation_speed.py): the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better). A minimal sketch of this decoding setup is shown after the table.
>
> The quantized model is loaded with the setup that gives the fastest inference speed.
| model | GPU | num_beams | fp16 | gptq-int4 |
|---------------|---------------|-----------|-------|-----------|
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 6.83 | 6.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |
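This is not the benchmark script itself, only a minimal sketch of the decoding settings described above (beam search, 512 forced new tokens), assuming `model` and `tokenizer` are already loaded as in the Quick Tour below:

```python
import time

# tokenize a short prompt and move it to the model's device
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)

start = time.time()
output_ids = model.generate(
    **inputs,
    num_beams=4,          # beam search, matching the num_beams column above
    max_new_tokens=512,   # force the model to generate 512 tokens
    min_new_tokens=512,
)
elapsed = time.time() - start

new_tokens = output_ids.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```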
### Perplexity
For a perplexity comparison, see [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#result) and [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#gptq-vs-bitsandbytes).
## Installation
AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:
| CUDA/ROCm version | Installation | Built against PyTorch |
|-------------------|---------------------------------------------------------------------------------------------------|-----------------------|
| CUDA 11.8 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.2.0+cu118 |
| CUDA 12.1 | `pip install auto-gptq` | 2.2.0+cu121 |
| ROCm 5.7 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/` | 2.2.0+rocm5.7 |
AutoGPTQ can be installed with the Triton dependency via `pip install auto-gptq[triton]` in order to use the Triton backend (currently Linux only; 3-bit quantization is not supported).
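If the Triton extra is installed, the backend can be selected when loading a quantized model. A minimal sketch, assuming an already-quantized checkpoint (the directory name below is illustrative, matching the Quick Tour output):

```python
from auto_gptq import AutoGPTQForCausalLM

# "opt-125m-4bit" is an illustrative local checkpoint directory (see the Quick Tour);
# use_triton=True routes the quantized matmuls through the Triton kernels (Linux only).
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",
    device="cuda:0",
    use_triton=True,
)
```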
For older AutoGPTQ, please refer to [the previous releases installation table](docs/INSTALLATION.md).
### Install from source
Clone the source code:
```bash
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
```
A few packages are required in order to build from source: `pip install numpy gekko pandas`.
Then, install locally from source:
```bash
pip install -vvv -e .
```
You can set `BUILD_CUDA_EXT=0` to disable building the PyTorch extension, but this is **strongly discouraged** as AutoGPTQ then falls back on a slow Python implementation.
#### On ROCm systems
To install from source for AMD GPUs supporting ROCm, please specify the `ROCM_VERSION` environment variable. Example:
```bash
ROCM_VERSION=5.6 pip install -vvv -e .
```
The compilation can be sped up by specifying the `PYTORCH_ROCM_ARCH` variable ([reference](https://github.com/pytorch/pytorch/blob/7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba/setup.py#L132)) in order to build for a single target device, for example `gfx90a` for MI200 series devices.
For ROCm systems, the packages `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev` are required to build.
## Quick Tour
### Quantization and Inference
> Warning: this is only a showcase of the basic AutoGPTQ API. It uses a single sample to quantize a very small model, and the quality of a model quantized with so few samples may be poor.

Below is an example of the simplest use of `auto_gptq` to quantize a model and run inference after quantization:
```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting to False can significantly speed up inference, but the perplexity may be slightly worse
)
# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
# quantize model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)
# save quantized model
model.save_quantized(quantized_model_dir)
# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, log in first via huggingface-cli login,
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)
# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)
# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
For more advanced features of model quantization, please refer to [this script](examples/quantization/quant_with_alpaca.py).
### Customize Model
<details>
<summary>Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:</summary>
```python
from auto_gptq.modeling import BaseGPTQForCausalLM
class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of the transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in the transformer layer module
    # normally, there are four sub-lists; the modules in each one can be seen as one operation,
    # and the order should be the order in which they are actually executed; in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP input projection, MLP output projection
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]
```
After this, you can use `OPTGPTQForCausalLM.from_pretrained` and the other methods as shown in the Quick Tour above.
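For instance, a minimal sketch of using the class defined above, mirroring the Quick Tour workflow (the model id and output directory are illustrative):

```python
from transformers import AutoTokenizer

from auto_gptq import BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-350m"  # illustrative model id

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]

# quantize and save with the custom class instead of AutoGPTQForCausalLM
model = OPTGPTQForCausalLM.from_pretrained(pretrained_model_dir, BaseQuantizeConfig(bits=4, group_size=128))
model.quantize(examples)
model.save_quantized("opt-350m-4bit")
```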
</details>
### Evaluation on Downstream Tasks
You can use the tasks defined in `auto_gptq.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
The predefined tasks support all causal language models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.
<details>
<summary>Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>
```python
from functools import partial
import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask
MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())
def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples
# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)
task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be drawn for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load the dataset; it must accept only data_name_or_path as input
        # and return a datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess the dataset, used by datasets.Dataset.map;
        # it must return a Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # whether to truncate the prompt when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)
# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())
# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
```
</details>
## Learn More
[Tutorials](docs/tutorial) provide step-by-step guidance for integrating `auto_gptq` into your own project, along with some best-practice principles.
[Examples](examples/README.md) provide plenty of example scripts for using `auto_gptq` in different ways.
## Supported Models
> You can compare `model.config.model_type` with the table below to check whether the model you use is supported by `auto_gptq` (a minimal check is sketched after the table).
>
> For example, the model_type of `WizardLM`, `vicuna` and `gpt4all` is `llama`, hence they are all supported by `auto_gptq`.
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|------------------------------------|--------------|-----------|-----------|---------------|----------------------------------------------------------------------------------------------------|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅ [requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅ [requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅ [requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon(RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
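For example, a minimal sketch of checking a model's `model_type` with the 🤗 Transformers `AutoConfig` API (the model id is illustrative):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-125m")
print(config.model_type)  # "opt", which appears in the table above, so it is supported
```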
## Supported Evaluation Tasks
Currently, `auto_gptq` supports `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more tasks will come soon!
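A minimal sketch of the corresponding imports, assuming all three task classes are exposed by `auto_gptq.eval_tasks` in the same way `SequenceClassificationTask` is in the example above:

```python
# assumption: LanguageModelingTask and TextSummarizationTask live alongside
# SequenceClassificationTask in auto_gptq.eval_tasks
from auto_gptq.eval_tasks import (
    LanguageModelingTask,
    SequenceClassificationTask,
    TextSummarizationTask,
)
```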
## Running tests
Tests can be run with:
```
pytest tests/ -s
```
## FAQ
### Which kernel is used by default?
AutoGPTQ defaults to the ExLlamaV2 int4*fp16 kernel for matrix multiplication.
### How to use Marlin kernel?
Marlin is an optimized int4*fp16 kernel recently proposed at https://github.com/IST-DASLab/marlin. It is integrated into AutoGPTQ and used when loading a model with `use_marlin=True`. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
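A minimal loading sketch, assuming an already GPTQ-quantized checkpoint (the directory name below is illustrative) and an Ampere GPU:

```python
from auto_gptq import AutoGPTQForCausalLM

# load a GPTQ-quantized checkpoint and route matrix multiplications through Marlin
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",  # illustrative quantized model directory or Hub repo id
    device="cuda:0",
    use_marlin=True,
)
```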
## Acknowledgement
- Special thanks to **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh** for proposing the **GPTQ** algorithm and open-sourcing the [code](https://github.com/IST-DASLab/gptq), and for releasing the [Marlin kernel](https://github.com/IST-DASLab/marlin) for mixed-precision computation.
- Special thanks to **qwopqwop200**; the quantization-related code in this project is mainly adapted from [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
- Special thanks to **turboderp** for releasing the [Exllama](https://github.com/turboderp/exllama) and [Exllama v2](https://github.com/turboderp/exllamav2) libraries with efficient mixed-precision kernels.