<h1 align="center">AutoGPTQ</h1>
<p align="center">An easy-to-use LLM quantization package with user-friendly APIs, based on GPTQ algorithm (weight-only quantization).</p>
<p align="center">
<a href="https://github.com/PanQiWei/AutoGPTQ/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/PanQiWei/AutoGPTQ.svg">
</a>
<a href="https://pypi.org/project/auto-gptq/">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/auto-gptq">
</a>
</p>
<h4 align="center">
<p>
<b>English</b> |
<a href="https://github.com/PanQiWei/AutoGPTQ/blob/main/README_zh.md">δΈζ</a>
</p>
</h4>
## News or Update
- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, adding support for the [Marlin](https://github.com/IST-DASLab/marlin) int4*fp16 matrix multiplication kernel, enabled with the argument `use_marlin=True` when loading models.
- 2023-08-23 - (News) - 🤗 Transformers, Optimum and PEFT have integrated `auto-gptq`, so running and training GPTQ models is now more accessible to everyone! See [this blog](https://huggingface.co/blog/gptq-integration) and its resources for more details!
*For the full news history, please see [here](docs/NEWS_OR_UPDATE.md).*
## Performance Comparison
### Inference Speed
> The results were generated with [this script](examples/benchmark/generation_speed.py): the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better). A minimal sketch of this decoding setup is shown after the table.
>
> The quantized model is loaded with the setup that gives the fastest inference speed.
| model | GPU | num_beams | fp16 | gptq-int4 |
|---------------|---------------|-----------|-------|-----------|
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 6.83 | 6.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |
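This is not the benchmark script itself, only a minimal sketch of the decoding settings described above (beam search, 512 forced new tokens), assuming `model` and `tokenizer` are already loaded as in the Quick Tour below:

```python
import time

# tokenize a short prompt and move it to the model's device
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)

start = time.time()
output_ids = model.generate(
    **inputs,
    num_beams=4,          # beam search, matching the num_beams column above
    max_new_tokens=512,   # force the model to generate 512 tokens
    min_new_tokens=512,
)
elapsed = time.time() - start

new_tokens = output_ids.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```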
### Perplexity
For a perplexity comparison, see [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#result) and [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#gptq-vs-bitsandbytes).
## Installation
AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:
| CUDA/ROCm version | Installation | Built against PyTorch |
|-------------------|---------------------------------------------------------------------------------------------------|-----------------------|
| CUDA 11.8 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.2.0+cu118 |
| CUDA 12.1 | `pip install auto-gptq` | 2.2.0+cu121 |
| ROCm 5.7 | `pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/` | 2.2.0+rocm5.7 |
AutoGPTQ can be installed with the Triton dependency via `pip install auto-gptq[triton]` in order to use the Triton backend (currently Linux only; 3-bit quantization is not supported).
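If the Triton extra is installed, the backend can be selected when loading a quantized model. A minimal sketch, assuming an already-quantized checkpoint (the directory name below is illustrative, matching the Quick Tour output):

```python
from auto_gptq import AutoGPTQForCausalLM

# "opt-125m-4bit" is an illustrative local checkpoint directory (see the Quick Tour);
# use_triton=True routes the quantized matmuls through the Triton kernels (Linux only).
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",
    device="cuda:0",
    use_triton=True,
)
```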
For older AutoGPTQ, please refer to [the previous releases installation table](docs/INSTALLATION.md).
### Install from source
Clone the source code:
```bash
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
```
A few packages are required in order to build from source: `pip install numpy gekko pandas`.
Then, install locally from source:
```bash
pip install -vvv -e .
```
You can set `BUILD_CUDA_EXT=0` to disable building the PyTorch extension, but this is **strongly discouraged** as AutoGPTQ then falls back on a slow Python implementation.
#### On ROCm systems
To install from source for AMD GPUs supporting ROCm, please specify the `ROCM_VERSION` environment variable. Example:
```bash
ROCM_VERSION=5.6 pip install -vvv -e .
```
The compilation can be sped up by specifying the `PYTORCH_ROCM_ARCH` variable ([reference](https://github.com/pytorch/pytorch/blob/7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba/setup.py#L132)) in order to build for a single target device, for example `gfx90a` for MI200 series devices.
For ROCm systems, the packages `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev` are required to build.
## Quick Tour
### Quantization and Inference
> Warning: this is only a showcase of the basic AutoGPTQ API. It uses a single sample to quantize a very small model, and the quality of a model quantized with so few samples may be poor.

Below is an example of the simplest use of `auto_gptq` to quantize a model and run inference after quantization:
```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting to False can significantly speed up inference, but the perplexity may be slightly worse
)
# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
# quantize model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)
# save quantized model
model.save_quantized(quantized_model_dir)
# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, log in first via huggingface-cli login,
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)
# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")
# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)
# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))
# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```
For more advanced features of model quantization, please refer to [this script](examples/quantization/quant_with_alpaca.py).
### Customize Model
<details>
<summary>Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:</summary>
```python
from auto_gptq.modeling import BaseGPTQForCausalLM
class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of the transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in the transformer layer module
    # normally, there are four sub-lists; the modules in each one can be seen as one operation,
    # and the order should be the order in which they are actually executed; in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP input projection, MLP output projection
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]
```
After this, you can use `OPTGPTQForCausalLM.from_pretrained` and the other methods as shown in the Quick Tour above.
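For instance, a minimal sketch of using the class defined above, mirroring the Quick Tour workflow (the model id and output directory are illustrative):

```python
from transformers import AutoTokenizer

from auto_gptq import BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-350m"  # illustrative model id

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]

# quantize and save with the custom class instead of AutoGPTQForCausalLM
model = OPTGPTQForCausalLM.from_pretrained(pretrained_model_dir, BaseQuantizeConfig(bits=4, group_size=128))
model.quantize(examples)
model.save_quantized("opt-350m-4bit")
```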
</details>
### Evaluation on Downstream Tasks
You can use the tasks defined in `auto_gptq.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
The predefined tasks support all causal language models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.
<details>
<summary>Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>
```python
from functools import partial
import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask
MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())
def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples
# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)
task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be drawn for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load the dataset; it must accept only data_name_or_path as input
        # and return a datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess the dataset, used by datasets.Dataset.map;
        # it must return a Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # whether to truncate the prompt when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)
# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())
# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
```
</details>
## Learn More
[Tutorials](docs/tutorial) provide step-by-step guidance for integrating `auto_gptq` into your own project, along with some best-practice principles.
[Examples](examples/README.md) provide plenty of example scripts for using `auto_gptq` in different ways.
## Supported Models
> You can compare `model.config.model_type` with the table below to check whether the model you use is supported by `auto_gptq` (a minimal check is sketched after the table).
>
> For example, the model_type of `WizardLM`, `vicuna` and `gpt4all` is `llama`, hence they are all supported by `auto_gptq`.
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|------------------------------------|--------------|-----------|-----------|---------------|----------------------------------------------------------------------------------------------------|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅ [requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅ [requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅ [requires this peft branch](https://github.com/PanQiWei/peft/tree/multi_modal_adaption_prompt) |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon(RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
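For example, a minimal sketch of checking a model's `model_type` with the 🤗 Transformers `AutoConfig` API (the model id is illustrative):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-125m")
print(config.model_type)  # "opt", which appears in the table above, so it is supported
```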
## Supported Evaluation Tasks
Currently, `auto_gptq` supports `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more tasks will come soon!
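A minimal sketch of the corresponding imports, assuming all three task classes are exposed by `auto_gptq.eval_tasks` in the same way `SequenceClassificationTask` is in the example above:

```python
# assumption: LanguageModelingTask and TextSummarizationTask live alongside
# SequenceClassificationTask in auto_gptq.eval_tasks
from auto_gptq.eval_tasks import (
    LanguageModelingTask,
    SequenceClassificationTask,
    TextSummarizationTask,
)
```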
## Running tests
Tests can be run with:
```
pytest tests/ -s
```
## FAQ
### Which kernel is used by default?
AutoGPTQ defaults to the ExLlamaV2 int4*fp16 kernel for matrix multiplication.
### How to use Marlin kernel?
Marlin is an optimized int4*fp16 kernel recently proposed at https://github.com/IST-DASLab/marlin. It is integrated into AutoGPTQ and used when loading a model with `use_marlin=True`. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
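A minimal loading sketch, assuming an already GPTQ-quantized checkpoint (the directory name below is illustrative) and an Ampere GPU:

```python
from auto_gptq import AutoGPTQForCausalLM

# load a GPTQ-quantized checkpoint and route matrix multiplications through Marlin
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",  # illustrative quantized model directory or Hub repo id
    device="cuda:0",
    use_marlin=True,
)
```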
## Acknowledgement
- Special thanks to **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh** for proposing the **GPTQ** algorithm and open-sourcing the [code](https://github.com/IST-DASLab/gptq), and for releasing the [Marlin kernel](https://github.com/IST-DASLab/marlin) for mixed-precision computation.
- Special thanks to **qwopqwop200**; the quantization-related code in this project is mainly adapted from [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
- Special thanks to **turboderp** for releasing the [Exllama](https://github.com/turboderp/exllama) and [Exllama v2](https://github.com/turboderp/exllamav2) libraries with efficient mixed-precision kernels.