gptqmodel

Name: gptqmodel
Version: 1.0.6
Home page: https://github.com/ModelCloud/GPTQModel
Summary: A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.
Upload time: 2024-09-26 16:01:07
Maintainer: None
Docs URL: None
Author: ModelCloud
Requires Python: >=3.8.0
License: None
Keywords: gptq, quantization, large-language-models, transformers, 4bit, llm
            <h1 align="center">GPTQModel</h1>
<p align="center">An easy-to-use LLM quantization and inference toolkit based on GPTQ algorithm (weight-only quantization).</p>
<p align="center">
    <a href="https://github.com/ModelCloud/GPTQModel/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/ModelCloud/GPTQModel.svg">
    </a>
    <a href="https://pypi.org/project/gptqmodel/">
        <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/gptqmodel">
    </a>
</p>

## News
* 09/26/2024 ✨ [v1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed quantized Llama 3.2 Vision model loader.
* 09/26/2024 ✨ [v1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only quantization of the text layers is supported for now.
* 09/26/2024 ✨ [v1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for roughly 50% memory reduction on some models during quantization. Added a toggle to disable parallel packing.
* 09/18/2024 ✨ [v1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 ✨ [v1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release. 
* 08/14/2024 ✨✨ [v1.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.0) 40% faster `packing`, fixed Python 3.9 compat, added `lm_eval` api. 
* 08/10/2024 🚀🚀🚀 [v0.9.11](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.11) Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values. 
* 07/31/2024 🚀🚀 [v0.9.10](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.10) Ported the vllm/nm `gptq_marlin` inference kernel with expanded bits (8-bit), group_size (64, 32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto-calculate auto-round nsamples/seglen parameters based on the calibration dataset. Fixed save_quantized() when called on pre-quantized models with unsupported backends. HF Transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
* 07/25/2024 🚀 [v0.9.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.9): Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.  
* 07/13/2024 🚀🚀🚀 [v0.9.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.8):
Run quantized models directly with GPTQModel using the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic-batching inference for maximum `TPS` (check usage under examples). The Marlin backend also
got full end-to-end in/out-features padding to enhance current/future model compatibility.
<details>
    
<summary>Archived News:</summary>

* 07/08/2024 🚀 [v0.9.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.7): InternLM 2.5 model support added.

* 07/08/2024 🚀🚀 [v0.9.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.6): [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for potentially higher quality quantization, with `lm_head` module quantization for even more VRAM reduction; format export to `FORMAT.GPTQ` for max inference compatibility.

* 07/05/2024 🚀🚀 [v0.9.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.5): [Intel/QBits](https://github.com/intel/intel-extension-for-transformers) support added for [2,3,4,8] bit quantization/inference on CPU. Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.

* 07/03/2024 🚀 [v0.9.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.4): HF Transformers integration added and Gemma 2 support bug fixed.

* 07/02/2024 🚀 [v0.9.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.3): Added Gemma 2 support, faster PPL calculations on GPU, and more code/arg refactoring.

* 06/30/2024 🚀 [v0.9.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.2): Added auto-padding of model in/out-features for exllama and exllama v2. 
Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.

* 06/29/2024 🚀🚀🚀 [v0.9.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.1): 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), new BITBLAS format/kernel, proper batching of the calibration dataset resulting in a >50% quantization speedup, security hash check of loaded model weights, tons of refactor/usability improvements, bug fixes and much more.

* 06/20/2024 ✨ GPTQModel [v0.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.0): Thanks to the ModelCloud team and the open-source ML community for their contributions!
</details>

## Mission Statement

We want GPTQModel to be highly focused on GPTQ-based quantization and to target inference compatibility with HF Transformers, vLLM, and SGLang. 

## How is GPTQModel different from AutoGPTQ?

GPTQModel is an opinionated fork/refactor of AutoGPTQ with the latest bug fixes, more model support, faster quant inference, faster quantization, better quants (as measured by PPL), and a pledge from the ModelCloud team that we, along with the open-source ML community, will make every effort to keep the library up-to-date with the latest advancements, model support, and bug fixes.

We will backport bug fixes to AutoGPTQ on a case-by-case basis.

## Major Changes (Advantages) vs AutoGPTQ
* 🚀🚀🚀🚀 Extensive model support for: `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Phi-3`, `Qwen2MoE`, `DBRX` (Converted).
* 🚀🚀 vLLM inference integration for quantized models where format = `FORMAT.GPTQ` 
* 🚀🚀 SGLang inference integration for quantized models where format = `FORMAT.GPTQ` 
* 🚀 [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
* 🚀 [Intel/QBits](https://github.com/intel/intel-extension-for-transformers) support added for [2,3,4,8] bit quantization/inference on CPU.
* 🚀 [BITBLAS](https://github.com/microsoft/BitBLAS) format/inference support from Microsoft
* 🚀 `sym=False` support. AutoGPTQ's `sym=False` is unusable (re-quant required).
* 🚀`lm_head` module quant inference support for further VRAM reduction. 
* 🚀 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.
* 🚀 Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
* 🚀 Model weights sharding support
* 🚀 Security: hash check of model weights on load
* 🚀 Over 50% faster PPL calculations (OPT)
* 🚀 Over 40% faster `packing` stage in quantization (Llama 3.1 8B)
* ✨ Alert users of sub-optimal calibration data. Most new users get this part horribly wrong.
* ✨ Increased compatibility with newest models with auto-padding of in/out-features for [ Exllama, Exllama V2 ] backends.
* 👾 Removed non-working, partially working, or fully deprecated features: Peft, ROCM, AWQ Gemm inference, Triton v1 (replaced by v2), Fused Attention (Replaced by Marlin/Exllama).
* 👾 <del>Fixed packing performance regression on high core-count systems.</del> Backported to AutoGPTQ
* 👾 <del>Fixed crash on H100.</del> Backported to AutoGPTQ
* ✨ 10s of thousands of lines of refactor/cleanup.
* ✨ 8+ overly complex api args removed/merged into simple human-readable args. 
* ✨ Added CI workflow to validate future PRs and prevent code regressions.
* ✨ Added perplexity unit test to guard against model quant quality regressions.
* 👾 De-bloated 271K lines, of which 250K were caused by a single dataset used only by one example. 
* 👾 De-bloated the number of args exposed in the public `.from_quantized()`/`.from_pretrained()` api.
* ✨ Shorter and more concise public api/internal vars. No need to mimic HF style for verbose class names. 
* ✨ Everything that did not pass unit tests has been removed from the repo.

## Model Support ( 🚀 GPTQModel only ) 
[Ready-to-deploy quantized models](https://hf.co/ModelCloud)
  
| Model            |     |                       |     |           |     |            |     |     |
| ---------------- | --- | --------------------- | --- | --------- | --- | ---------- | --- | --- |
| Baichuan         | ✅   | EXAONE 3.0            | 🚀  | Llama     | ✅   | Phi/Phi-3  | 🚀  |     |
| Bloom            | ✅   | Falcon                | ✅   | LongLLaMA | ✅   | Qwen       | ✅   |     |
| ChatGLM          | 🚀  | Gemma 2               | 🚀  | MiniCPM   | 🚀  | Qwen2MoE   | 🚀  |     |
| CodeGen          | ✅   | GPTBigCode            | ✅   | MiniCPM3  | 🚀  | RefinedWeb | ✅   |     |
| Cohere           | ✅   | GPTNeoX               | ✅   | Mistral   | ✅   | StableLM   | ✅   |     |
| DBRX Converted   | 🚀  | GPT-2                 | ✅   | Mixtral   | ✅   | StarCoder2 | ✅   |     |
| Deci             | ✅   | GPT-J                 | ✅   | MOSS      | ✅   | XVERSE     | ✅   |     |
| DeepSeek-V2      | 🚀  | GRIN-MoE              | 🚀  | MPT       | ✅   | Yi         | ✅   |     |
| DeepSeek-V2-Lite | 🚀  | InternLM 1/2.5 | 🚀  | OPT       | ✅   |            |     |     |

## Compatibility 

We aim for 100% compatibility with models quantized by AutoGPTQ <= 0.7.1 and will consider syncing future compatibility on a case-by-case basis. 

## Platform/GPU Requirements

GPTQModel is currently Linux-only and requires an Nvidia GPU with CUDA compute capability >= 6.0. 

WSL on Windows should work as well. 

ROCM/AMD support will be re-added in a future version after everything on ROCM has been validated. Only fully validated features will be re-added from the original AutoGPTQ repo. 
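If you are unsure whether a GPU meets the compute-capability floor, the snippet below is a minimal sketch using PyTorch (which GPTQModel builds on); the requirement stated above remains the source of truth.

```py
# Minimal sketch (not part of GPTQModel): verify an Nvidia GPU with CUDA
# compute capability >= 6.0 is visible, using PyTorch.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible; GPTQModel requires an Nvidia GPU on Linux/WSL.")

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (6, 0), "GPU is below the required CUDA compute capability of 6.0"
```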

## Install

### PIP 

```bash
# Include any specific modules needed using brackets. Example: pip install gptqmodel[sglang,vllm,bitblas] --no-build-isolation
pip install gptqmodel --no-build-isolation
```
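
To confirm the install succeeded and see which version was picked up, an optional quick check:

```bash
# Optional sanity check after install
pip show gptqmodel
python -c "import gptqmodel; print('gptqmodel import OK')"
```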

### Install from source

```bash
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# compile and install
# You can optionally include specific modules like vllm, sglang, or bitblas by adding them in brackets. Example: pip install -vvv --no-build-isolation .[vllm,sglang,bitblas]
pip install -vvv --no-build-isolation .

# If you have `uv` package version 0.1.16 or higher, you can use `uv pip` for potentially better dependency management
# Include modules as needed: uv pip install -vvv --no-build-isolation .[vllm,sglang,bitblas]
uv pip install -vvv --no-build-isolation .
```

### Script installation  
```bash
# You can pass modules as arguments, e.g., --vllm --sglang --bitblas. Example: bash install.sh --vllm --sglang --bitblas
bash install.sh
```



### Quantization and Inference

> Warning: this is only a showcase of GPTQModel's basic apis. It uses a single calibration sample to quantize a very small model, so the quality of the resulting quantized model may be poor.

Below is an example of the simplest use of `gptqmodel` to quantize a model and run inference after quantization:

```py
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

pretrained_model_dir = "facebook/opt-125m"
quant_output_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
calibration_dataset = [
    tokenizer(
        "The world is a wonderful place full of beauty and love."
    )
]

quant_config = QuantizeConfig(
    bits=4,  # 4-bit
    group_size=128,  # 128 is a good balance between quality and performance
)

# load the un-quantized model; by default the model is always loaded into CPU memory
model = GPTQModel.from_pretrained(pretrained_model_dir, quant_config)

# quantize the model; calibration_dataset should be a list of dicts whose keys are only "input_ids" and "attention_mask"
model.quantize(calibration_dataset)

# save quantized model
model.save_quantized(quant_output_dir)

# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_output_dir)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("gptqmodel is", return_tensors="pt").to(model.device))[0]))
```
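
As the warning above notes, a single calibration sample is for demonstration only. Below is a sketch of building a more realistic calibration set with the same tokenizer; each `tokenizer(...)` call returns a dict containing `input_ids` and `attention_mask`, which matches the format `model.quantize()` expects. The example texts are placeholders.

```py
# Sketch only: use a few hundred representative samples rather than one.
# Each tokenizer(...) call yields a dict with "input_ids" and "attention_mask",
# which is the calibration_dataset format model.quantize() expects.
calibration_texts = [
    "Quantization reduces the memory footprint of large language models.",
    "The quick brown fox jumps over the lazy dog.",
    # ... add many more samples drawn from your target domain
]
calibration_dataset = [tokenizer(text) for text in calibration_texts]
model.quantize(calibration_dataset)
```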

For more advanced model quantization features, please refer to [this script](https://github.com/ModelCloud/GPTQModel/blob/main/examples/quantization/basic_usage_wikitext2.py).

### How to Add Support for a New Model

Read the [`gptqmodel/models/llama.py`](https://github.com/ModelCloud/GPTQModel/blob/5627f5ffeb3f19b1a2a97e3b6de6fbe668b0dc42/gptqmodel/models/llama.py) code, which explains in detail via comments how model support is defined. Use it as a guide for PRs that add new models. Most models follow the same pattern.

### Evaluation on Downstream Tasks

You can use tasks defined in `gptqmodel.eval_tasks` to evaluate a model's performance on specific downstream tasks before and after quantization.

The predefined tasks support all causal language models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.

<details>

<summary>Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>

```python
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.eval_tasks import SequenceClassificationTask

MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


#  model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = GPTQModel.from_pretrained(MODEL, QuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled to evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load the dataset; it must accept only data_name_or_path as input
        # and return a datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate the label when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
```

</details>

## Learn More

The [tutorials](docs/tutorial) provide step-by-step guidance for integrating `gptqmodel` into your own project, along with some best-practice principles.

The [examples](examples/README.md) directory provides plenty of example scripts for using `gptqmodel` in different ways.

## Supported Evaluation Tasks

Currently, `gptqmodel` supports `LanguageModelingTask`, `SequenceClassificationTask`, and `TextSummarizationTask`; more tasks will come soon!
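
All three task classes live in `gptqmodel.eval_tasks`, the same module used in the evaluation example above; a minimal import sketch:

```py
# Minimal sketch: the task classes named above, imported from gptqmodel.eval_tasks.
from gptqmodel.eval_tasks import (
    LanguageModelingTask,
    SequenceClassificationTask,
    TextSummarizationTask,
)
```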

### Which kernel is used by default?

GPTQModel will use the Marlin, Exllama v2, and Triton kernels, in that order, for maximum inference performance.
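
To force a specific kernel rather than relying on the auto-selection order above, the backend can be chosen at load time. The snippet below is a sketch only: it assumes the `BACKEND` enum referenced in the news above (e.g. `backend.BITBLAS`) is importable from `gptqmodel` and accepted by `from_quantized()`.

```py
# Sketch: explicit backend selection at load time. Assumes gptqmodel exports a
# BACKEND enum (referenced in the news as backend.BITBLAS) and that
# from_quantized() accepts a backend argument; by default no backend needs to
# be passed and the Marlin -> Exllama v2 -> Triton order above is used.
from gptqmodel import BACKEND, GPTQModel

model = GPTQModel.from_quantized("opt-125m-4bit", backend=BACKEND.MARLIN)
```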

# Acknowledgements

* **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh**: for creating [GPTQ](https://github.com/IST-DASLab/gptq) and [Marlin](https://github.com/IST-DASLab/marlin).
* **PanQiWei**: for creating [AutoGPTQ](https://github.com/autogptq/AutoGPTQ), upon which this project's code is based.
* **FXMarty**: for maintaining and supporting [AutoGPTQ](https://github.com/autogptq/AutoGPTQ).
* **Qwopqwop200**: for quantization code used in this project adapted from [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
* **Turboderp**: for releasing [Exllama v1](https://github.com/turboderp/exllama) and [Exllama v2](https://github.com/turboderp/exllamav2) kernels adapted for use in this project.
* **FpgaMiner**: for the [GPTQ-Triton](https://github.com/fpgaminer/GPTQ-triton) kernels used in [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda), which were adapted into this project.

            
