<h1 align="center">GPTQModel</h1>
<p align="center">An easy-to-use LLM quantization and inference toolkit based on GPTQ algorithm (weight-only quantization).</p>
<p align="center">
<a href="https://github.com/ModelCloud/GPTQModel/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/ModelCloud/GPTQModel.svg">
</a>
<a href="https://pypi.org/project/gptqmodel/">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dd/gptqmodel">
</a>
</p>
## News
* 09/26/2024 ✨ [v1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed the loader for quantized Llama 3.2 Vision models.
* 09/26/2024 ✨ [v1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only quantization of the text layers is supported for now.
* 09/26/2024 ✨ [v1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added a toggle to disable parallel packing.
* 09/18/2024 ✨ [v1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 ✨ [v1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
* 08/14/2024 ✨✨ [v1.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.0) 40% faster `packing`, Fixed Python 3.9 compat, added `lm_eval` api.
* 08/10/2024 🚀🚀🚀 [v0.9.11](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.11) Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values.
* 07/31/2024 🚀🚀 [v0.9.10](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.10) Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto-calculates auto-round nsamples/seglen parameters based on the calibration dataset. Fixed save_quantized() when called on pre-quantized models with non-supported backends. Updated the HF Transformers dependency to ensure Llama 3.1 fixes are correctly applied to both quantization and inference.
* 07/25/2024 🚀 [v0.9.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.9): Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
* 07/13/2024 🚀🚀🚀 [v0.9.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.8): Run quantized models directly with GPTQModel using the fast `vLLM` or `SGLang` backends! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). The Marlin backend also gained full end-to-end in/out-features padding to improve compatibility with current and future models.
<details>
<summary>Archived News:</summary>
* 07/08/2024 🚀 [v0.9.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.7): InternLM 2.5 model support added.
* 07/08/2024 🚀🚀 [v0.9.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.6): [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
* 07/05/2024 🚀🚀 [v0.9.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.5): [Intel/QBits](https://github.com/intel/intel-extension-for-transformers) support added for [2,3,4,8] bit quantization/inference on CPU. Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.
* 07/03/2024 🚀 [v0.9.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.4): HF Transformers integration added and bugs fixed in Gemma 2 support.
* 07/02/2024 🚀 [v0.9.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.3): Added Gemma 2 support, faster PPL calculations on GPU, and more code/arg refactoring.
* 06/30/2024 🚀 [v0.9.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.2): Added auto-padding of model in/out-features for exllama and exllama v2.
Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
* 06/29/2024 🚀🚀🚀 [v0.9.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.1): 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), new BITBLAS format/kernel, proper batching of the calibration dataset resulting in a > 50% quantization speedup, security hash check of loaded model weights, tons of refactoring/usability improvements, bug fixes and much more.
* 06/20/2024 ✨ GPTQModel [v0.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.0): Thanks to the ModelCloud team and the open-source ML community for all their work and contributions!
</details>
## Mission Statement
We want GPTQModel to be highly focused on GPTQ based quantization and target inference compatibility with HF Transformers, vLLM, and SGLang.
## How is GPTQModel different from AutoGPTQ?
GPTQModel is an opinionated fork/refactor of AutoGPTQ with the latest bug fixes, more model support, faster quantized inference, faster quantization, better quants (as measured in PPL), and a pledge from the ModelCloud team that we, along with the open-source ML community, will make every effort to keep the library up to date with the latest advancements, model support, and bug fixes.
We will backport bug fixes to AutoGPTQ on a case-by-case basis.
## Major Changes (Advantages) vs AutoGPTQ
* 🚀🚀🚀🚀 Extensive model support for: `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Phi-3`, `Qwen2MoE`, `DBRX` (Converted).
* 🚀🚀 vLLM inference integration for quantized model where format = `FORMAT.GPTQ`
* 🚀🚀 SGLang inference integration for quantized model where format = `FORMAT.GPTQ`
* 🚀 [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
* 🚀 [Intel/QBits](https://github.com/intel/intel-extension-for-transformers) support added for [2,3,4,8] bit quantization/inference on CPU.
* 🚀 [BITBLAS](https://github.com/microsoft/BitBLAS) format/inference support from Microsoft
* 🚀`Sym=False` Support. AutoGPTQ has unusable `sym=false`. (Re-quant required)
* 🚀`lm_head` module quant inference support for further VRAM reduction.
* 🚀 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.
* 🚀 Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
* 🚀 Model weights sharding support
* 🚀 Security: hash check of model weights on load
* 🚀 Over 50% faster PPL calculations (OPT)
* 🚀 Over 40% faster `packing` stage in quantization (Llama 3.1 8B)
* ✨ Alert users of sub-optimal calibration data. Most new users get this part horribly wrong.
* ✨ Increased compatibility with newest models with auto-padding of in/out-features for [ Exllama, Exllama V2 ] backends.
* 👾 Removed non-working, partially working, or fully deprecated features: Peft, ROCM, AWQ Gemm inference, Triton v1 (replaced by v2), Fused Attention (Replaced by Marlin/Exllama).
* 👾 <del>Fixed packing Performance regression on high core-count systems.</del> Backported to AutoGPTQ
* 👾 <del>Fixed crash on H100.</del> Backported to AutoGPTQ
* ✨ 10s of thousands of lines of refactor/cleanup.
* ✨ More than 8 overly complex api args removed/merged into simple human-readable args.
* ✨ Added CI workflow to validate future PRs and prevent code regressions.
* ✨ Added perplexity unit-test to guard against model quant quality regressions.
* 👾 De-bloated 271K lines of which 250K was caused by a single dataset used only by an example.
* 👾 De-bloated the number of args presented in public `.from_quantized()`/`.from_pretrained()` api
* ✨ Shorter and more concise public api/internal vars. No need to mimic HF style for verbose class names.
* ✨ Everything that did not pass unit-tests has been removed from repo.
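
The hash check of model weights listed above can be pictured with a small standard-library sketch. This is only an illustration of the idea, not GPTQModel's actual implementation: the helper names `file_sha256` and `verify_weights` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large weight files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_sha256: str) -> bool:
    """Hypothetical helper: True only if the file matches the expected digest."""
    return file_sha256(path) == expected_sha256.lower()
```

A caller would compare the digest published alongside a model with the digest of the downloaded file and refuse to load the weights when `verify_weights` returns `False`.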
## Model Support ( 🚀 GPTQModel only )
[Ready-to-deploy quantized models](https://hf.co/ModelCloud)
| Model | | | | | | | | |
| ---------------- | --- | --------------------- | --- | --------- | --- | ---------- | --- | --- |
| Baichuan | ✅ | EXAONE 3.0 | 🚀 | Llama | ✅ | Phi/Phi-3 | 🚀 | |
| Bloom | ✅ | Falcon | ✅ | LongLLaMA | ✅ | Qwen | ✅ | |
| ChatGLM | 🚀 | Gemma 2 | 🚀 | MiniCPM | 🚀 | Qwen2MoE | 🚀 | |
| CodeGen | ✅ | GPTBigCode | ✅ | MiniCPM3 | 🚀 | RefinedWeb | ✅ | |
| Cohere | ✅ | GPTNeoX | ✅ | Mistral | ✅ | StableLM | ✅ | |
| DBRX Converted | 🚀 | GPT-2 | ✅ | Mixtral | ✅ | StarCoder2 | ✅ | |
| Deci | ✅ | GPT-J | ✅ | MOSS | ✅ | XVERSE | ✅ | |
| DeepSeek-V2 | 🚀 | GRIN-MoE | 🚀 | MPT | ✅ | Yi | ✅ | |
| DeepSeek-V2-Lite | 🚀 | InternLM 1/2.5 | 🚀 | OPT | ✅ | | | |
## Compatibility
We aim for 100% compatibility with models quantized by AutoGPTQ <= 0.7.1 and will consider syncing future compatibility on a case-by-case basis.
## Platform/GPU Requirements
GPTQModel is currently Linux only and requires an Nvidia GPU with CUDA compute capability >= 6.0.
WSL on Windows should work as well.
ROCM/AMD support will be re-added in a future version after everything on ROCM has been validated. Only fully validated features will be re-added from the original AutoGPTQ repo.
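
To check a machine against the requirements above before installing, a small script along these lines may help. It is a sketch under assumptions: the function names are hypothetical, PyTorch is treated as optional, and only `torch.cuda.is_available()` and `torch.cuda.get_device_capability()` are used from it.

```python
import sys
from typing import Tuple

# Minimum CUDA compute capability stated in the requirements above.
MIN_CAPABILITY = (6, 0)


def capability_ok(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Compare (major, minor) compute capability tuples."""
    return tuple(capability) >= tuple(minimum)


def check_environment() -> str:
    """Return "ok" or a short reason why this machine looks unsupported."""
    # WSL also reports a Linux platform string, matching the note above.
    if not sys.platform.startswith("linux"):
        return "unsupported platform: " + sys.platform
    try:
        import torch  # optional: only needed for the GPU query
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    capability = torch.cuda.get_device_capability(0)
    if not capability_ok(capability):
        return f"compute capability {capability} is below {MIN_CAPABILITY}"
    return "ok"


if __name__ == "__main__":
    print(check_environment())
```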
## Install
### PIP
```bash
# Include any specific modules needed using brackets. Example: pip install gptqmodel[sglang,vllm,bitblas] --no-build-isolation
pip install gptqmodel --no-build-isolation
```
### Install from source
```bash
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# compile and install
# You can optionally include specific modules like vllm, sglang, or bitblas by adding them in brackets. Example: pip install -vvv --no-build-isolation .[vllm,sglang,bitblas]
pip install -vvv --no-build-isolation .
# If you have `uv` package version 0.1.16 or higher, you can use `uv pip` for potentially better dependency management
# Include modules as needed: uv pip install -vvv --no-build-isolation .[vllm,sglang,bitblas]
uv pip install -vvv --no-build-isolation .
```
### Script installation
```bash
# You can pass modules as arguments, e.g., --vllm --sglang --bitblas. Example: bash install.sh --vllm --sglang --bitblas
bash install.sh
```
### Quantization and Inference
> warning: this is just a showcase of the basic APIs in GPTQModel. It quantizes a very small model with a single calibration sample, so the quality of the resulting quantized model may be poor.
Below is the simplest example of using `gptqmodel` to quantize a model and then run inference with it:
```py
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

pretrained_model_dir = "facebook/opt-125m"
quant_output_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
calibration_dataset = [
    tokenizer(
        "The world is a wonderful place full of beauty and love."
    )
]

quant_config = QuantizeConfig(
    bits=4,  # 4-bit
    group_size=128,  # 128 is a good balance between quality and performance
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = GPTQModel.from_pretrained(pretrained_model_dir, quant_config)

# quantize model, the calibration_dataset should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(calibration_dataset)

# save quantized model
model.save_quantized(quant_output_dir)

# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_output_dir)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("gptqmodel is", return_tensors="pt").to(model.device))[0]))
```
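
The `group_size` value in the config above trades quality against storage: smaller groups carry more per-group metadata. The arithmetic below is a rough sketch, assuming each group stores one 16-bit scale and one zero point of `bits` width per output column and ignoring any other metadata. The function name is hypothetical and the storage layout is an assumption, not a statement about GPTQModel's exact on-disk format.

```python
def effective_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Approximate storage cost per weight under the stated assumptions."""
    if bits <= 0 or group_size <= 0:
        raise ValueError("bits and group_size must be positive")
    # Assumed per-group overhead: one scale plus one zero point, amortized
    # over the group_size weights that share them.
    overhead = (scale_bits + bits) / group_size
    return bits + overhead


if __name__ == "__main__":
    for group_size in (128, 64, 32):
        print(group_size, effective_bits_per_weight(4, group_size))
```

Under these assumptions, 4-bit weights with `group_size=128` cost about 4.16 bits each, while `group_size=32` costs about 4.63 bits each.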
For more advanced model quantization features, please refer to [this script](https://github.com/ModelCloud/GPTQModel/blob/main/examples/quantization/basic_usage_wikitext2.py).
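
Since a single calibration sample is only suitable for a demo, a larger calibration set can be assembled in the same shape the example uses: a list of dicts with `input_ids` and `attention_mask` keys. The helper below is a hypothetical sketch. `build_calibration_dataset` and its `min_tokens` filter are assumptions for illustration, and `tokenize` stands in for any callable that behaves like a Hugging Face tokenizer called on one string.

```python
from typing import Callable, Dict, Iterable, List


def build_calibration_dataset(
    texts: Iterable[str],
    tokenize: Callable[[str], Dict[str, list]],
    min_tokens: int = 1,
) -> List[Dict[str, list]]:
    """Tokenize each text and keep only the two keys used for calibration."""
    dataset = []
    for text in texts:
        encoded = tokenize(text)
        sample = {
            "input_ids": list(encoded["input_ids"]),
            "attention_mask": list(encoded["attention_mask"]),
        }
        # Skip samples that are too short to be useful (assumed heuristic).
        if len(sample["input_ids"]) >= min_tokens:
            dataset.append(sample)
    return dataset
```

With the tokenizer from the example above, `build_calibration_dataset(texts, tokenizer, min_tokens=16)` would yield a list that can be passed to `model.quantize(...)`.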
### How to Add Support for a New Model
Read the [`gptqmodel/models/llama.py`](https://github.com/ModelCloud/GPTQModel/blob/5627f5ffeb3f19b1a2a97e3b6de6fbe668b0dc42/gptqmodel/models/llama.py) code, which explains in detail via comments how model support is defined. Use it as a guide when opening a PR for a new model. Most models follow the same pattern.
### Evaluation on Downstream Tasks
You can use the tasks defined in `gptqmodel.eval_tasks` to evaluate a model's performance on a specific downstream task before and after quantization.
The predefined tasks support all causal-language-models implemented in [🤗 transformers](https://github.com/huggingface/transformers) and in this project.
<details>
<summary>Below is an example that evaluates `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>
```python
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.eval_tasks import SequenceClassificationTask

MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = GPTQModel.from_pretrained(MODEL, QuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
```
</details>
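
The "self-consistency" run in the example requests several sampled generations per prompt. One common way to turn several sampled answers into a single prediction is a majority vote, sketched below. Whether `SequenceClassificationTask` aggregates this way internally is not stated here, so treat `majority_vote` as a hypothetical illustration of the technique rather than the library's behavior.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label after light normalization."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    normalized = [prediction.strip().lower() for prediction in predictions]
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(normalized).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_vote(["positive", "Positive ", "neutral"]))
```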
## Learn More
[tutorials](docs/tutorial) provide step-by-step guidance to integrate `gptqmodel` with your own project and some best practice principles.
[examples](examples/README.md) provide plenty of example scripts to use `gptqmodel` in different ways.
## Supported Evaluation Tasks
Currently, `gptqmodel` supports: `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more Tasks will come soon!
### Which kernel is used by default?
GPTQModel tries the Marlin, Exllama v2, and Triton kernels, in that order, for maximum inference performance.
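
The ordering described above amounts to a first-match fallback over a priority list. The sketch below shows that pattern in isolation. The kernel names are plain strings chosen for illustration, and `select_kernel` and its `is_available` callback are hypothetical, not part of the GPTQModel API.

```python
from typing import Callable, Optional, Sequence

# Priority order taken from the sentence above; the string names are illustrative.
KERNEL_PRIORITY = ("marlin", "exllama_v2", "triton")


def select_kernel(
    is_available: Callable[[str], bool],
    priority: Sequence[str] = KERNEL_PRIORITY,
) -> Optional[str]:
    """Return the first kernel in priority order that reports as usable."""
    for name in priority:
        if is_available(name):
            return name
    return None


if __name__ == "__main__":
    usable = {"exllama_v2", "triton"}
    print(select_kernel(lambda name: name in usable))
```

In practice the availability check would depend on factors such as the quantization bits, group size, and GPU in use, which is why a fallback chain is useful.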
# Acknowledgements
* **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh**: for creating [GPTQ](https://github.com/IST-DASLab/gptq) and [Marlin](https://github.com/IST-DASLab/marlin).
* **PanQiWei**: for creating [AutoGPTQ](https://github.com/autogptq/AutoGPTQ), on which this project's code is based.
* **FXMarty**: for maintaining and supporting [AutoGPTQ](https://github.com/autogptq/AutoGPTQ).
* **Qwopqwop200**: for quantization code used in this project adapted from [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).
* **Turboderp**: for releasing [Exllama v1](https://github.com/turboderp/exllama) and [Exllama v2](https://github.com/turboderp/exllamav2) kernels adapted for use in this project.
* **FpgaMiner**: for [GPTQ-Triton](https://github.com/fpgaminer/GPTQ-triton) kernels used in [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda) which is adapted into this project.
Raw data
{
"_id": null,
"home_page": "https://github.com/ModelCloud/GPTQModel",
"name": "gptqmodel",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8.0",
"maintainer_email": null,
"keywords": "gptq, quantization, large-language-models, transformers, 4bit, llm",
"author": "ModelCloud",
"author_email": "qubitium@modelcloud.ai",
"download_url": "https://files.pythonhosted.org/packages/6d/71/82783be8bda8ec724cb63b5750c2f6bb79c9fa583a9488d17b2105798978/gptqmodel-1.0.6.tar.gz",
"platform": "linux",
"description": "<h1 align=\"center\">GPTQModel</h1>\n<p align=\"center\">An easy-to-use LLM quantization and inference toolkit based on GPTQ algorithm (weight-only quantization).</p>\n<p align=\"center\">\n <a href=\"https://github.com/ModelCloud/GPTQModel/releases\">\n <img alt=\"GitHub release\" src=\"https://img.shields.io/github/release/ModelCloud/GPTQModel.svg\">\n </a>\n <a href=\"https://pypi.org/project/gptqmodel/\">\n <img alt=\"PyPI - Downloads\" src=\"https://img.shields.io/pypi/dd/gptqmodel\">\n </a>\n</p>\n\n## News\n* 09/26/2024 \u2728 [v1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed quantized Llama 3.2 vision quantized loader\n* 09/26/2024 \u2728 [v1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only text-layer quantization layers are supported for now.\n* 09/26/2024 \u2728 [v1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle disable parallel packing. \n* 09/18/2024 \u2728 [v1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.\n* 08/16/2024 \u2728 [v1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release. \n* 08/14/2024 \u2728\u2728 [v1.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.0) 40% faster `packing`, Fixed Python 3.9 compat, added `lm_eval` api. \n* 08/10/2024 \ud83d\ude80\ud83d\ude80\ud83d\ude80 [v0.9.11](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.11) Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values. 
\n* 07/31/2024 \ud83d\ude80\ud83d\ude80 [v0.9.10](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.10) Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers depend updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.\n* 07/25/2024 \ud83d\ude80 [v0.9.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.9): Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more. \n* 07/13/2024 \ud83d\ude80\ud83d\ude80\ud83d\ude80 [v0.9.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.8):\nRun quantized models directly using GPTQModel using fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dyanamic batching inference for maximum `TPS` (check usage under examples). 
Marlin backend also\ngot full end-to-end in/out features padding to enhance current/future model compatibility.\n<details>\n \n<summary>Archived News:</summary>\n\n* 07/08/2024 \ud83d\ude80 [v0.9.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.7): InternLM 2.5 model support added.\n\n* 07/08/2024 \ud83d\ude80\ud83d\ude80 [v0.9.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.6): [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.\n\n* 07/05/2024 \ud83d\ude80\ud83d\ude80 [v0.9.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.5): [Intel/QBits](https://github.com/intel/intel-extension-for-transformers) support added for [2,3,4,8] bit quantization/inference on CPU. Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.\n\n* 07/03/2024 \ud83d\ude80 [v0.9.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.4): HF Transformers integration added and bug fixed Gemma 2 support.\n\n* 07/02/2024 \ud83d\ude80 [v0.9.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.3): Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refractor.\n\n* 06/30/2024 \ud83d\ude80 [v0.9.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.2): Added auto-padding of model in/out-features for exllama and exllama v2. \nFixed quantization of OPT and DeepSeek V2-Lite models. 
Fixed inference for DeepSeek V2-Lite.\n\n* 06/29/2024 \ud83d\ude80\ud83d\ude80\ud83d\ude80 [v0.9.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.1): With 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), BITBLAS new format/kernel, proper batching of calibration dataset resulting > 50% quantization speedup, security hash check of loaded model weights, tons of refractor/usability improvements, bugs fixes and much more.\n\n* 06/20/2924 \u2728 GPTQModel [v0.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.0): Thanks for all the work from ModelCloud team and the opensource ML community for their contributions!\n</details>\n\n## Mission Statement\n\nWe want GPTQModel to be highly focused on GPTQ based quantization and target inference compatibility with HF Transformers, vLLM, and SGLang. \n\n## How is GPTQModel different from AutoGPTQ?\n\nGPTQModel is an opinionated fork/refactor of AutoGPTQ with latest bug fixes, more model support, faster quant inference, faster quantization, better quants (as measured in PPL) and a pledge from the ModelCloud team and that we, along with the open-source ML community, will take every effort to bring the library up-to-date with latest advancements, model support, and bug fixes.\n\nWe will backport bug fixes to AutoGPTQ on a case-by-case basis.\n\n## Major Changes (Advantages) vs AutoGPTQ\n* \ud83d\ude80\ud83d\ude80\ud83d\ude80\ud83d\ude80 Extensive model support for: `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Phi-3`, `Qwen2MoE`, `DBRX` (Converted).\n* \ud83d\ude80\ud83d\ude80 vLLM inference integration for quantized model where format = `FORMAT.GPTQ` \n* \ud83d\ude80\ud83d\ude80 SGLang inference integration for quantized model where format = `FORMAT.GPTQ` \n* \ud83d\ude80 [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for 
even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.\n* \ud83d\ude80 [Intel/QBits](https://github.com/intel/intel-extension-for-transformers) support added for [2,3,4,8] bit quantization/inference on CPU.\n* \ud83d\ude80 [BITBLAS](https://github.com/microsoft/BitBLAS) format/inference support from Microsoft\n* \ud83d\ude80`Sym=False` Support. AutoGPTQ has unusable `sym=false`. (Re-quant required)\n* \ud83d\ude80`lm_head` module quant inference support for further VRAM reduction. \n* \ud83d\ude80 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.\n* \ud83d\ude80 Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)\n* \ud83d\ude80 Model weights sharding support\n* \ud83d\ude80 Security: hash check of model weights on load\n* \ud83d\ude80 Over 50% faster PPL calculations (OPT)\n* \ud83d\ude80 Over 40% faster `packing` stage in quantization (Llama 3.1 8B)\n* \u2728 Alert users of sub-optimal calibration data. Most new users get this part horribly wrong.\n* \u2728 Increased compatibility with newest models with auto-padding of in/out-features for [ Exllama, Exllama V2 ] backends.\n* \ud83d\udc7e Removed non-working, partially working, or fully deprecated features: Peft, ROCM, AWQ Gemm inference, Triton v1 (replaced by v2), Fused Attention (Replaced by Marlin/Exllama).\n* \ud83d\udc7e <del>Fixed packing Performance regression on high core-count systems.</del> Backported to AutoGPTQ\n* \ud83d\udc7e <del>Fixed crash on H100.</del> Backported to AutoGPTQ\n* \u2728 10s of thousands of lines of refactor/cleanup.\n* \u2728 Over 8+ overly complex api args removed/merged into simple human-readable args. 
\n* \u2728 Added CI workflow for validation of future PRs and prevent code regressions.\n* \u2728 Added perplexity unit-test to prevent against model quant quality regressions.\n* \ud83d\udc7e De-bloated 271K lines of which 250K was caused by a single dataset used only by an example. \n* \ud83d\udc7e De-bloat the number of args presented in public `.from_quantized()`/`.from_pretrained()` api\n* \u2728 Shorter and more concise public api/internal vars. No need to mimic HF style for verbose class names. \n* \u2728 Everything that did not pass unit-tests have been removed from repo.\n\n## Model Support ( \ud83d\ude80 GPTQModel only ) \n[Ready to deply quantized models](https://hf.co/ModelCloud)\n \n| Model | | | | | | | | |\n| ---------------- | --- | --------------------- | --- | --------- | --- | ---------- | --- | --- |\n| Baichuan | \u2705 | EXAONE 3.0 | \ud83d\ude80 | Llama | \u2705 | Phi/Phi-3 | \ud83d\ude80 | |\n| Bloom | \u2705 | Falon | \u2705 | LongLLaMA | \u2705 | Qwen | \u2705 | |\n| ChatGLM | \ud83d\ude80 | Gemma 2 | \ud83d\ude80 | MiniCPM | \ud83d\ude80 | Qwen2MoE | \ud83d\ude80 | |\n| CodeGen | \u2705 | GPTBigCod | \u2705 | MiniCPM3 | \ud83d\ude80 | RefinedWeb | \u2705 | |\n| Cohere | \u2705 | GPTNeoX | \u2705 | Mistral | \u2705 | StableLM | \u2705 | |\n| DBRX Converted | \ud83d\ude80 | GPT-2 | \u2705 | Mixtral | \u2705 | StarCoder2 | \u2705 | |\n| Deci | \u2705 | GPT-J | \u2705 | MOSS | \u2705 | XVERSE | \u2705 | |\n| DeepSeek-V2 | \ud83d\ude80 | GRIN-MoE | \ud83d\ude80 | MPT | \u2705 | Yi | \u2705 | |\n| DeepSeek-V2-Lite | \ud83d\ude80 | InternLM 1/2.5 | \ud83d\ude80 | OPT | \u2705 | | | |\n\n## Compatiblity \n\nWe aim for 100% compatibility with models quanted by AutoGPTQ <= 0.7.1 and will consider syncing future compatibilty on a case-by-case basis. \n\n## Platform/GPU Requirements\n\nGPTQModel is currently Linux only and requires CUDA capability >= 6.0 Nvidia GPU. \n\nWSL on Windows should work as well. 
\n\nROCM/AMD support will be re-added in a future version after everything on ROCM has been validated. Only fully validated features will be re-added from the original AutoGPTQ repo. \n\n## Install\n\n### PIP \n\n```bash\n# Include any specific modules needed using brackets. Example: pip install gptqmodel[sglang,vllm,bitblas] --no-build-isolation\npip install gptqmodel --no-build-isolation\n```\n\n### Install from source\n\n```bash\n# clone repo\ngit clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel\n\n# compile and install\n# You can optionally include specific modules like vllm, sglang, or bitblas by adding them in brackets. Example: pip install -vvv --no-build-isolation .[vllm,sglang,bitblas]\npip install -vvv --no-build-isolation .\n\n# If you have `uv` package version 0.1.16 or higher, you can use `uv pip` for potentially better dependency management\n# Include modules as needed: uv pip install -vvv --no-build-isolation .[vllm,sglang,bitblas]\nuv pip install -vvv --no-build-isolation .\n```\n\n### Script installation \n```bash\n# You can pass modules as arguments, e.g., --vllm --sglang --bitblas. 
Example: bash install.sh --vllm --sglang --bitblas\nbash install.sh\n```\n\n\n\n### Quantization and Inference\n\n> warning: this is just a showcase of the usage of basic apis in GPTQModel, which uses only one sample to quantize a much small model, quality of quantized model using such little samples may not good.\n\nBelow is an example for the simplest use of `gptqmodel` to quantize a model and inference after quantization:\n\n```py\nfrom transformers import AutoTokenizer\nfrom gptqmodel import GPTQModel, QuantizeConfig\n\npretrained_model_dir = \"facebook/opt-125m\"\nquant_output_dir = \"opt-125m-4bit\"\n\ntokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)\ncalibration_dataset = [\n tokenizer(\n \"The world is a wonderful place full of beauty and love.\"\n )\n]\n\nquant_config = QuantizeConfig(\n bits=4, # 4-bit\n group_size=128, # 128 is good balance between quality and performance\n)\n\n# load un-quantized model, by default, the model will always be loaded into CPU memory\nmodel = GPTQModel.from_pretrained(pretrained_model_dir, quant_config)\n\n# quantize model, the calibration_dataset should be list of dict whose keys can only be \"input_ids\" and \"attention_mask\"\nmodel.quantize(calibration_dataset)\n\n# save quantized model\nmodel.save_quantized(quant_output_dir)\n\n# load quantized model to the first GPU\nmodel = GPTQModel.from_quantized(quant_output_dir)\n\n# inference with model.generate\nprint(tokenizer.decode(model.generate(**tokenizer(\"gptqmodel is\", return_tensors=\"pt\").to(model.device))[0]))\n```\n\nFor more advanced features of model quantization, please reference to [this script](https://github.com/ModelCloud/GPTQModel/blob/main/examples/quantization/basic_usage_wikitext2.py)\n\n### How to Add Support for a New Model\n\nRead the [`gptqmodel/models/llama.py`](https://github.com/ModelCloud/GPTQModel/blob/5627f5ffeb3f19b1a2a97e3b6de6fbe668b0dc42/gptqmodel/models/llama.py) code which explains in detail via comments how 
the model support is defined. Use it as guide to PR for to new models. Most models follow the same pattern.\n\n### Evaluation on Downstream Tasks\n\nYou can use tasks defined in `gptqmodel.eval_tasks` to evaluate model's performance on specific down-stream task before and after quantization.\n\nThe predefined tasks support all causal-language-models implemented in [\ud83e\udd17 transformers](https://github.com/huggingface/transformers) and in this project.\n\n<details>\n\n<summary>Below is an example to evaluate `EleutherAI/gpt-j-6b` on sequence-classification task using `cardiffnlp/tweet_sentiment_multilingual` dataset:</summary>\n\n```python\nfrom functools import partial\n\nimport datasets\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig\nfrom gptqmodel import GPTQModel, QuantizeConfig\nfrom gptqmodel.eval_tasks import SequenceClassificationTask\n\nMODEL = \"EleutherAI/gpt-j-6b\"\nDATASET = \"cardiffnlp/tweet_sentiment_multilingual\"\nTEMPLATE = \"Question:What's the sentiment of the given text? 
Choices are {labels}.\\nText: {text}\\nAnswer:\"\nID2LABEL = {\n 0: \"negative\",\n 1: \"neutral\",\n 2: \"positive\"\n}\nLABELS = list(ID2LABEL.values())\n\n\ndef ds_refactor_fn(samples):\n text_data = samples[\"text\"]\n label_data = samples[\"label\"]\n\n new_samples = {\"prompt\": [], \"label\": []}\n for text, label in zip(text_data, label_data):\n prompt = TEMPLATE.format(labels=LABELS, text=text)\n new_samples[\"prompt\"].append(prompt)\n new_samples[\"label\"].append(ID2LABEL[label])\n\n return new_samples\n\n\n# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to(\"cuda:0\")\nmodel = GPTQModel.from_pretrained(MODEL, QuantizeConfig())\ntokenizer = AutoTokenizer.from_pretrained(MODEL)\n\ntask = SequenceClassificationTask(\n model=model,\n tokenizer=tokenizer,\n classes=LABELS,\n data_name_or_path=DATASET,\n prompt_col_name=\"prompt\",\n label_col_name=\"label\",\n **{\n \"num_samples\": 1000, # how many samples will be sampled to evaluation\n \"sample_max_len\": 1024, # max tokens for each sample\n \"block_max_len\": 2048, # max tokens for each data block\n # function to load dataset, one must only accept data_name_or_path as input\n # and return datasets.Dataset\n \"load_fn\": partial(datasets.load_dataset, name=\"english\"),\n # function to preprocess dataset, which is used for datasets.Dataset.map,\n # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]\n \"preprocess_fn\": ds_refactor_fn,\n # truncate label when sample's length exceed sample_max_len\n \"truncate_prompt\": False\n }\n)\n\n# note that max_new_tokens will be automatically specified internally based on given classes\nprint(task.run())\n\n# self-consistency\nprint(\n task.run(\n generation_config=GenerationConfig(\n num_beams=3,\n num_return_sequences=3,\n do_sample=True\n )\n )\n)\n```\n\n</details>\n\n## Learn More\n\n[tutorials](docs/tutorial) provide step-by-step guidance to integrate `gptqmodel` with your own project and some best practice 
principles.\n\n[examples](examples/README.md) provide plenty of example scripts that use `gptqmodel` in different ways.\n\n## Supported Evaluation Tasks\n\nCurrently, `gptqmodel` supports `LanguageModelingTask`, `SequenceClassificationTask` and `TextSummarizationTask`; more tasks will come soon!\n\n### Which kernel is used by default?\n\nGPTQModel tries the Marlin, Exllama v2, and Triton kernels, in that order, for maximum inference performance.\n\n# Acknowledgements\n\n* **Elias Frantar**, **Saleh Ashkboos**, **Torsten Hoefler** and **Dan Alistarh**: for creating [GPTQ](https://github.com/IST-DASLab/gptq) and [Marlin](https://github.com/IST-DASLab/marlin).\n* **PanQiWei**: for creating [AutoGPTQ](https://github.com/autogptq/AutoGPTQ), on which this project's code is based.\n* **FXMarty**: for maintaining and supporting [AutoGPTQ](https://github.com/autogptq/AutoGPTQ).\n* **Qwopqwop200**: for the quantization code used in this project, adapted from [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda).\n* **Turboderp**: for releasing the [Exllama v1](https://github.com/turboderp/exllama) and [Exllama v2](https://github.com/turboderp/exllamav2) kernels adapted for use in this project.\n* **FpgaMiner**: for the [GPTQ-Triton](https://github.com/fpgaminer/GPTQ-triton) kernels used in [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda), which are adapted into this project.\n",
"bugtrack_url": null,
"license": null,
"summary": "A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.",
"version": "1.0.6",
"project_urls": {
"Homepage": "https://github.com/ModelCloud/GPTQModel"
},
"split_keywords": [
"gptq",
" quantization",
" large-language-models",
" transformers",
" 4bit",
" llm"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6d7182783be8bda8ec724cb63b5750c2f6bb79c9fa583a9488d17b2105798978",
"md5": "d7fef4a2b01df0d0c4bb896cc17d6e40",
"sha256": "f3ba2b3213c6f862bb492a663e6dca8f4126b1de0fbe3e4b0121ad05fe40a420"
},
"downloads": -1,
"filename": "gptqmodel-1.0.6.tar.gz",
"has_sig": false,
"md5_digest": "d7fef4a2b01df0d0c4bb896cc17d6e40",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8.0",
"size": 166057,
"upload_time": "2024-09-26T16:01:07",
"upload_time_iso_8601": "2024-09-26T16:01:07.420436Z",
"url": "https://files.pythonhosted.org/packages/6d/71/82783be8bda8ec724cb63b5750c2f6bb79c9fa583a9488d17b2105798978/gptqmodel-1.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-26 16:01:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ModelCloud",
"github_project": "GPTQModel",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "gptqmodel"
}