<h1 align="center">GPTQModel</h1>
<p align="center">Production ready LLM model compression/quantization toolkit with accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.</p>
<p align="center">
<a href="https://github.com/ModelCloud/GPTQModel/releases" style="text-decoration:none;"><img alt="GitHub release" src="https://img.shields.io/github/release/ModelCloud/GPTQModel.svg"></a>
<a href="https://pypi.org/project/gptqmodel/" style="text-decoration:none;"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/gptqmodel"></a>
<a href="https://pypi.org/project/gptqmodel/" style="text-decoration:none;"><img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/gptqmodel"></a>
</p>
## News
* 11/11/2024 🚀 [1.2.0] Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api.
* 10/29/2024 🚀 [v1.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.1.0) IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage.
* 10/12/2024 ✨ [v1.0.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.9) Move AutoRound to optional and fix pip install regression in v1.0.8.
* 10/11/2024 ✨ [v1.0.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.8) Add wheel for python 3.12 and cuda 11.8.
* 10/08/2024 ✨ [v1.0.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.7) Fixed an issue where the (faster) Marlin kernel was not auto-selected for some models.
* 09/26/2024 ✨ [v1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed the loader for quantized Llama 3.2 Vision models.
* 09/26/2024 ✨ [v1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only quantization of the text layers is supported for now.
* 09/26/2024 ✨ [v1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added a control toggle to disable parallel packing.
* 09/18/2024 ✨ [v1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 ✨ [v1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
* 08/14/2024 ✨ [v1.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.0) 40% faster `packing`, Fixed Python 3.9 compat, added `lm_eval` api.
<details>
<summary>Archived News:</summary>
* 08/10/2024 🚀 [v0.9.11](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.11) Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values.
* 07/31/2024 🚀 [v0.9.10](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.10) Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto-calculate auto-round nsamples/seglen parameters based on the calibration dataset. Fixed save_quantized() being called on pre-quantized models with non-supported backends. HF Transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both quantization and inference.
* 07/25/2024 🚀 [v0.9.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.9): Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
* 07/13/2024 🚀 [v0.9.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.8):
Run quantized models directly with GPTQModel using the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). The Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
* 07/08/2024 🚀 [v0.9.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.7): InternLM 2.5 model support added.
* 07/08/2024 🚀 [v0.9.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.6): [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
* 07/05/2024 🚀 [v0.9.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.5): Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.
* 07/03/2024 🚀 [v0.9.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.4): HF Transformers integration added and a bug in Gemma 2 support fixed.
* 07/02/2024 🚀 [v0.9.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.3): Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refactoring.
* 06/30/2024 🚀 [v0.9.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.2): Added auto-padding of model in/out-features for exllama and exllama v2.
Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
* 06/29/2024 🚀 [v0.9.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.1): 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), new BITBLAS format/kernel, proper batching of the calibration dataset resulting in > 50% quantization speedup, security hash check of loaded model weights, tons of refactoring/usability improvements, bug fixes, and much more.
* 06/20/2024 ✨ GPTQModel [v0.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.0): Thanks to the ModelCloud team and the open-source ML community for all their work and contributions!
</details>
## Why should you use GPTQModel?
GPTQModel started out as a major refactor (fork) of AutoGPTQ but has since become a full stand-in replacement with a cleaner API, up-to-date model support, faster inference, faster quantization, higher-quality quants, and a pledge that ModelCloud, together with the open-source ML community, will make every effort to keep the library up to date with the latest advancements and model support.
## Why GPTQ specifically and not the dozens of other low-bit quantizers?
Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on par with or exceeds other 4-bit quantization methods in both quality recovery and production-level inference speed, measured by token latency and rps. GPTQ currently offers the optimal blend of quality and inference speed you would want in a real-world production system.
## Features
* 🚀 Extensive model support for: `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-MoE`, `Phi 3.5`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Phi-3`, `Qwen2MoE`, `DBRX` (Converted).
* ✨ 100% CI coverage for all supported models including quality/ppl regression.
* 🚀 vLLM inference integration for quantized model where format = `FORMAT.GPTQ`
* 🚀 SGLang inference integration for quantized model where format = `FORMAT.GPTQ`
* 🚀 [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
* 🚀 [Intel/IPEX](https://github.com/intel/intel-extension-for-pytorch) support added for 4 bit quantization/inference on CPU.
* 🚀 [BITBLAS](https://github.com/microsoft/BitBLAS) format/inference support from Microsoft
* 🚀`Sym=False` Support. AutoGPTQ has unusable `sym=false`. (Re-quant required)
* 🚀`lm_head` module quant inference support for further VRAM reduction.
* 🚀 Faster quantization: More than 50% faster for TinyLlama + 4090 with batching and large calibration dataset.
* 🚀 Better quality quants as measured by PPL. (Test config: defaults + `sym=True` + `FORMAT.GPTQ`, TinyLlama)
* 🚀 Model weights sharding support
* 🚀 Security: hash check of model weights on load
* 🚀 Over 50% faster PPL calculations (OPT)
* 🚀 Over 40% faster `packing` stage in quantization (Llama 3.1 8B)
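
The `packing` stage mentioned above stores low-bit integer weights densely inside wider integers. As a rough illustration only (this is not GPTQModel's actual storage layout, and the helper names `pack_4bit` and `unpack_4bit` are hypothetical), eight 4-bit values fit into one 32-bit word:

```python
def pack_4bit(values):
    """Pack 4-bit integers (0..15) into 32-bit words, eight values per word."""
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for offset, value in enumerate(values[start:start + 8]):
            if not 0 <= value <= 15:
                raise ValueError("each value must fit in 4 bits")
            # Place each value in its own 4-bit slot, lowest slot first.
            word |= value << (4 * offset)
        words.append(word)
    return words


def unpack_4bit(words):
    """Inverse of pack_4bit: recover the original 4-bit integers."""
    values = []
    for word in words:
        for offset in range(8):
            values.append((word >> (4 * offset)) & 0xF)
    return values


quantized = [3, 15, 0, 7, 8, 1, 12, 5]
packed = pack_4bit(quantized)
assert unpack_4bit(packed) == quantized  # round trip is lossless
```

Real kernels may choose a different bit order or interleaving to suit their memory access patterns; the slot order here is an arbitrary choice for the sketch.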
## Model Support: 🚀 (Added by GPTQModel)
[🤗 Pre-quantized models on HF](https://hf.co/ModelCloud)
| Model | | | | | | | | | | |
| ---------------- | --- | -------------- | --- | ---------------- | --- | ---------- | --- | --- | --- | --- |
| Baichuan | ✅ | Falcon | ✅ | Llama 3.2 Vision | 🚀 | Qwen | ✅ | | | |
| Bloom | ✅ | Gemma 2 | 🚀 | LongLLaMA | ✅ | Qwen2MoE | 🚀 | | | |
| ChatGLM | 🚀 | GPTBigCode | ✅ | MiniCPM3 | 🚀 | RefinedWeb | ✅ | | | |
| CodeGen | ✅ | GPTNeoX | ✅ | Mistral | ✅ | StableLM | ✅ | | | |
| Cohere | ✅ | GPT-2 | ✅ | Mixtral | ✅ | StarCoder2 | ✅ | | | |
| DBRX Converted | 🚀 | GPT-J | ✅ | MobileLLM | 🚀 | XVERSE | ✅ | | | |
| Deci | ✅ | Granite | 🚀 | MOSS | ✅ | Yi | ✅ | | | |
| DeepSeek-V2 | 🚀 | GRIN-MoE | 🚀 | MPT | ✅ | | | | | |
| DeepSeek-V2-Lite | 🚀 | InternLM 1/2.5 | 🚀 | OPT | ✅ | | | | | |
| EXAONE 3.0 | 🚀 | Llama 1/2/3 | ✅ | Phi/Phi-3 | 🚀 | | | | | |
## Quality: Quantized Llama-3.2-Instruct models with 100% avg recovery:
![plus-v2 5v3@2x](https://github.com/user-attachments/assets/dd223d08-f790-41ff-814d-4c2a98e4a3de)
## Platform Requirements
GPTQModel is validated for Linux x86_64 with Nvidia GPUs. Windows WSL2 may work but is untested.
## Install
### PIP/UV
```bash
# You can install optional modules like auto_round, vllm, sglang, bitblas, and ipex.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation
```
### Install from source
```bash
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# pip: compile and install
# You can install optional modules like auto_round, vllm, sglang, bitblas, and ipex.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation
```
### Quantization and Inference
Below is a basic sample using `GPTQModel` to quantize an LLM and perform post-quantization inference:
```py
from datasets import load_dataset
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize 1024 samples from C4 to use as calibration data.
calibration_dataset = [
    tokenizer(example["text"])
    for example in load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train"
    ).select(range(1024))
]

# 4-bit quantization with a group size of 128.
quant_config = QuantizeConfig(bits=4, group_size=128)

# Load the unquantized model, quantize it, and save the result.
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset)
model.save(quant_path)

# Reload the quantized model and run inference.
model = GPTQModel.load(quant_path)

result = model.generate(
    **tokenizer(
        "Uncovering deep insights begins with", return_tensors="pt"
    ).to(model.device)
)[0]
```
For more advanced quantization features, please refer to [this script](https://github.com/ModelCloud/GPTQModel/blob/main/examples/quantization/basic_usage_wikitext2.py).
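
To give a feel for what `bits=4` and `group_size=128` control, the sketch below applies plain round-to-nearest quantization to groups of weights, each group with its own scale and minimum. This is deliberately not the GPTQ algorithm (GPTQ additionally uses calibration data to compensate for rounding error), and `quantize_group`, `dequantize_group`, and `quantize` are hypothetical helper names, not part of the GPTQModel API.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = (1 << bits) - 1  # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Map each weight onto the integer grid 0..qmax.
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate float weights from integer codes."""
    return [lo + c * scale for c in codes]


def quantize(weights, bits=4, group_size=128):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Synthetic weights in [-1, 1]; real weights would come from a model layer.
weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(256)]
groups = quantize(weights, bits=4, group_size=128)
restored = [w for group in groups for w in dequantize_group(*group)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

In this sketch the reconstruction error of each weight is bounded by half of its group's scale, so a smaller group size tends to reduce error at the cost of storing more scales.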
### How to Add Support for a New Model
Read the [`gptqmodel/models/llama.py`](https://github.com/ModelCloud/GPTQModel/blob/5627f5ffeb3f19b1a2a97e3b6de6fbe668b0dc42/gptqmodel/models/llama.py) code, which explains in detail via comments how model support is defined. Use it as a guide when opening a PR for a new model. Most models follow the same pattern.
### Evaluation and Quality Benchmarks
GPTQModel inference is integrated into [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). We highly recommend using `lm-eval` rather than PPL to validate post-quantization model quality.
```
# currently gptqmodel is merged into lm-eval main but not yet released on pypi
pip install lm-eval[gptqmodel]
```
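
For context on the recommendation above, perplexity (PPL) is the exponential of the average negative log-likelihood per token, so it condenses quality into a single number tied to one text corpus. A minimal sketch of the formula itself, using made-up token log-probabilities rather than real model output (`perplexity` is a hypothetical helper, not a GPTQModel or `lm-eval` function):

```python
import math


def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical natural-log probabilities for a 4-token sequence.
example_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(example_logprobs)
```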
### Which kernel is used by default?
* `GPU`: Marlin, Exllama v2, Triton kernels in that order for maximum inference performance. Optional Microsoft/BITBLAS kernel can be toggled.
* `CPU`: Intel/IPEX kernel
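
The GPU ordering above amounts to a first-match fallback. The sketch below illustrates that idea only: `select_kernel`, the `supports` callbacks, and the example constraints are all hypothetical and do not reflect GPTQModel's real selection code or the kernels' actual compatibility rules.

```python
def select_kernel(candidates, layer):
    """Return the name of the first kernel whose check accepts the layer."""
    for name, supports in candidates:
        if supports(layer):
            return name
    raise RuntimeError("no compatible kernel found")


# Priority order taken from the list above; the checks are invented placeholders.
GPU_KERNELS = [
    ("marlin", lambda layer: layer["bits"] in (4, 8) and layer["sym"]),
    ("exllama_v2", lambda layer: layer["bits"] == 4),
    ("triton", lambda layer: True),
]

first_choice = select_kernel(GPU_KERNELS, {"bits": 4, "sym": True})
second_choice = select_kernel(GPU_KERNELS, {"bits": 4, "sym": False})
last_resort = select_kernel(GPU_KERNELS, {"bits": 2, "sym": False})
```

Keeping the candidates in an ordered list makes the priority explicit and lets an optional kernel be added by inserting one entry.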
## Citation
```
@misc{gptqmodel,
author = {ModelCloud.ai},
title = {GPTQModel},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
}
@article{frantar-gptq,
title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
year={2022},
journal={arXiv preprint arXiv:2210.17323}
}
@article{frantar2024marlin,
title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
journal={arXiv preprint arXiv:2408.11743},
year={2024}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/ModelCloud/GPTQModel",
"name": "gptqmodel",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9.0",
"maintainer_email": null,
"keywords": "gptq, quantization, large-language-models, transformers, 4bit, llm",
"author": "ModelCloud",
"author_email": "qubitium@modelcloud.ai",
"download_url": "https://files.pythonhosted.org/packages/fb/7c/99c89a8d8ce0d9a1208bde245175f2b5a5a21a875a9a2c371dde26fc50b7/gptqmodel-1.2.1.tar.gz",
"platform": "linux",
"bugtrack_url": null,
"license": null,
"summary": "A LLM quantization package with user-friendly apis. Based on GPTQ algorithm.",
"version": "1.2.1",
"project_urls": {
"Homepage": "https://github.com/ModelCloud/GPTQModel"
},
"split_keywords": [
"gptq",
" quantization",
" large-language-models",
" transformers",
" 4bit",
" llm"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fb7c99c89a8d8ce0d9a1208bde245175f2b5a5a21a875a9a2c371dde26fc50b7",
"md5": "e9ec12edaf7432cbe8efe64359ee259a",
"sha256": "da8fc5422cf2a3693d28bb46ee07261e3676b5b241341232a455aa8988f9c02f"
},
"downloads": -1,
"filename": "gptqmodel-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "e9ec12edaf7432cbe8efe64359ee259a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9.0",
"size": 175310,
"upload_time": "2024-11-11T14:56:07",
"upload_time_iso_8601": "2024-11-11T14:56:07.080038Z",
"url": "https://files.pythonhosted.org/packages/fb/7c/99c89a8d8ce0d9a1208bde245175f2b5a5a21a875a9a2c371dde26fc50b7/gptqmodel-1.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-11 14:56:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ModelCloud",
"github_project": "GPTQModel",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "gptqmodel"
}