GPTQModel


Name: GPTQModel
Version: 5.0.0
Summary: Production-ready LLM model compression/quantization toolkit with hw-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang.
Upload time: 2025-10-24 04:39:03
Author: None
Author email: ModelCloud <qubitium@modelcloud.ai>
Maintainer: None
Home page: None
Docs URL: None
Requires Python: >=3.10
License: Apache-2.0
Keywords: gptq, awq, qqq, autogptq, autoawq, eora, gar, quantization, large-language-models, transformers, llm, moe, compression
Requirements: accelerate, numpy, torch, safetensors, transformers, threadpoolctl, packaging, device-smi, protobuf, pillow, pypcre, hf_transfer, huggingface_hub, random_word, tokenicer, logbar, maturin, datasets, pyarrow, dill, torchao
<p align="center">
<div align=center>
<img src="https://github.com/user-attachments/assets/ab70eb1e-06e7-4dc9-83e5-bd562e1a78b2" width=500>
</div>
<h1 align="center">GPT-QModel</h1>
</p>
<p align="center">LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.</p>
<p align="center">
    <a href="https://github.com/ModelCloud/GPTQModel/releases" style="text-decoration:none;"><img alt="GitHub release" src="https://img.shields.io/github/release/ModelCloud/GPTQModel.svg"></a>
    <a href="https://pypi.org/project/gptqmodel/" style="text-decoration:none;"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/gptqmodel"></a>
    <a href="https://pepy.tech/projects/gptqmodel" style="text-decoration:none;"><img src="https://static.pepy.tech/badge/gptqmodel" alt="PyPI Downloads"></a>
    <a href="https://github.com/ModelCloud/GPTQModel/blob/main/LICENSE"><img src="https://img.shields.io/pypi/l/gptqmodel"></a>
    <a href="https://huggingface.co/modelcloud/"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-ModelCloud-%23ff8811.svg"></a>
    <a href="https://huggingface.co/models?search=gptq">
        <img alt="Huggingface - Models" src="https://img.shields.io/badge/🤗_5000+_models_available-8A2BE2">
    </a>
</p>

## Latest News
* 10/24/2025 [5.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v5.0.0): 🎉 Data-parallel quant support for `MoE` models on multi-gpu using `nogil` Python. `offload_to_disk` support enabled by
default to massively reduce `cpu` ram usage. New `Intel` and `AMD` cpu hw-accelerated `TorchFused` kernel. Packing stage is now 4x faster and inlined with quantization. `Vram` pressure for large models reduced during quantization.
`act_group_aware` is 16k+ times faster and now the default when `desc_act=False`, for higher quality recovery without the inference penalty of `desc_act=True`. New beta-quality `AWQ` support with full `gemm`,
`gemm_fast`, and `marlin` kernel support. `LFM`, `Ling`, and `Qwen3 Omni` model support.
`Bitblas` kernel updated to support the Bitblas `0.1.0.post1` release.
Quantization is now faster with reduced vram usage. Enhanced logging support with `LogBar`.
* 09/16/2025 [4.2.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.2.5): `hyb_act` renamed to `act_group_aware`. Removed finicky `torch` import within `setup.py`. Packing bug fix and prebuilt Pytorch 2.8 whls. 
* 09/12/2025 [4.2.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.2.0): ✨ New Models Support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New `fail_safe` `boolean` toggle to `.quantize()` to patch-fix non-activated `MoE` modules due to highly uneven MoE model training. Fixed LavaQwen2 compat. Patch fix GIL=0 cuda error for multi-gpu. Fix compat with autoround + new transformers. 
* 09/04/2025 [4.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.1.0): ✨ Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support. New experimental `mock_quantization` config to skip complex computational code paths during quantization to accelerate model quant testing. 
* 08/21/2025 [4.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.0.0): 🎉 New Group Aware Reordering (GAR) support. New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x Cpu Core scaling of packing stage. Early access Pytorch 2.8 fused-ops on Intel XPU for up to 50% speedup.

<details>

<summary>Archived News</summary>

* 10/17/2025 5.0.0-dev `main`: 👀 EoRA is now multi-gpu compatible. Fixed both quality stability and vram usage of multi-gpu quants. New LFM and Ling models support.
* 09/30/2025 5.0.0-dev `main`: 👀: New Data Parallel + Multi-GPU + Python 3.13T (PYTHON_GIL=0) equals 80%+ overall quant time reduction of large MoE models vs v4.2.5. 
* 09/29/2025 5.0.0-dev `main`: 🎉 New Qwen3 Omni model support. AWQ Marlin kernel integrated + many disk offload, threading, and memory usage fixes. 
* 09/24/2025 5.0.0-dev `main`: 🎉 Up to 90% cpu mem saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. `act_group_aware` now faster and auto enabled for GPTQ when `desc_act` is False for higher quality recovery. 
* 09/19/2025 5.0.0-dev `main`: 👀 Cpu memory saving of ~73.5% during quantization stage with new `offload_to_disk` quantization config property default to `True`. 
* 09/18/2025 5.0.0-dev `main`: 🎉 AWQ quantization support! Complete refactor and simplification of model definitions in preparation for future quantization formats.
* 08/19/2025 4.0.0-dev `main`: Fix quantization memory usage due to some model's incorrect application of `config.use_cache` during inference. Fixed `Transformers` >= 4.54.0 compat which changed layer forward return signature for some models. 
* 08/18/2025 4.0.0-dev `main`: GPT-Neo model support. Memory leak fix in error capture (stacktrace) and fixed `lm_head` quantization compatibility for many models.
* 07/31/2025 4.0.0-dev `main`: New Group Aware Reordering (GAR) support and prelim Pytorch 2.8 fused-ops for Intel XPU for up to 50% speedup. 
* 07/03/2025 4.0.0-dev `main`: New Baidu Ernie and Huawei PanGu model support.
* 07/02/2025 4.0.0-dev `main`: Gemma3 4B model compat fix.
* 05/29/2025 4.0.0-dev `main`: Falcon H1 model support. Fixed Transformers `4.52+` compat with Qwen 2.5 VL models.
* 05/19/2025 4.0.0-dev `main`: Qwen 2.5 Omni model support. 
* 05/05/2025 4.0.0-dev `main`: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x Cpu Core scaling of packing stage. 
* 04/29/2025 3.1.0-dev (Now 4.) `main`: Xiaomi Mimo model support. Qwen 3 and 3 MoE model support. New arg for `quantize(..., calibration_dataset_min_length=10)` to filter out bad calibration data that exists in public dataset (wikitext). 
* 04/13/2025 [3.0.0](https://github.com/ModelCloud/Model/releases/tag/v3.0.0): 🎉 New experimental `v2` quantization option for improved model quantization accuracy validated by `GSM8K_PLATINUM` [benchmarks](https://github.com/ModelCloud/Model#quantization-using-gptq-v2) vs original `gptq`. New `Phi4-MultiModal` model support. New Nvidia Nemotron-Ultra model support. New `Dream` model support. New experimental `multi-gpu` quantization support. Reduced vram usage. Faster quantization.
* 04/2/2025 [2.2.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.2.0): New `Qwen 2.5 VL` model support. New `samples` log column during quantization to track module activation in MoE models. `Loss` log column now color-coded to highlight modules that are friendly/resistant to quantization. Progress (per-step) stats during quantization now streamed to log file. Auto `bfloat16`  dtype loading for models based on model config. Fix kernel compile for Pytorch/ROCm. Slightly faster quantization and auto-resolve some low-level oom issues for smaller vram gpus. 
* 03/12/2025 [2.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.1.0): ✨ New `QQQ` quantization method and inference support!
New Google `Gemma 3` zero-day model support.
New Alibaba `Ovis 2` VL model support. 
New AMD `Instella` zero-day model support. New `GSM8K Platinum` and `MMLU-Pro` benchmarking support.
Peft Lora training with GPT-QModel is now 30%+ faster on all gpu and IPEX devices.
Auto detect MoE modules not activated during quantization due to insufficient calibration data. 
`ROCm` `setup.py` compat fixes. `Optimum` and `Peft` compat fixes.
Fixed `Peft` `bfloat16` training. 
* 03/03/2025 [2.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.0.0): 🎉 `GPTQ` quantization internals are now broken into multiple stages (processes) for feature expansion. 
Synced `Marlin` kernel inference quality fix from upstream. Added `MARLIN_FP16`, lower-quality but faster backend. 
`ModelScope` support added. Logging and cli progress bar output has been revamped with sticky bottom progress.
Fixed `generation_config.json` save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without `bos`. Fixed `group_size=-1` and `bits=3` packing regression. 
Fixed Qwen 2.5 MoE regressions. 
Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegated logging/progress bar to the [LogBar](https://github.com/modelcloud/logbar) pkg.
Fixed ROCm version auto-detection in `setup` install.
* 02/12/2025 [1.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.9.0): ⚡ Offload `tokenizer` fixes to [Toke(n)icer](https://github.com/modelcloud/tokenicer) pkg. Optimized `lm_head` quant time and vram usage.
  Optimized `DeepSeek v3/R1` model quant vram usage. Fixed `Optimum` compat regression in `v1.8.1`. 3x speed-up for `Torch` kernel when using Pytorch >= 2.5.0 with `model.optimize()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like `wikitext2`. 
* 02/08/2025 [1.8.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.8.1): ⚡ `DeepSeek v3/R1` model support. New flexible weight `packing`: allow quantized weights to be packed to `[int32, int16, int8]` dtypes. 
`Triton` and `Torch` kernels support the full range of the new `QuantizeConfig.pack_dtype`. 
New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no chance of oom. 
New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo. New `buffered_fwd: bool` control in `model.quantize()`. Over 50% quantization speed-up for visual (vl) models.  
Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
* 01/26/2025 [1.7.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.4): New `compile()` api for ~4-8% inference tps improvement. Faster `pack()` for post-quantization model save. `Triton` kernel validated for Intel/`XPU` when Intel Triton packages are installed. Fixed Transformers (bug) downcasting tokenizer class on save. 
* 01/20/2025 [1.7.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.3): New Telechat2 (China Telecom) and PhiMoE model support. Fixed `lm_head` weights duplicated in post-quantize save() for models with tied-embedding. 
* 01/19/2025 [1.7.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.2): Effective BPW (bits per weight) will now be logged during `load()`. Reduce loading time on Intel Arc A770/B580 `XPU` by 3.3x. Reduce memory usage in MLX conversion and fix Marlin kernel auto-select not checking CUDA compute version. 
* 01/17/2025 [1.7.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.0): 👀 ✨ `backend.MLX` added for runtime-conversion and execution of GPTQ models on Apple's `MLX` framework on Apple Silicon (M1+). Exports of `gptq` models to `mlx` are also now possible. We have added `mlx` exported models to [huggingface.co/ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2). ✨ `lm_head` quantization is now fully supported by GPTQModel without external pkg dependency. 
* 01/07/2025 [1.6.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.6.1): 🎉 New OpenAI api compatible end-point via `model.serve(host, port)`. Auto-enable flash-attention2 for inference.  Fixed `sym=False` loading regression. 
* 01/06/2025 [1.6.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.6.0): ⚡25% faster quantization. 35% reduction in vram usage vs v1.5. 👀 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU. Auto-tokenizer loader via `load()` api. For most models you no longer need to manually init a tokenizer for both inference and quantization.
* 01/01/2025 [1.5.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.1): 🎉 2025! Added `QuantizeConfig.device` to clearly define which device is used for quantization: default = `auto`. Non-quantized models are always loaded on cpu by-default and each layer is moved to `QuantizeConfig.device` during quantization to minimize vram usage. Compatibility fixes for `attn_implementation_autoset` in latest transformers. 

* 12/23/2024 [1.5.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.0): Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
* 12/19/2024 [1.4.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.5): Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage.
* 12/15/2024 [1.4.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.2): MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added. 
* 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` apis added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are in pending status. 
* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be the `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for transformers/optimum/peft upstream PR merge. Deprecated the saving of `Marlin` weight format since `Marlin` supports auto conversion of `gptq` format to `Marlin` during runtime. 

* 11/29/2024 [1.3.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.1) Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformer compat fix due to api deprecation in HF. Removed triton dependency. Triton kernel now optionally dependent on triton pkg. 

* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependency. 
* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Use `device-smi` and replace `gputil` + `psutil` depends. Fixed model unit tests. 

* 11/11/2024 🚀 [1.2.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.1) Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api. 

* 10/29/2024 🚀 [1.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.1.0) IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage. 

* 10/12/2024 ✨ [1.0.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.9) Move AutoRound to optional and fix pip install regression in v1.0.8.

* 10/11/2024 ✨ [1.0.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.8) Add wheel for python 3.12 and cuda 11.8.
* 10/08/2024 ✨ [1.0.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.7) Fixed marlin (faster) kernel was not auto-selected for some models.

* 09/26/2024 ✨ [1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed the quantized Llama 3.2 Vision model loader.
* 09/26/2024 ✨ [1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.

* 09/26/2024 ✨ [1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing. 
* 09/18/2024 ✨ [1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 ✨ [1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release. 
* 08/14/2024 ✨ [1.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.0) 40% faster `packing`, Fixed Python 3.9 compat, added `lm_eval` api. 
* 08/10/2024 🚀 [0.9.11](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.11) Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values. 
* 07/31/2024 🚀 [0.9.10](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.10) Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
* 07/25/2024 🚀 [0.9.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.9): Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.  
* 07/13/2024 🚀 [0.9.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.8):
Run quantized models directly in GPTQModel using the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). Marlin backend also
got full end-to-end in/out features padding to enhance current/future model compatibility.
* 07/08/2024 🚀 [0.9.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.7): InternLM 2.5 model support added.
* 07/08/2024 🚀 [0.9.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.6): [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
* 07/05/2024 🚀 [0.9.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.5): Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.
* 07/03/2024 🚀 [0.9.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.4): HF Transformers integration added and Gemma 2 support bug fixes.
* 07/02/2024 🚀 [0.9.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.3): Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refactoring.
* 06/30/2024 🚀 [0.9.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.2): Added auto-padding of model in/out-features for exllama and exllama v2. 
Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
* 06/29/2024 🚀 [0.9.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.1): With 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), BITBLAS new format/kernel, proper batching of the calibration dataset resulting in >50% quantization speedup, security hash check of loaded model weights, tons of refactoring/usability improvements, bug fixes and much more.
* 06/20/2024 ✨ [0.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.0): Thanks for all the work from the ModelCloud team and the opensource ML community for their contributions!
</details>

## What is GPT-QModel?
GPT-QModel is a production-ready LLM model compression/quantization toolkit with hw-accelerated inference support for both CPU/GPU via HF Transformers, vLLM, and SGLang.

Public and ModelCloud's internal tests have shown that GPTQ is on par with or exceeds other 4-bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and requests per second (rps). GPTQ has the optimal blend of quality and inference speed you need in a real-world production deployment. 

GPT-QModel supports not only GPTQ but also QQQ, GPTQ v2, and EoRA, with more quantization methods and enhancements planned. 

## Quantization Support

GPT-QModel has a modular design supporting multiple quantization methods and feature extensions.

| Quantization Feature       | GPT-QModel | Transformers | vLLM  | SGLang | Lora Training |
|----------------------------|------------|---|---|---|---------------|
| GPTQ                       | ✅          | ✅ | ✅ | ✅ | ✅             | 
| EoRA                       | ✅          | ✅ | ✅ | ✅ | x             | 
| Group Aware Act Reordering | ✅          | ✅ | ✅ | ✅ | ✅             |
| AWQ                        | ✅          | ✅* | ✅* | ✅* | ✅*             | 
| QQQ                        | ✅          | x | x | x | x             | 
| Rotation                   | ✅          | x | x | x | x             |  
| GPTQ v2*                   | ✅          | ✅ | ✅ | ✅ | ✅             | 

## Multi-Modal

Native support for some of the most popular multi-modal models:

| Multi-Modal              |   | 
|-------------------|---|
| Qwen 2.5 Omni     | ✅ | 
| Qwen2 VL          | ✅ | 
| Ovis 1.6 + 2      | ✅ | 
| Phi-4 MultiModal  | ✅ | 



## Features
* ✨ Native integration with HF [Transformers](https://github.com/huggingface/transformers), [Optimum](https://github.com/huggingface/optimum), and [Peft (main)](https://github.com/huggingface/peft)
* 🚀 [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) inference integration for quantized model with format = `FORMAT.GPTQ`
* ✨ GPTQ, AWQ, and QQQ quantization formats with hw-accelerated inference kernels. 
* 🚀 Data Parallelism for 80%+ quantization time reduction with multi-GPU.
* 🚀 Optimized for Python >= 3.13t (free threading) with lock-free threading.
* ✨ Linux, MacOS, Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), CPU (Intel/AMD/Apple Silicon).
* ✨ `Dynamic` mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether. 
* 🚀 Intel Torch 2.8 fused kernel support for XPU [`Arc` + `Datacenter Max`] and CPU [`avx`, `amx`, `xmx`].
* 🚀 Python 3.13.3t (free-threading, GIL disabled) support for multi-gpu accelerated quantization for MoE models and multi-core cpu boost for post-quant packing.
* ✨ Asymmetric `Sym=False` support (see the sketch after this list). Model weights sharding support with optional hash check of model weights on load.
* ✨ `lm_head` module quant inference support for further VRAM reduction.
* 🚀 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) format + dynamically compiled inference.
* 💯 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
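
The asymmetric `Sym=False` option listed above is toggled on the same `QuantizeConfig` used throughout this README; a minimal sketch (only `sym=False` differs from the quantization example further below):

```py
from gptqmodel import QuantizeConfig

# sym=False enables asymmetric quantization with a per-group zero-point
quant_config = QuantizeConfig(bits=4, group_size=128, sym=False)
```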


## Quality: GPTQ 4-bit (5.0 bpw) can match BF16
🤗 [ModelCloud quantized Vortex models on HF](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)

<img src=https://github.com/user-attachments/assets/c1b89394-f8f6-44e5-9949-bef15a124723 width="51%"> <img src=https://github.com/user-attachments/assets/23901236-10c5-4435-ac2f-06cf2e097f1e width="47%">

## Model Support  
| Model             |   |                   |   |                |   |                |   |                     |   |
|-------------------|---|-------------------|---|----------------|---|----------------|---|---------------------|---|
| Apertus           | ✅ | EXAONE 3.0        | ✅ | InternLM 1/2.5 | ✅ | Mixtral        | ✅ | Qwen 2/3 (Next/MoE) | ✅ |
| Baichuan          | ✅ | Falcon (H1)       | ✅ | Kimi K2        | ✅ | MobileLLM      | ✅ | Qwen 2/2.5 VL       | ✅ |
| Bloom             | ✅ | FastVLM           | ✅ | Klear          | ✅ | MOSS           | ✅ | Qwen 2.5/3 Omni     | ✅ |
| ChatGLM           | ✅ | Gemma 1/2/3       | ✅ | LING/RING      | ✅ | MPT            | ✅ | RefinedWeb          | ✅ |
| CodeGen           | ✅ | GPTBigCode        | ✅ | Llama 1-3.3    | ✅ | Nemotron H     | ✅ | StableLM            | ✅ |
| Cohere 1-2        | ✅ | GPT-Neo/GPT-NeoX  | ✅ | Llama 3.2 VL   | ✅ | Nemotron Ultra | ✅ | StarCoder2          | ✅ |
| DBRX Converted    | ✅ | GPT-2             | ✅ | Llama 4        | ✅ | OPT            | ✅ | TeleChat2           | ✅ |
| Deci              | ✅ | GPT-J             | ✅ | LongCatFlash   | ✅ | OLMo2          | ✅ | Yi                  | ✅ |
| DeepSeek-V2/V3/R1 | ✅ | GPT-OSS           | ✅ | LongLLaMA      | ✅ | Ovis 1.6/2     | ✅ | Seed-OSS            | ✅ |
| DeepSeek-V2-Lite  | ✅ | Granite           | ✅ | Instella       | ✅ | Phi 1-4        | ✅ | XVERSE              | ✅ |
| Dream             | ✅ | GRIN-MoE          | ✅ | MiniCPM3       | ✅ | PanGu-α        | ✅ |                     |   |
| ERNIE 4.5         | ✅ | Hymba             | ✅ | Mistral        | ✅ | Qwen 1/2/3     | ✅ |                     |   |


## Platform and HW Support 

GPT-QModel is validated for Linux, MacOS, and Windows 11:

| Platform   | Device             | Supported | Optimized Arch           | Kernels                                       |
|------------|--------------------|-----------|--------------------------|-----------------------------------------------|
| 🐧 Linux   | Nvidia GPU         | ✅        | `Ampere+`                | Marlin, Exllama V2, Exllama V1, Triton, Torch |
| 🐧 Linux   | AMD GPU            | ✅        | `7900XT+`, `ROCm 6.2+`   | Exllama V2, Exllama V1, Torch                 |
| 🐧 Linux   | Intel XPU          | ✅        | `Arc`, `Datacenter Max`  | Torch Fused (PyTorch 2.8+), Torch             |
| 🐧 Linux   | Intel/AMD CPU      | ✅        | `avx`, `amx`, `xmx`      | Torch Fused (PyTorch 2.8+), Torch             |
| 🍎 MacOS   | GPU (Metal) / CPU  | ✅        | `Apple Silicon`, `M1+`   | Torch, MLX via conversion                     |
| 🪟 Windows | GPU (Nvidia) / CPU | ✅        | `Nvidia`                 | Torch                                         |
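
Kernel selection is automatic per the table above, but a specific kernel can also be requested at load time; a minimal sketch, assuming the `BACKEND` enum exported by `gptqmodel` and the `backend` argument of `load()` (verify names against your installed version):

```py
from gptqmodel import GPTQModel, BACKEND  # BACKEND enum/argument assumed; check your installed version

# request the Marlin kernel explicitly instead of relying on auto-selection
model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
    backend=BACKEND.MARLIN,  # other kernels from the table, e.g. EXLLAMA_V2, TRITON, TORCH
)
result = model.generate("Uncovering deep insights begins with")[0]
print(model.tokenizer.decode(result))
```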


## Install

### PIP/UV 

```bash
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas]
pip install -v gptqmodel --no-build-isolation 
uv pip install -v gptqmodel --no-build-isolation
```

### Install from source

```bash
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# python3-dev is required; ninja (apt package: ninja-build) speeds up compilation
apt install python3-dev ninja-build

# pip: compile and install
# You can install optional modules like vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas]
pip install -v . --no-build-isolation
```

### Inference
Three-line API to use `GPT-QModel` for GPTQ model inference:

```py
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
```

To use models from [ModelScope](https://www.modelscope.cn/) instead of HuggingFace Hub, set an environment variable:
```shell
export GPTQMODEL_USE_MODELSCOPE=True
```
```py
from gptqmodel import GPTQModel
# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope
model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
```

### OpenAI API compatible end-point
```py
# load model using above inference guide first
model.serve(host="0.0.0.0", port="12345")
```
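
Any OpenAI-compatible client should then be able to talk to the server; a minimal sketch using the `openai` Python client, where the `/v1` base path and the model name passed in the request are assumptions to adapt to your deployment:

```py
from openai import OpenAI

# point the client at the local GPTQModel server started above (base path assumed to be /v1)
client = OpenAI(base_url="http://127.0.0.1:12345/v1", api_key="none")

completion = client.chat.completions.create(
    model="ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",  # assumed model id
    messages=[{"role": "user", "content": "Uncovering deep insights begins with"}],
)
print(completion.choices[0].message.content)
```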

### Quantization
Basic example of using `GPT-QModel` to quantize an LLM:

```py
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)
```

### Quantization using GPTQ V2* (Experimental, not MoE compatible, and results may not be better than v1)

Enable GPTQ v2 quantization by setting `v2 = True`.
```py
# Note: v2 is currently experimental, not MoE compatible, and requires 2-4x more vram to execute
# There are many reports of v2 not exceeding v1 quality, so please use it for testing only
# If oom on 1 gpu, set CUDA_VISIBLE_DEVICES=0,1 to expose 2 gpus and gptqmodel will auto-use the second gpu
quant_config = QuantizeConfig(bits=4, group_size=128, v2=True)
```
`Llama 3.1 8B-Instruct` quantized using `test/models/test_llama3_2.py`

| Method  | Bits/Group Size | ARC_CHALLENGE   | GSM8K_Platinum_COT | 
|---------|-----------------|-----------------|--------------------|
| GPTQ    | 4 / 128         | 49.15           | 48.30              |
| GPTQ v2 | 4 / 128         | 49.74  +1.20%   | 61.46  +27.25%     |
| GPTQ    | 3 / 128         | 39.93           | 43.26              |
| GPTQ v2 | 3 / 128         | 41.13  +3.01%   | 50.54  +16.83%     | 

### Quantization Inference
```py
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
```

### Quantization + EoRA Accuracy Recovery 

GPT-QModel now supports EoRA, a LoRA method that can further improve the accuracy of the quantized model.
```py
# higher rank improves accuracy at the cost of vram usage
# suggestion: test rank 64 and 32 before 128 or 256 as the latter may overfit while increasing memory usage
# Note: the Lora import path below is an assumption; see GPTQModel/examples/eora for the exact import
from gptqmodel.adapter.adapter import Lora

eora = Lora(
  # for eora generation, path is adapter save path; for load, it is loading path
  path=f"{quant_path}/eora_rank32", 
  rank=32,
)

# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
  adapter=eora,
  model_id_or_path=model_id,
  quantized_model_id_or_path=quant_path,
  calibration_dataset=calibration_dataset,
  calibration_dataset_concat_size=0,
)

# post-eora inference
model = GPTQModel.load(
  model_id_or_path=quant_path,
  adapter=eora
)

tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)

print(f"Result: {result}")
# For more details on EoRA please see GPTQModel/examples/eora
# Please use the benchmark tools in the later part of this README to evaluate EoRA effectiveness
```

For more advanced model quantization features, please refer to [this script](https://github.com/ModelCloud/GPTQModel/blob/main/examples/quantization/basic_usage_wikitext2.py)

### How to Add Support for a New Model

Read the [`gptqmodel/models/llama.py`](https://github.com/ModelCloud/GPTQModel/blob/5627f5ffeb3f19b1a2a97e3b6de6fbe668b0dc42/gptqmodel/models/llama.py) code, which explains in detail via comments how the model support is defined. Use it as a guide when submitting a PR for new models. Most models follow the same pattern.

### Evaluation and Quality Benchmarks

GPTQModel inference is integrated into both [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) and [evalplus](https://github.com/evalplus/evalplus).  
We highly recommend avoiding `ppl` and using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests and is not a good indicator of model output quality.  

```bash
# gptqmodel is integrated into lm-eval >= v0.4.7
pip install "lm-eval>=0.4.7"
```

```bash
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"
```

Below is a basic sample using the `GPTQModel.eval` API:

```py
from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"

# Use `lm-eval` as framework to evaluate the model
lm_eval_data = GPTQModel.eval(model_id, 
                    framework=EVAL.LM_EVAL, 
                    tasks=[EVAL.LM_EVAL.ARC_CHALLENGE])


# Use `evalplus` as framework to evaluate the model
evalplus_data = GPTQModel.eval(model_id, 
                    framework=EVAL.EVALPLUS, 
                    tasks=[EVAL.EVALPLUS.HUMAN])

```
### Dynamic Quantization (Per Module QuantizeConfig Override)

`QuantizeConfig.dynamic` provides per-module control: specific matching `modules` can be skipped for quantization entirely (negative matching)
or given a unique `[bits, group_size, sym, desc_act, mse, pack_dtype]` property override vs the base `QuantizeConfig` (positive match with override). 

Sample `QuantizeConfig.dynamic` usage:

```py
dynamic = {
    # `.*\.` matches the layers_node prefix
    # layer index starts at 0

    # positive match: layer index 18 (the 19th layer), gate module
    r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},

    # positive match: layer index 19 (the 20th layer), gate module (prefix defaults to positive if missing)
    r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},

    # negative match: skip layer index 20 (the 21st layer), gate module
    r"-:.*\.20\..*gate.*": {},

    # negative match: skip all down modules for all layers
    r"-:.*down.*": {},
}

```
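
The `dynamic` mapping is then attached to the base config; a minimal sketch, assuming `dynamic` is accepted as a `QuantizeConfig` keyword argument (it populates the `QuantizeConfig.dynamic` property described above):

```py
from gptqmodel import GPTQModel, QuantizeConfig

# the base config applies to every module not matched (or not skipped) by `dynamic`
quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration_dataset, batch_size=1)  # calibration_dataset as in the Quantization example above
model.save("Llama-3.2-1B-Instruct-gptqmodel-4bit-dynamic")
```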

### Group Aware Reordering (GAR)

Group Aware Reordering (GAR) is an enhanced activation reordering scheme designed to significantly improve the accuracy of quantized models without incurring additional inference overhead. Unlike traditional activation reordering, GAR restricts permutations to within individual groups or rearrangements of entire groups. This ensures each group's associated scales and zero-points remain efficiently accessible during inference, thereby avoiding any inference-time overhead.

How to enable GAR:

Set the `act_group_aware` parameter to `True` and disable the default activation reordering by setting `desc_act` to `False` in your `QuantizeConfig`. For example:

```python
quant_config = QuantizeConfig(bits=4, group_size=128, act_group_aware=True)
```


### Experimental Features

* GPTQ v2: set `v2=True` in quantization config.


### Attribution of Quantization Methods:

* GPTQ (v1): IST-DASLab, main-author: Elias Frantar, arXiv:2210.17323
* GPTQ (v2*): Yale Intelligent Computing Lab, main-author: Yuhang Li, arXiv:2504.02692. The v2 naming is by the Yale authors and is not endorsed by the original GPTQ authors.
* QQQ: Meituan, main-author Ying Zhang, arXiv:2406.09904
* EoRA: Nvidia, main-author: Shih-Yang Liu, arXiv preprint arXiv:2410.21271.
* GAR: Intel, main-author: T Gafni, A Karnieli, Y Hanani, [Paper](https://openaccess.thecvf.com/content/CVPR2025W/eLVM/html/Gafni_Dual_Precision_Quantization_for_Efficient_and_Accurate_Deep_Neural_Networks_CVPRW_2025_paper.html)
* AWQ: main-authors: Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song

## Citation

```bibtex
# GPT-QModel
@misc{qubitium2024gptqmodel,
  author = {ModelCloud.ai and qubitium@modelcloud.ai},
  title = {GPT-QModel},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
  note = {Contact: qubitium@modelcloud.ai},
  year = {2024},
}

# GPTQ
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, 
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}

# EoRA
@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}

# Group Aware Reordering (GAR)
@article{gar,
  title={Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference},
  author={Gafni, T. and Karnieli, A. and Hanani, Y.},
  journal={arXiv preprint arXiv:2505.14638},
  note={CVPR Workshops (eLVM) 2025},
  year={2025}
}

# GPTQ Marlin Kernel
@article{frantar2024marlin,
  title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
  author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2408.11743},
  year={2024}
}

# QQQ 
@article{zhang2024qqq,
      title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models}, 
      author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
      journal={arXiv preprint arXiv:2406.09904},
      year={2024}
}

# AWQ
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

# GPTQ v2
@article{li2025gptqv2,
  title={GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration}, 
  author={Yuhang Li and Ruokai Yin and Donghyun Lee and Shiting Xiao and Priyadarshini Panda},
  journal={arXiv preprint arXiv:2504.02692},
  year={2025}
}
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "GPTQModel",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "gptq, awq, qqq, autogptq, autoawq, eora, gar, quantization, large-language-models, transformers, llm, moe, compression",
    "author": null,
    "author_email": "ModelCloud <qubitium@modelcloud.ai>",
    "download_url": "https://files.pythonhosted.org/packages/67/79/409687c7705314fb8fe46299c941628c250fc9f156eeefc18d34639eb1fa/gptqmodel-5.0.0.tar.gz",
    "platform": null,
    "description": "<p align=center>\n<div align=center>\n<img src=\"https://github.com/user-attachments/assets/ab70eb1e-06e7-4dc9-83e5-bd562e1a78b2\" width=500>\n</div>\n<h1 align=\"center\">GPT-QModel</h1>\n</p>\n<p align=\"center\">LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.</p>\n<p align=\"center\">\n    <a href=\"https://github.com/ModelCloud/GPTQModel/releases\" style=\"text-decoration:none;\"><img alt=\"GitHub release\" src=\"https://img.shields.io/github/release/ModelCloud/GPTQModel.svg\"></a>\n    <a href=\"https://pypi.org/project/gptqmodel/\" style=\"text-decoration:none;\"><img alt=\"PyPI - Version\" src=\"https://img.shields.io/pypi/v/gptqmodel\"></a>\n    <a href=\"https://pepy.tech/projects/gptqmodel\" style=\"text-decoration:none;\"><img src=\"https://static.pepy.tech/badge/gptqmodel\" alt=\"PyPI Downloads\"></a>\n    <a href=\"https://github.com/ModelCloud/GPTQModel/blob/main/LICENSE\"><img src=\"https://img.shields.io/pypi/l/gptqmodel\"></a>\n    <a href=\"https://huggingface.co/modelcloud/\"><img src=\"https://img.shields.io/badge/\ud83e\udd17%20Hugging%20Face-ModelCloud-%23ff8811.svg\"></a>\n    <a href=\"https://huggingface.co/models?search=gptq\">\n        <img alt=\"Huggingface - Models\" src=\"https://img.shields.io/badge/\ud83e\udd17_5000+_models_available-8A2BE2\">\n    </a>\n</p>\n\n## Latest News\n* 10/24/2025 [5.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v5.0.0): \ud83c\udf89 Data-parallel quant support for `MoE` models on multi-gpu using `nogil` Python. `offload_to_disk` support enabled by \ndefault to massively reduce `cpu` ram usage. New `Intel` and `AMD` cpu hw accelerated `TorchFused` kernel. Packing stage is now 4x faster and now inlined with quantization. `Vram` pressure for large models reduced during quantization.\n`act_group_aware` is  16k+ times faster and now the default when `desc_act=False` for higher quality recovery without inference penalty of `desc_act=True`. New beta quality `AWQ` support with full `gemm`, \n`gemm_fast`, `marlin` kernel support. `LFM`, `Ling`, `Qwen3 Omni` model support. \n`Bitblas` kernel updated to support Bitblas `0.1.0.post1` reelase.\nQuantization is now faster with reduced vram usage. Enhanced logging support with `LogBar`.\n* 09/16/2025 [4.2.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.2.5): `hyb_act` renamed to `act_group_aware`. Removed finicky `torch` import within `setup.py`. Packing bug fix and prebuilt Pytorch 2.8 whls. \n* 09/12/2025 [4.2.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.2.0): \u2728 New Models Support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New `fail_safe` `boolean` toggle to `.quantize()` to patch-fix non-activated `MoE` modules due to highly uneven MoE model training. Fixed LavaQwen2 compat. Patch fix GIL=0 cuda error for multi-gpu. Fix compat with autoround + new transformers. \n* 09/04/2025 [4.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.1.0): \u2728 Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support.  New experiemental `mock_quantization` config to skip complex computational code paths during quantization to accelerate model quant testing. \n* 08/21/2025 [4.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v4.0.0): \ud83c\udf89 New Group Aware Reordering (GAR) support. 
New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x Cpu Core scaling of packing stage. Early access Pytorch 2.8 fused-ops on Intel XPU for up to 50% speedup.\n\n<details>\n\n<summary>Archived News</summary>\n* 10/17/2025 5.0.0-dev `main`: \ud83d\udc40: EoRA now multi-gpu compatible. Fixed both quality stability of multi-gpu quanta and vram usage. New LFM and Ling models support.\n* 09/30/2025 5.0.0-dev `main`: \ud83d\udc40: New Data Parallel + Multi-GPU + Python 3.13T (PYTHON_GIL=0) equals 80%+ overall quant time reduction of large MoE models vs v4.2.5. \n* 09/29/2025 5.0.0-dev `main`: \ud83c\udf89 New Qwen3 Omni model support. AWQ Marlin kernel integrated + many disk offload, threading, and memory usage fixes. \n* 09/24/2025 5.0.0-dev `main`: \ud83c\udf89 Up to 90% cpu mem saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. `act_group_aware` now faster and auto enabled for GPTQ when `desc_act` is False for higher quality recovery. \n* 09/19/2025 5.0.0-dev `main`: \ud83d\udc40 Cpu memory saving of ~73.5% during quantization stage with new `offload_to_disk` quantization config property default to `True`. \n* 09/18/2025 5.0.0-dev `main`: \ud83c\udf89 AWQ quantization support! Complete refractor and simplification of model definitions in prepreation for future quantization formats.\n* 08/19/2025 4.0.0-dev `main`: Fix quantization memory usage due to some model's incorrect application of `config.use_cache` during inference. Fixed `Transformers` >= 4.54.0 compat which changed layer forward return signature for some models. \n* 08/18/2025 4.0.0-dev `main`: GPT-Neo model support. Memory leak fix in error capture (stacktrace) and fixed `lm_head` quantization compatibility for many models.\n* 07/31/2025 4.0.0-dev `main`: New Group Aware Reordering (GAR) support and prelim Pytorch 2.8 fused-ops for Intel XPU for up to 50% speedup. \n* 07/03/2025 4.0.0-dev `main`: New Baidu Ernie and Huawei PanGu model support.\n* 07/02/2025 4.0.0-dev `main`: Gemma3 4B model compat fix.\n* 05/29/2025 4.0.0-dev `main`: Falcon H1 model support. Fixed Transformers `4.52+` compat with Qwen 2.5 VL models.\n* 05/19/2025 4.0.0-dev `main`: Qwen 2.5 Omni model support. \n* 05/05/2025 4.0.0-dev `main`: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x Cpu Core scaling of packing stage. \n* 04/29/2025 3.1.0-dev (Now 4.) `main`: Xiaomi Mimo model support. Qwen 3 and 3 MoE model support. New arg for `quantize(..., calibration_dataset_min_length=10)` to filter out bad calibration data that exists in public dataset (wikitext). \n* 04/13/2025 [3.0.0](https://github.com/ModelCloud/Model/releases/tag/v3.0.0): \ud83c\udf89 New experimental ` v2` quantization option for improved model quantization accuracy validated by `GSM8K_PLATINUM` [benchmarks](https://github.com/ModelCloud/Model#quantization-using-gptq-v2) vs original `gptq`. New `Phi4-MultiModal` model support . New Nvidia Nemotron-Ultra model support. New `Dream` model support. New experimental `multi-gpu` quantization support. Reduced vram usage. 
Faster quantization.\n* 04/2/2025 [2.2.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.2.0): New `Qwen 2.5 VL` model support. New `samples` log column during quantization to track module activation in MoE models. `Loss` log column now color-coded to highlight modules that are friendly/resistant to quantization. Progress (per-step) stats during quantization now streamed to log file. Auto `bfloat16`  dtype loading for models based on model config. Fix kernel compile for Pytorch/ROCm. Slightly faster quantization and auto-resolve some low-level oom issues for smaller vram gpus. \n* 03/12/2025 [2.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.1.0): \u2728 New `QQQ` quantization method and inference support!\nNew Google `Gemma 3` zero-day model support.\nNew Alibaba `Ovis 2` VL model support. \nNew AMD `Instella` zero-day model model support. New `GSM8K Platinum` and `MMLU-Pro` benchmarking suppport.\nPeft Lora training with GPT-QModel is now 30%+ faster on all gpu and IPEX devices.\nAuto detect MoE modules not activated during quantization due to insufficient calibration data. \n`ROCm` `setup.py` compat fixes. `Optimum` and `Peft` compat fixes.\nFixed `Peft` `bfloat16` training. \n* 03/03/2025 [2.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.0.0): \ud83c\udf89 `GPTQ` quantization internals are now broken into multiple stages (processes) for feature expansion. \nSynced `Marlin` kernel inference quality fix from upstream. Added `MARLIN_FP16`, lower-quality but faster backend. \n`ModelScope` support added. Logging and cli progress bar output has been revamped with sticky bottom progress.\nFixed `generation_config.json` save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without `bos`. Fixed `group_size=-1` and `bits=3` packing regression. \nFixed Qwen 2.5 MoE regressions. \nAdded CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegate loggin/progressbar to [LogBar](https://github.com/modelcloud/logbar) pkg.\nFix ROCm version auto detection in `setup` install.\n* 02/12/2025 [1.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.9.0): \u26a1 Offload `tokenizer` fixes to [Toke(n)icer](https://github.com/modelcloud/tokenicer) pkg. Optimized `lm_head` quant time and vram usage.\n  Optimized `DeepSeek v3/R1` model quant vram usage. Fixed `Optimum` compat regresion in `v1.8.1`. 3x speed-up for `Torch` kernel when using Pytorch >= 2.5.0 with `model.optimize()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic original GPTQ data packing strategy which may improve quant speed and accuracy for datasets like `wikitext2`. \n* 02/08/2025 [1.8.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.8.1): \u26a1 `DeepSeek v3/R1` model support. New flexible weight `packing`: allow quantized weights to be packed to `[int32, int16, int8]` dtypes. \n`Triton` and `Torch` kernels supports full range of new `QuantizeConfig.pack_dtype`. \nNew `auto_gc: bool` control in `quantize()` which can reduce quantization time for small model with no chance of oom. \nNew `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo. New `buffered_fwd: bool` control in `model.quantize()`. Over 50% quantization speed-up for visual (vl) models.  \nFixed `bits=3` packing and `group_size=-1` regression in v1.7.4.\n* 01/26/2025 [1.7.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.4): New `compile()` api for ~4-8% inference tps improvement. 
Faster `pack()` for post-quantization model save. `Triton` kernel validated for Intel/`XPU` when Intel Triton packages are installed. Fixed a Transformers bug that downcast the tokenizer class on save. \n* 01/20/2025 [1.7.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.3): New Telechat2 (China Telecom) and PhiMoE model support. Fixed `lm_head` weights duplicated in post-quantize save() for models with tied-embedding. \n* 01/19/2025 [1.7.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.2): Effective BPW (bits per weight) will now be logged during `load()`. Reduced loading time on Intel Arc A770/B580 `XPU` by 3.3x. Reduced memory usage in MLX conversion and fixed Marlin kernel auto-select not checking CUDA compute version. \n* 01/17/2025 [1.7.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.0): \ud83d\udc40 \u2728 `backend.MLX` added for runtime-conversion and execution of GPTQ models on Apple's `MLX` framework on Apple Silicon (M1+). Exports of `gptq` models to `mlx` also now possible. We have added `mlx` exported models to [huggingface.co/ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2). \u2728 `lm_head` quantization now fully supported by GPTQModel without external pkg dependency. \n* 01/07/2025 [1.6.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.6.1): \ud83c\udf89 New OpenAI api compatible end-point via `model.serve(host, port)`. Auto-enable flash-attention2 for inference. Fixed `sym=False` loading regression. \n* 01/06/2025 [1.6.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.6.0): \u26a125% faster quantization. 35% reduction in vram usage vs v1.5. \ud83d\udc40 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU. Auto-tokenizer loader via `load()` api. For most models you no longer need to manually init a tokenizer for both inference and quantization.\n* 01/01/2025 [1.5.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.1): \ud83c\udf89 2025! Added `QuantizeConfig.device` to clearly define which device is used for quantization: default = `auto`. Non-quantized models are always loaded on cpu by default and each layer is moved to `QuantizeConfig.device` during quantization to minimize vram usage. Compatibility fixes for `attn_implementation_autoset` in latest transformers. \n\n* 12/23/2024 [1.5.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.0): Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.\n* 12/19/2024 [1.4.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.5): Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage.\n* 12/15/2024 [1.4.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.2): MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added. \n* 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. 
Monkey patch `patch_vllm()` and `patch_hf()` apis added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are in pending status. \n* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and `-:` negative matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated the saving of `Marlin` weight format since `Marlin` supports auto conversion of `gptq` format to `Marlin` during runtime. \n\n* 11/29/2024 [1.3.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.1) Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformer compat fix due to api deprecation in HF. Removed triton dependency. Triton kernel now optionally dependent on triton pkg. \n\n* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependency. \n* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Use `device-smi` and replace `gputil` + `psutil` depends. Fixed model unit tests. \n\n* 11/11/2024 \ud83d\ude80 [1.2.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.1) Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api. \n\n* 10/29/2024 \ud83d\ude80 [1.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.1.0) IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage. \n\n* 10/12/2024 \u2728 [1.0.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.9) Move AutoRound to optional and fix pip install regression in v1.0.8.\n\n* 10/11/2024 \u2728 [1.0.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.8) Add wheel for python 3.12 and cuda 11.8.\n* 10/08/2024 \u2728 [1.0.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.7) Fixed marlin (faster) kernel not being auto-selected for some models.\n\n* 09/26/2024 \u2728 [1.0.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.6) Fixed quantized Llama 3.2 Vision model loader.\n* 09/26/2024 \u2728 [1.0.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.5) Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.\n\n* 09/26/2024 \u2728 [1.0.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.4) Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing. 
\n* 09/18/2024 \u2728 [1.0.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.3) Added Microsoft GRIN-MoE and MiniCPM3 support.\n* 08/16/2024 \u2728 [1.0.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.2) Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release. \n* 08/14/2024 \u2728 [1.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.0.0) 40% faster `packing`, fixed Python 3.9 compat, added `lm_eval` api. \n* 08/10/2024 \ud83d\ude80 [0.9.11](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.11) Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values. \n* 07/31/2024 \ud83d\ude80 [0.9.10](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.10) Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() when called on pre-quantized models with non-supported backends. HF transformers depend updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.\n* 07/25/2024 \ud83d\ude80 [0.9.9](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.9): Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.  \n* 07/13/2024 \ud83d\ude80 [0.9.8](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.8):\nRun quantized models directly in GPTQModel using the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). Marlin backend also\ngot full end-to-end in/out features padding to enhance current/future model compatibility.\n* 07/08/2024 \ud83d\ude80 [0.9.7](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.7): InternLM 2.5 model support added.\n* 07/08/2024 \ud83d\ude80 [0.9.6](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.6): [Intel/AutoRound](https://github.com/intel/auto-round) QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.\n* 07/05/2024 \ud83d\ude80 [0.9.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.5): Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.\n* 07/03/2024 \ud83d\ude80 [0.9.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.4): HF Transformers integration added and bug fixes for Gemma 2 support.\n* 07/02/2024 \ud83d\ude80 [0.9.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.3): Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refactoring.\n* 06/30/2024 \ud83d\ude80 [0.9.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.2): Added auto-padding of model in/out-features for exllama and exllama v2. \nFixed quantization of OPT and DeepSeek V2-Lite models. 
Fixed inference for DeepSeek V2-Lite.\n* 06/29/2024 \ud83d\ude80 [0.9.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.1): With 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), BITBLAS new format/kernel, proper batching of calibration dataset resulting in >50% quantization speedup, security hash check of loaded model weights, tons of refactoring/usability improvements, bug fixes and much more.\n* 06/20/2024 \u2728 [0.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v0.9.0): Thanks to the ModelCloud team and the opensource ML community for their contributions!\n</details>\n\n## What is GPT-QModel?\nGPT-QModel is a production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF Transformers, vLLM, and SGLang.\n\nPublic and ModelCloud's internal tests have shown that GPTQ is on par with and/or exceeds other 4bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and rps. GPTQ has the optimal blend of quality and inference speed you need in a real-world production deployment. \n\nGPT-QModel not only supports GPTQ but also QQQ, GPTQ v2, and EoRA, with more quantization methods and enhancements planned. \n\n## Quantization Support\n\nGPT-QModel has a modular design supporting multiple quantization methods and feature extensions.\n\n| Quantization Feature       | GPT-QModel | Transformers | vLLM  | SGLang | Lora Training |\n|----------------------------|------------|---|---|---|---------------|\n| GPTQ                       | \u2705          | \u2705 | \u2705 | \u2705 | \u2705             | \n| EoRA                       | \u2705          | \u2705 | \u2705 | \u2705 | x             | \n| Group Aware Act Reordering | \u2705          | \u2705 | \u2705 | \u2705 | \u2705             |\n| AWQ                        | \u2705          | \u2705* | \u2705* | \u2705* | \u2705*             | \n| QQQ                        | \u2705          | x | x | x | x             | \n| Rotation                   | \u2705          | x | x | x | x             |  \n| GPTQ v2*                   | \u2705          | \u2705 | \u2705 | \u2705 | \u2705             | \n\n## Multi-Modal\n\nNative support for some of the most popular multi-modal models:\n\n| Multi-Modal              |   | \n|-------------------|---|\n| Qwen 2.5 Omni     | \u2705 | \n| Qwen2 VL          | \u2705 | \n| Ovis 1.6 + 2      | \u2705 | \n| Phi-4 MultiModal  | \u2705 | \n\n\n\n## Features\n* \u2728 Native integration with HF [Transformers](https://github.com/huggingface/transformers), [Optimum](https://github.com/huggingface/optimum), and [Peft (main)](https://github.com/huggingface/peft)\n* \ud83d\ude80 [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang) inference integration for quantized models with format = `FORMAT.GPTQ`\n* \u2728 GPTQ, AWQ, and QQQ quantization formats with hw accelerated inference kernels. \n* \ud83d\ude80 Data Parallelism for 80%+ quantization time reduction with Multi-GPU.\n* \ud83d\ude80 Optimized for Python >= 3.13t (free threading) with lock-free threading.\n* \u2728 Linux, MacOS, Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), CPU (Intel/AMD/Apple Silicon).\n* \u2728 `Dynamic` mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether. 
\n* \ud83d\ude80 Intel Torch 2.8 fused kernel support for XPU [`Arc` + `Datacenter Max`] and CPU [`avx`, `amx`, `xmx`].\n* \ud83d\ude80 Python 3.13.3t (free-threading, GIL disabled) support for multi-gpu accelerated quantization for MoE models and multi-core cpu boost for post-quant packing.\n* \u2728 Asymmetric `sym=False` support. Model weight sharding support with optional hash check of model weights on load.\n* \u2728 `lm_head` module quant inference support for further VRAM reduction.\n* \ud83d\ude80 [Microsoft/BITBLAS](https://github.com/microsoft/BitBLAS) format + dynamically compiled inference.\n* \ud83d\udcaf 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.\n\n\n## Quality: GPTQ 4bit (5.0 bpw) can match BF16:\n\ud83e\udd17 [ModelCloud quantized Vortex models on HF](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)\n\n<img src=https://github.com/user-attachments/assets/c1b89394-f8f6-44e5-9949-bef15a124723 width=\"51%\"> <img src=https://github.com/user-attachments/assets/23901236-10c5-4435-ac2f-06cf2e097f1e width=\"47%\">\n\n## Model Support  \n| Model             |   |                   |   |                |   |                |   |                     |   |\n|-------------------|---|-------------------|---|----------------|---|----------------|---|---------------------|---|\n| Apertus           | \u2705 | EXAONE 3.0        | \u2705 | InternLM 1/2.5 | \u2705 | Mixtral        | \u2705 | Qwen 2/3 (Next/MoE) | \u2705 |\n| Baichuan          | \u2705 | Falcon (H1)       | \u2705 | Kimi K2        | \u2705 | MobileLLM      | \u2705 | Qwen 2/2.5 VL       | \u2705 |\n| Bloom             | \u2705 | FastVLM           | \u2705 | Klear          | \u2705 | MOSS           | \u2705 | Qwen 2.5/3 Omni     | \u2705 |\n| ChatGLM           | \u2705 | Gemma 1/2/3       | \u2705 | LING/RING      | \u2705 | MPT            | \u2705 | RefinedWeb          | \u2705 |\n| CodeGen           | \u2705 | GPTBigCode        | \u2705 | Llama 1-3.3    | \u2705 | Nemotron H     | \u2705 | StableLM            | \u2705 |\n| Cohere 1-2        | \u2705 | GPT-Neo/GPT-NeoX  | \u2705 | Llama 3.2 VL   | \u2705 | Nemotron Ultra | \u2705 | StarCoder2          | \u2705 |\n| DBRX Converted    | \u2705 | GPT-2             | \u2705 | Llama 4        | \u2705 | OPT            | \u2705 | TeleChat2           | \u2705 |\n| Deci              | \u2705 | GPT-J             | \u2705 | LongCatFlash   | \u2705 | OLMo2          | \u2705 | Yi                  | \u2705 |\n| DeepSeek-V2/V3/R1 | \u2705 | GPT-OSS           | \u2705 | LongLLaMA      | \u2705 | Ovis 1.6/2     | \u2705 | Seed-OSS            | \u2705 |\n| DeepSeek-V2-Lite  | \u2705 | Granite           | \u2705 | Instella       | \u2705 | Phi 1-4        | \u2705 | XVERSE              | \u2705 |\n| Dream             | \u2705 | GRIN-MoE          | \u2705 | MiniCPM3       | \u2705 | PanGu-\u03b1        | \u2705 |                     |   |\n| ERNIE 4.5         | \u2705 | Hymba             | \u2705 | Mistral        | \u2705 | Qwen 1/2/3     | \u2705 |                     |   |\n\n\n## Platform and HW Support \n\nGPT-QModel is validated for Linux, MacOS, and Windows 11:\n\n| Platform        | Device        |     |  Optimized Arch              | Kernels                                       |\n|-----------------|---------------| --- | -------------- |-----------------------------------------------| \n| \ud83d\udc27 Linux           | Nvidia GPU    | \u2705       | `Ampere+` | Marlin, Exllama V2, Exllama 
V1, Triton, Torch |\n| \ud83d\udc27 Linux | AMD GPU     | \u2705             |   `7900XT+`,  `ROCm 6.2+` | Exllama V2, Exllama V1, Torch                 |\n| \ud83d\udc27 Linux | Intel XPU     | \u2705             |   `Arc`, `Datacenter Max` | Torch Fused (Pytorch 2.8+), Torch              |\n| \ud83d\udc27 Linux           | Intel/AMD CPU | \u2705          | `avx`, `amx`, `xmx` | Torch Fused (Pytorch 2.8+), Torch              |\n| \ud83c\udf4e MacOS | GPU (Metal) / CPU          | \u2705             |   `Apple Silicon`, `M1+` | Torch, MLX via conversion                     |\n| \ud83e\ude9f Windows | GPU (Nvidia) / CPU       | \u2705             |   `Nvidia`  | Torch                                         |\n\n\n## Install\n\n### PIP/UV \n\n```bash\n# You can install optional modules like autoround, ipex, vllm, sglang, bitblas.\n# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas]\npip install -v gptqmodel --no-build-isolation \nuv pip install -v gptqmodel --no-build-isolation\n```\n\n### Install from source\n\n```bash\n# clone repo\ngit clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel\n\n# python3-dev is required, ninja is to speed up compile\napt install python3-dev ninja-build\n\n# pip: compile and install\n# You can install optional modules like vllm, sglang, bitblas.\n# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas]\npip install -v . --no-build-isolation\n```\n\n### Inference\nThree-line api to use `GPT-QModel` for gptq model inference:\n\n```py\nfrom gptqmodel import GPTQModel\n\nmodel = GPTQModel.load(\"ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5\")\nresult = model.generate(\"Uncovering deep insights begins with\")[0] # tokens\nprint(model.tokenizer.decode(result)) # string output\n```\n\nTo use models from [ModelScope](https://www.modelscope.cn/) instead of HuggingFace Hub, set an environment variable:\n```shell\nexport GPTQMODEL_USE_MODELSCOPE=True\n```\n```py\nfrom gptqmodel import GPTQModel\n# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope\nmodel = GPTQModel.load(\"Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4\")\nresult = model.generate(\"Uncovering deep insights begins with\")[0] # tokens\nprint(model.tokenizer.decode(result)) # string output\n```\n\n### OpenAI API compatible end-point\n```py\n# load model using above inference guide first\nmodel.serve(host=\"0.0.0.0\", port=\"12345\")\n```\n
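\nOnce `model.serve()` is running, any OpenAI-compatible client can talk to it. Below is a minimal client-side sketch using the official `openai` Python package; the `/v1` base path, the placeholder model name, and the use of the chat-completions route are assumptions about the served endpoint rather than documented behavior, so adjust them to whatever your server actually exposes.\n\n```py\n# hypothetical client sketch: assumes the server started above is reachable on port 12345\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://127.0.0.1:12345/v1\", api_key=\"none\")  # api_key is unused for a local server\nresp = client.chat.completions.create(\n    model=\"gptqmodel\",  # placeholder id; use whatever model name the server reports\n    messages=[{\"role\": \"user\", \"content\": \"Uncovering deep insights begins with\"}],\n)\nprint(resp.choices[0].message.content)\n```\n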
\n### Quantization\nBasic example of using `GPT-QModel` to quantize an llm model:\n\n```py\nfrom datasets import load_dataset\nfrom gptqmodel import GPTQModel, QuantizeConfig\n\nmodel_id = \"meta-llama/Llama-3.2-1B-Instruct\"\nquant_path = \"Llama-3.2-1B-Instruct-gptqmodel-4bit\"\n\ncalibration_dataset = load_dataset(\n    \"allenai/c4\",\n    data_files=\"en/c4-train.00001-of-01024.json.gz\",\n    split=\"train\"\n  ).select(range(1024))[\"text\"]\n\nquant_config = QuantizeConfig(bits=4, group_size=128)\n\nmodel = GPTQModel.load(model_id, quant_config)\n\n# increase `batch_size` to match gpu/vram specs to speed up quantization\nmodel.quantize(calibration_dataset, batch_size=1)\n\nmodel.save(quant_path)\n```\n\n### Quantization using GPTQ V2* (Experimental, not MoE compatible, and results may not be better than v1)\n\nEnable GPTQ v2 quantization by setting `v2 = True`.\n```py\n# Note v2 is currently experimental, not MoE compatible, and requires 2-4x more vram to execute\n# We have many reports of v2 not working better or exceeding v1 so please use for testing only\n# If oom on 1 gpu, please set CUDA_VISIBLE_DEVICES=0,1 to 2 gpus and gptqmodel will auto use the second gpu\nquant_config = QuantizeConfig(bits=4, group_size=128, v2=True)\n```\n`Llama 3.1 8B-Instruct` quantized using `test/models/test_llama3_2.py`\n\n| Method  | Bits/Group Size | ARC_CHALLENGE   | GSM8K_Platinum_COT | \n|---------|-----------------|-----------------|--------------------|\n| GPTQ    | 4 / 128         | 49.15           | 48.30              |\n| GPTQ v2 | 4 / 128         | 49.74  +1.20%   | 61.46  +27.25%     |\n| GPTQ    | 3 / 128         | 39.93           | 43.26              |\n| GPTQ v2 | 3 / 128         | 41.13  +3.01%   | 50.54  +16.83%     | \n\n### Quantization Inference\n```py\n# test post-quant inference\nmodel = GPTQModel.load(quant_path)\nresult = model.generate(\"Uncovering deep insights begins with\")[0] # tokens\nprint(model.tokenizer.decode(result)) # string output\n```\n
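\nSince GPT-QModel integrates natively with HF Transformers (see the Quantization Support table above), the saved checkpoint can typically also be loaded straight through Transformers. A minimal sketch, assuming a recent Transformers release with GPTQ support installed and the `quant_path` produced by the quantization example above:\n\n```py\n# hedged sketch: load the GPTQ checkpoint through plain HF Transformers instead of GPTQModel\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(quant_path)\nhf_model = AutoModelForCausalLM.from_pretrained(quant_path, device_map=\"auto\")\n\ninputs = tokenizer(\"Uncovering deep insights begins with\", return_tensors=\"pt\").to(hf_model.device)\nprint(tokenizer.decode(hf_model.generate(**inputs, max_new_tokens=32)[0]))\n```\n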
\n### Quantization + EoRA Accuracy Recovery \n\nGPT-QModel now supports EoRA, a LoRA method that can further improve the accuracy of the quantized model.\n```py\n# higher rank improves accuracy at the cost of vram usage\n# suggestion: test rank 64 and 32 before 128 or 256 as the latter may overfit while increasing memory usage\neora = Lora(\n  # for eora generation, path is adapter save path; for load, it is loading path\n  path=f\"{quant_path}/eora_rank32\", \n  rank=32,\n)\n\n# provide a previously gptq quantized model path\nGPTQModel.adapter.generate(\n  adapter=eora,\n  model_id_or_path=model_id,\n  quantized_model_id_or_path=quant_path,\n  calibration_dataset=calibration_dataset,\n  calibration_dataset_concat_size=0,\n)\n\n# post-eora inference\nmodel = GPTQModel.load(\n  model_id_or_path=quant_path,\n  adapter=eora\n)\n\ntokens = model.generate(\"Capital of France is\")[0]\nresult = model.tokenizer.decode(tokens)\n\nprint(f\"Result: {result}\")\n# For more details on EoRA please see GPTQModel/examples/eora\n# Please use the benchmark tools in the later part of this README to evaluate EoRA effectiveness\n```\n\nFor more advanced features of model quantization, please refer to [this script](https://github.com/ModelCloud/GPTQModel/blob/main/examples/quantization/basic_usage_wikitext2.py)\n\n### How to Add Support for a New Model\n\nRead the [`gptqmodel/models/llama.py`](https://github.com/ModelCloud/GPTQModel/blob/5627f5ffeb3f19b1a2a97e3b6de6fbe668b0dc42/gptqmodel/models/llama.py) code which explains in detail via comments how the model support is defined. Use it as a guide when submitting a PR for new models. Most models follow the same pattern.\n\n### Evaluation and Quality Benchmarks\n\nGPTQModel inference is integrated into both [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) and [evalplus](https://github.com/evalplus/evalplus).  \nWe highly recommend avoiding `ppl` and using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests and is not a good indicator of model output quality.  \n\n```\n# gptqmodel is integrated into lm-eval >= v0.4.7\npip install \"lm-eval>=0.4.7\"\n```\n\n```\n# gptqmodel is integrated into evalplus[main]\npip install -U \"evalplus @ git+https://github.com/evalplus/evalplus\"\n```\n\nBelow is a basic sample using the `GPTQModel.eval` API:\n\n```py\nfrom gptqmodel import GPTQModel\nfrom gptqmodel.utils.eval import EVAL\n\nmodel_id = \"ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1\"\n\n# Use `lm-eval` as framework to evaluate the model\nlm_eval_data = GPTQModel.eval(model_id, \n                    framework=EVAL.LM_EVAL, \n                    tasks=[EVAL.LM_EVAL.ARC_CHALLENGE])\n\n\n# Use `evalplus` as framework to evaluate the model\nevalplus_data = GPTQModel.eval(model_id, \n                    framework=EVAL.EVALPLUS, \n                    tasks=[EVAL.EVALPLUS.HUMAN])\n\n```\n\n### Dynamic Quantization (Per Module QuantizeConfig Override)\n\n`QuantizeConfig.dynamic` is a dynamic control which allows specific matching `modules` to be skipped for quantization (negative matching)\nor to have a unique `[bits, group_size, sym, desc_act, mse, pack_dtype]` property override per matching `module` vs the base `QuantizeConfig` (positive match with override). \n\nSample `QuantizeConfig.dynamic` usage:\n\n```py\ndynamic = { \n    # `.*\\.` matches the layers_node prefix \n    # layer index starts at 0 \n    \n    # positive match: layer 19, gate module \n    r\"+:.*\\.18\\..*gate.*\": {\"bits\": 4, \"group_size\": 32},  \n    \n    # positive match: layer 20, gate module (prefix defaults to positive if missing)\n    r\".*\\.19\\..*gate.*\": {\"bits\": 8, \"group_size\": 64},  \n    \n    # negative match: skip layer 21, gate module\n    r\"-:.*\\.20\\..*gate.*\": {}, \n    \n    # negative match: skip all down modules for all layers\n    r\"-:.*down.*\": {},  \n } \n\n```\n
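\nTo apply the overrides, pass the dict above to `QuantizeConfig` via its `dynamic` property and quantize as usual. A minimal sketch reusing `model_id` and `calibration_dataset` from the quantization example earlier in this README; the module-name regexes are model specific, so match them to your model's actual layer naming, and `dynamic_quant_path` is just an illustrative save location:\n\n```py\n# sketch: per-module overrides layered on top of the base 4-bit / group_size=128 config\ndynamic_quant_path = \"Llama-3.2-1B-Instruct-gptqmodel-4bit-dynamic\"\n\nquant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)\n\nmodel = GPTQModel.load(model_id, quant_config)\nmodel.quantize(calibration_dataset, batch_size=1)\nmodel.save(dynamic_quant_path)\n```\n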
\n### Group Aware Reordering (GAR)\n\nGroup Aware Reordering (GAR) is an enhanced activation reordering scheme designed to significantly improve the accuracy of quantized models without incurring additional inference overhead. Unlike traditional activation reordering, GAR restricts permutations to within individual groups or rearrangements of entire groups. This ensures each group's associated scales and zero-points remain efficiently accessible during inference, thereby avoiding any inference-time overhead.\n\nHow to enable GAR:\n\nSet the `act_group_aware` parameter to `True` and disable the default activation reordering by setting `desc_act` to `False` in your `QuantizeConfig`. For example:\n\n```python\nquant_config = QuantizeConfig(bits=4, group_size=128, act_group_aware=True)\n```\n\n\n### Experimental Features\n\n* GPTQ v2: set `v2=True` in quantization config.\n\n\n### Attribution of Quantization Methods:\n\n* GPTQ (v1): IST-DASLab, main-author: Elias Frantar, arXiv:2210.17323\n* GPTQ (v2*): Yale Intelligent Computing Lab, main-author: Yuhang Li, arXiv:2504.02692. The v2 naming is by the Yale authors and is not endorsed by the original GPTQ authors.\n* QQQ: Meituan, main-author: Ying Zhang, arXiv:2406.09904\n* EoRA: Nvidia, main-author: Shih-Yang Liu, arXiv preprint arXiv:2410.21271\n* GAR: Intel, main-authors: T. Gafni, A. Karnieli, Y. Hanani, [Paper](https://openaccess.thecvf.com/content/CVPR2025W/eLVM/html/Gafni_Dual_Precision_Quantization_for_Efficient_and_Accurate_Deep_Neural_Networks_CVPRW_2025_paper.html)\n* AWQ: main-authors: Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song\n\n## Citation\n\n```bibtex\n# GPT-QModel\n@misc{qubitium2024gptqmodel,\n  author = {ModelCloud.ai and qubitium@modelcloud.ai},\n  title = {GPT-QModel},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/modelcloud/gptqmodel}},\n  note = {Contact: qubitium@modelcloud.ai},\n  year = {2024},\n}\n\n# GPTQ\n@article{frantar-gptq,\n  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, \n  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},\n  journal={arXiv preprint arXiv:2210.17323},\n  year={2022}\n}\n\n# EoRA\n@article{liu2024eora,\n  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},\n  author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},\n  journal={arXiv preprint arXiv:2410.21271},\n  year={2024}\n}\n\n# Group Aware Reordering (GAR)\n@article{gar,\n  title={Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference, CVPRW 2025},\n  author={T. Gafni, A. Karnieli, Y. Hanani},\n  journal={arXiv preprint arXiv:2505.14638},\n  year={2025}\n}\n\n# GPTQ Marlin Kernel\n@article{frantar2024marlin,\n  title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},\n  author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},\n  journal={arXiv preprint arXiv:2408.11743},\n  year={2024}\n}\n\n# QQQ\n@article{zhang2024qqq,\n  title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models}, \n  author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},\n  journal={arXiv preprint arXiv:2406.09904},\n  year={2024}\n}\n\n# AWQ\n@article{lin2023awq,\n  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},\n  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},\n  journal={arXiv},\n  year={2023}\n}\n\n# GPTQ v2\n@article{li2025gptqv2,\n  title={GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration}, \n  author={Yuhang Li and Ruokai Yin and Donghyun Lee and Shiting Xiao and Priyadarshini Panda},\n  journal={arXiv preprint arXiv:2504.02692},\n  year={2025}\n}\n```\n",
    "bugtrack_url": null,
    "license": "Apache License\n                                   Version 2.0, January 2004\n                                http://www.apache.org/licenses/\n        \n           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n        \n           1. Definitions.\n        \n              \"License\" shall mean the terms and conditions for use, reproduction,\n              and distribution as defined by Sections 1 through 9 of this document.\n        \n              \"Licensor\" shall mean the copyright owner or entity authorized by\n              the copyright owner that is granting the License.\n        \n              \"Legal Entity\" shall mean the union of the acting entity and all\n              other entities that control, are controlled by, or are under common\n              control with that entity. For the purposes of this definition,\n              \"control\" means (i) the power, direct or indirect, to cause the\n              direction or management of such entity, whether by contract or\n              otherwise, or (ii) ownership of fifty percent (50%) or more of the\n              outstanding shares, or (iii) beneficial ownership of such entity.\n        \n              \"You\" (or \"Your\") shall mean an individual or Legal Entity\n              exercising permissions granted by this License.\n        \n              \"Source\" form shall mean the preferred form for making modifications,\n              including but not limited to software source code, documentation\n              source, and configuration files.\n        \n              \"Object\" form shall mean any form resulting from mechanical\n              transformation or translation of a Source form, including but\n              not limited to compiled object code, generated documentation,\n              and conversions to other media types.\n        \n              \"Work\" shall mean the work of authorship, whether in Source or\n              Object form, made available under the License, as indicated by a\n              copyright notice that is included in or attached to the work\n              (an example is provided in the Appendix below).\n        \n              \"Derivative Works\" shall mean any work, whether in Source or Object\n              form, that is based on (or derived from) the Work and for which the\n              editorial revisions, annotations, elaborations, or other modifications\n              represent, as a whole, an original work of authorship. For the purposes\n              of this License, Derivative Works shall not include works that remain\n              separable from, or merely link (or bind by name) to the interfaces of,\n              the Work and Derivative Works thereof.\n        \n              \"Contribution\" shall mean any work of authorship, including\n              the original version of the Work and any modifications or additions\n              to that Work or Derivative Works thereof, that is intentionally\n              submitted to Licensor for inclusion in the Work by the copyright owner\n              or by an individual or Legal Entity authorized to submit on behalf of\n              the copyright owner. 
For the purposes of this definition, \"submitted\"\n              means any form of electronic, verbal, or written communication sent\n              to the Licensor or its representatives, including but not limited to\n              communication on electronic mailing lists, source code control systems,\n              and issue tracking systems that are managed by, or on behalf of, the\n              Licensor for the purpose of discussing and improving the Work, but\n              excluding communication that is conspicuously marked or otherwise\n              designated in writing by the copyright owner as \"Not a Contribution.\"\n        \n              \"Contributor\" shall mean Licensor and any individual or Legal Entity\n              on behalf of whom a Contribution has been received by Licensor and\n              subsequently incorporated within the Work.\n        \n           2. Grant of Copyright License. Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              copyright license to reproduce, prepare Derivative Works of,\n              publicly display, publicly perform, sublicense, and distribute the\n              Work and such Derivative Works in Source or Object form.\n        \n           3. Grant of Patent License. Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              (except as stated in this section) patent license to make, have made,\n              use, offer to sell, sell, import, and otherwise transfer the Work,\n              where such license applies only to those patent claims licensable\n              by such Contributor that are necessarily infringed by their\n              Contribution(s) alone or by combination of their Contribution(s)\n              with the Work to which such Contribution(s) was submitted. If You\n              institute patent litigation against any entity (including a\n              cross-claim or counterclaim in a lawsuit) alleging that the Work\n              or a Contribution incorporated within the Work constitutes direct\n              or contributory patent infringement, then any patent licenses\n              granted to You under this License for that Work shall terminate\n              as of the date such litigation is filed.\n        \n           4. Redistribution. 
You may reproduce and distribute copies of the\n              Work or Derivative Works thereof in any medium, with or without\n              modifications, and in Source or Object form, provided that You\n              meet the following conditions:\n        \n              (a) You must give any other recipients of the Work or\n                  Derivative Works a copy of this License; and\n        \n              (b) You must cause any modified files to carry prominent notices\n                  stating that You changed the files; and\n        \n              (c) You must retain, in the Source form of any Derivative Works\n                  that You distribute, all copyright, patent, trademark, and\n                  attribution notices from the Source form of the Work,\n                  excluding those notices that do not pertain to any part of\n                  the Derivative Works; and\n        \n              (d) If the Work includes a \"NOTICE\" text file as part of its\n                  distribution, then any Derivative Works that You distribute must\n                  include a readable copy of the attribution notices contained\n                  within such NOTICE file, excluding those notices that do not\n                  pertain to any part of the Derivative Works, in at least one\n                  of the following places: within a NOTICE text file distributed\n                  as part of the Derivative Works; within the Source form or\n                  documentation, if provided along with the Derivative Works; or,\n                  within a display generated by the Derivative Works, if and\n                  wherever such third-party notices normally appear. The contents\n                  of the NOTICE file are for informational purposes only and\n                  do not modify the License. You may add Your own attribution\n                  notices within Derivative Works that You distribute, alongside\n                  or as an addendum to the NOTICE text from the Work, provided\n                  that such additional attribution notices cannot be construed\n                  as modifying the License.\n        \n              You may add Your own copyright statement to Your modifications and\n              may provide additional or different license terms and conditions\n              for use, reproduction, or distribution of Your modifications, or\n              for any such Derivative Works as a whole, provided Your use,\n              reproduction, and distribution of the Work otherwise complies with\n              the conditions stated in this License.\n        \n           5. Submission of Contributions. Unless You explicitly state otherwise,\n              any Contribution intentionally submitted for inclusion in the Work\n              by You to the Licensor shall be under the terms and conditions of\n              this License, without any additional terms or conditions.\n              Notwithstanding the above, nothing herein shall supersede or modify\n              the terms of any separate license agreement you may have executed\n              with Licensor regarding such Contributions.\n        \n           6. Trademarks. This License does not grant permission to use the trade\n              names, trademarks, service marks, or product names of the Licensor,\n              except as required for reasonable and customary use in describing the\n              origin of the Work and reproducing the content of the NOTICE file.\n        \n           7. 
Disclaimer of Warranty. Unless required by applicable law or\n              agreed to in writing, Licensor provides the Work (and each\n              Contributor provides its Contributions) on an \"AS IS\" BASIS,\n              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n              implied, including, without limitation, any warranties or conditions\n              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n              PARTICULAR PURPOSE. You are solely responsible for determining the\n              appropriateness of using or redistributing the Work and assume any\n              risks associated with Your exercise of permissions under this License.\n        \n           8. Limitation of Liability. In no event and under no legal theory,\n              whether in tort (including negligence), contract, or otherwise,\n              unless required by applicable law (such as deliberate and grossly\n              negligent acts) or agreed to in writing, shall any Contributor be\n              liable to You for damages, including any direct, indirect, special,\n              incidental, or consequential damages of any character arising as a\n              result of this License or out of the use or inability to use the\n              Work (including but not limited to damages for loss of goodwill,\n              work stoppage, computer failure or malfunction, or any and all\n              other commercial damages or losses), even if such Contributor\n              has been advised of the possibility of such damages.\n        \n           9. Accepting Warranty or Additional Liability. While redistributing\n              the Work or Derivative Works thereof, You may choose to offer,\n              and charge a fee for, acceptance of support, warranty, indemnity,\n              or other liability obligations and/or rights consistent with this\n              License. However, in accepting such obligations, You may act only\n              on Your own behalf and on Your sole responsibility, not on behalf\n              of any other Contributor, and only if You agree to indemnify,\n              defend, and hold each Contributor harmless for any liability\n              incurred by, or claims asserted against, such Contributor by reason\n              of your accepting any such warranty or additional liability.\n        \n           END OF TERMS AND CONDITIONS\n        \n           APPENDIX: How to apply the Apache License to your work.\n        \n              To apply the Apache License to your work, attach the following\n              boilerplate notice, with the fields enclosed by brackets \"[]\"\n              replaced with your own identifying information. (Don't include\n              the brackets!)  The text should be enclosed in the appropriate\n              comment syntax for the file format. 
We also recommend that a\n              file or class name and description of purpose be included on the\n              same \"printed page\" as the copyright notice for easier\n              identification within third-party archives.\n        \n           Copyright [yyyy] [name of copyright owner]\n        \n           Licensed under the Apache License, Version 2.0 (the \"License\");\n           you may not use this file except in compliance with the License.\n           You may obtain a copy of the License at\n        \n               http://www.apache.org/licenses/LICENSE-2.0\n        \n           Unless required by applicable law or agreed to in writing, software\n           distributed under the License is distributed on an \"AS IS\" BASIS,\n           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n           See the License for the specific language governing permissions and\n           limitations under the License.",
    "summary": "Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.",
    "version": "5.0.0",
    "project_urls": {
        "Homepage": "https://github.com/ModelCloud/GPTQModel"
    },
    "split_keywords": [
        "gptq",
        " awq",
        " qqq",
        " autogptq",
        " autoawq",
        " eora",
        " gar",
        " quantization",
        " large-language-models",
        " transformers",
        " llm",
        " moe",
        " compression"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6779409687c7705314fb8fe46299c941628c250fc9f156eeefc18d34639eb1fa",
                "md5": "82770cfbd5673309e0dcd21081d6801d",
                "sha256": "f1fb1fa0c2170e8d0c4d18eaec636142029eb5de4eb1fdd82b0163ae5c1d01ee"
            },
            "downloads": -1,
            "filename": "gptqmodel-5.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "82770cfbd5673309e0dcd21081d6801d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 576816,
            "upload_time": "2025-10-24T04:39:03",
            "upload_time_iso_8601": "2025-10-24T04:39:03.475923Z",
            "url": "https://files.pythonhosted.org/packages/67/79/409687c7705314fb8fe46299c941628c250fc9f156eeefc18d34639eb1fa/gptqmodel-5.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-24 04:39:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ModelCloud",
    "github_project": "GPTQModel",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "accelerate",
            "specs": [
                [
                    ">=",
                    "1.10.1"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.2.6"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "2.8.0"
                ]
            ]
        },
        {
            "name": "safetensors",
            "specs": [
                [
                    ">=",
                    "0.6.2"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.57.1"
                ]
            ]
        },
        {
            "name": "threadpoolctl",
            "specs": [
                [
                    ">=",
                    "3.6.0"
                ]
            ]
        },
        {
            "name": "packaging",
            "specs": [
                [
                    ">=",
                    "24.2"
                ]
            ]
        },
        {
            "name": "device-smi",
            "specs": [
                [
                    ">=",
                    "0.5.1"
                ]
            ]
        },
        {
            "name": "protobuf",
            "specs": [
                [
                    ">=",
                    "6.32.0"
                ]
            ]
        },
        {
            "name": "pillow",
            "specs": [
                [
                    ">=",
                    "11.3.0"
                ]
            ]
        },
        {
            "name": "pypcre",
            "specs": [
                [
                    ">=",
                    "0.2.3"
                ]
            ]
        },
        {
            "name": "hf_transfer",
            "specs": [
                [
                    ">=",
                    "0.1.9"
                ]
            ]
        },
        {
            "name": "huggingface_hub",
            "specs": [
                [
                    ">=",
                    "0.34.4"
                ]
            ]
        },
        {
            "name": "random_word",
            "specs": [
                [
                    ">=",
                    "1.0.13"
                ]
            ]
        },
        {
            "name": "tokenicer",
            "specs": [
                [
                    ">=",
                    "0.0.5"
                ]
            ]
        },
        {
            "name": "logbar",
            "specs": [
                [
                    ">=",
                    "0.1.7"
                ]
            ]
        },
        {
            "name": "maturin",
            "specs": [
                [
                    ">=",
                    "1.9.4"
                ]
            ]
        },
        {
            "name": "datasets",
            "specs": [
                [
                    ">=",
                    "3.6.0"
                ]
            ]
        },
        {
            "name": "pyarrow",
            "specs": [
                [
                    ">=",
                    "21.0"
                ]
            ]
        },
        {
            "name": "dill",
            "specs": [
                [
                    ">=",
                    "0.3.8"
                ]
            ]
        },
        {
            "name": "pypcre",
            "specs": [
                [
                    ">=",
                    "0.2.4"
                ]
            ]
        },
        {
            "name": "torchao",
            "specs": [
                [
                    ">=",
                    "0.14.0"
                ]
            ]
        }
    ],
    "lcname": "gptqmodel"
}
        