<div align="center">
AutoRound
===========================
<h3> Advanced Quantization Algorithm for LLMs</h3>
<a href="https://huggingface.co/Intel">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
</a>
---
<div align="left">
## 🚀 What is AutoRound?
AutoRound is an advanced quantization library designed for Large Language Models (LLMs) and Vision-Language Models (VLMs).
It delivers high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and offering broad hardware compatibility.
For more details, see our [paper](https://arxiv.org/pdf/2309.05516), and explore quantized models published by several Hugging Face organizations, e.g. [Intel](https://huggingface.co/Intel), [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup)
and [fbaldassarri](https://huggingface.co/fbaldassarri). For usage instructions, please refer to the [User Guide](./docs/step_by_step.md).
<p align="center">
<img src="docs/imgs/autoround_overview.png" alt="AutoRound Overview" width="80%">
</p>
## 🆕 What's New
[2025/09] AutoRound now includes experimental support for the MXFP4 and NVFP4 dtypes. For accuracy results, see the [documentation](./docs/mxnv_acc.md). We currently recommend exporting to the LLM-Compressor format.
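As a rough illustration (not an official recipe), the scheme and format names below are the ones listed in the API section of this README:
```python
from auto_round import AutoRound

# Sketch only: quantize to MXFP4 and export in the LLM-Compressor format.
# Accuracy results for the MX/NV dtypes are collected in docs/mxnv_acc.md.
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="MXFP4")
ar.quantize_and_save(output_dir="./tmp_mxfp4", format="llm_compressor")
```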
[2025/08] AutoRound now provides experimental support for an improved INT2 algorithm via `--enable_alg_ext`. See this [documentation](./docs/alg_202508.md)
for some accuracy results.
[2025/07] AutoRound now offers experimental support for the **GGUF** format and recommends the optimized RTN mode (`--iters 0`) for
all bit widths other than 3 bits. Example
models: [Intel/Qwen3-235B-A22B-q2ks-mixed-AutoRound](https://huggingface.co/Intel/Qwen3-235B-A22B-q2ks-mixed-AutoRound)
and [Intel/DeepSeek-R1-0528-q2ks-mixed-AutoRound](https://huggingface.co/Intel/DeepSeek-R1-0528-q2ks-mixed-AutoRound). **A more advanced algorithm** tailored for specific configurations may be available in
v0.7.1.
[2025/05] AutoRound has been integrated into **vLLM**. You can now run models in the AutoRound format directly with
vLLM versions later than v0.8.5.post1.
[2025/04] AutoRound has been integrated into **Transformers**. You can run models in the AutoRound format directly
with Transformers versions later than 4.51.3.
[2025/03] The INT2-mixed **DeepSeek-R1** model (~200GB) retains 97.9% accuracy. Check
out [OPEA/DeepSeek-R1-int2-mixed-sym-inc](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc).
## ✨ Key Features
✅ **Superior Accuracy**
Delivers strong performance even at 2–3 bits ([example models](https://huggingface.co/collections/OPEA/2-3-bits-67a5f0bc6b49d73c01b4753b)), with leading results at 4 bits ([benchmark](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard)).
✅ **Ecosystem Integration**
Seamlessly works with **Transformers, vLLM,** and more.
✅ **Multiple Export Formats**
Supports **AutoRound, AutoAWQ, AutoGPTQ, and GGUF** for maximum compatibility. Details are shown in [export formats](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#supported-export-formats).
✅ **Affordable Quantization Cost**
Quantize 7B models in about 10 minutes on a single GPU. Details are shown in [quantization costs](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#quantization-costs).
✅ **10+ VLMs Support**
Out-of-the-box quantization for 10+ vision-language models. See the [example models](https://huggingface.co/collections/OPEA/vlms-autoround-675bc712fdd6a55ebaf11bfa) and the [support matrix](https://github.com/intel/auto-round/tree/main/auto_round/mllm#support-matrix).
✅ **Layerwise Mixed Bits Quantization**
Assign different bits per layer for fine-grained accuracy/performance trade-offs. Details are shown in [mixed bits quantization](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#mixed-bits-usage).
✅ **Round-to-Nearest Mode**
Use `--iters 0` for fast, calibration-free quantization with some accuracy drop (see the sketch after this list). Details are shown in [rtn mode](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#rtn-mode).
✅ **Multiple Recipes**
Choose from `auto-round-best`, `auto-round`, and `auto-round-light` to suit your needs. Details are shown in [quantization recipes](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#recipe-recommendation).
✅ **Advanced Utilities**
Includes [multi-GPU quantization](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#devicemulti-gpu-setting-in-quantization), [multiple calibration datasets](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#default-dataset), and support for [10+ runtime backends](https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#specify-inference-backend).
✅ **Beyond Weight-Only Quantization**
We are actively expanding support for additional data types such as **MXFP**, NVFP, W8A8, and more.
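For a quick taste of the RTN mode and layerwise mixed bits features above, here is a minimal sketch using the Python API described later in this README. The `layer_config` entry and the chosen layer name are illustrative assumptions; see the linked mixed bits documentation for the exact schema.
```python
from auto_round import AutoRound

# RTN mode: iters=0 skips gradient tuning entirely (fast, calibration-free, some accuracy drop).
# layer_config assigns different bits to individual layers; the layer name and per-layer
# options below are illustrative -- consult the mixed bits documentation for the exact schema.
layer_config = {"lm_head": {"bits": 8}}  # hypothetical override: keep lm_head at 8 bits
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W4A16", iters=0, layer_config=layer_config)
ar.quantize_and_save(output_dir="./tmp_autoround", format="auto_round")
```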
## Installation
### Install from PyPI
```bash
# CPU/Intel GPU/CUDA
pip install auto-round
# HPU
pip install auto-round-lib
```
<details>
<summary>Build from Source</summary>
```bash
# CPU/Intel GPU/CUDA
pip install .
# HPU
python setup.py install lib
```
</details>
## Model Quantization (CPU/Intel GPU/Gaudi/CUDA)
### CLI Usage
Please switch to `auto-round-mllm` for vision-language model (VLM) quantization. The full list of supported arguments is available by running `auto-round -h` in the terminal.
```bash
auto-round \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16" \
--format "auto_round" \
--output_dir ./tmp_autoround
```
We offer two additional recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively. Details are as follows.
<details>
<summary>Other Recipes</summary>
```bash
# Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16" \
--low_gpu_mem_usage
```
```bash
# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16"
```
<!-- ```bash
auto-round-fast \
# Fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
--model Qwen/Qwen3-0.6B \
--bits 4 \
--group_size 128 \
``` -->
</details>
In conclusion, we recommend using **auto-round for W4A16 and auto-round-best with `enable_alg_ext` for W2A16**. However, you may adjust the
configuration to suit your specific requirements and available resources.
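A hedged Python sketch of the W2A16 recommendation, assuming the `--enable_alg_ext` CLI flag is also exposed as a constructor argument of the same name (please verify against `auto-round -h` and the current API), with the "best"-style settings taken from the commented example in the API section below:
```python
from auto_round import AutoRound

# Sketch only: "best"-style settings (nsamples=512, iters=1000) plus the assumed
# enable_alg_ext keyword mirroring the --enable_alg_ext CLI flag.
ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="W2A16",
    nsamples=512,
    iters=1000,
    low_gpu_mem_usage=True,  # optional: saves ~20GB VRAM at ~30% extra runtime
    enable_alg_ext=True,     # assumption: keyword counterpart of --enable_alg_ext
)
ar.quantize_and_save(output_dir="./tmp_w2a16", format="auto_round")
```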
### API Usage
```python
from auto_round import AutoRound
# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"
# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")
# Highest accuracy (4–5× slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)
# Faster quantization (2–3× speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)
# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./tmp_autoround", format="auto_round")
```
<details>
<summary>Detailed Hyperparameters</summary>
- `model`: The PyTorch model to be quantized.
- `tokenizer`: An optional tokenizer for processing input data. If none, a dataset must be provided.
- `bits (int)`: Number of bits for quantization (default is 4).
- `group_size (int)`: Size of the quantization group (default is 128).
- `sym (bool)`: Whether to use symmetric quantization (default is True).
- `enable_quanted_input (bool)`: Whether to use the output of the previous quantized block as the input for the current
block for tuning (default is True).
- `enable_minmax_tuning (bool)`: Whether to enable weight min-max tuning (default is True).
- `iters (int)`: Number of tuning iterations (default is 200).
- `lr (float)`: The learning rate for the rounding values (default is None; it will be set to 1.0/iters automatically).
- `minmax_lr (float)`: The learning rate for min-max tuning (default is None, it will be set to lr automatically).
- `nsamples (int)`: Number of samples for tuning (default is 128).
- `seqlen (int)`: Data length of the sequence for tuning (default is 2048).
- `batch_size (int)`: Batch size for training (default is 8).
- `scale_dtype (str)`: The data type of the quantization scale (default is "float16"); different kernels support
different choices.
- `amp (bool)`: Whether to use automatic mixed precision (default is True).
- `nblocks (int)`: Packing several blocks as one for tuning together (default is 1).
- `gradient_accumulate_steps (int)`: Number of gradient accumulation steps (default is 1).
- `low_gpu_mem_usage (bool)`: Whether to save GPU memory at the cost of ~20% more tuning time (default is False).
- `dataset (Union[str, list, tuple, torch.utils.data.DataLoader])`: The dataset name for tuning (default is "NeelNanda/pile-10k"). Local JSON files and combinations of datasets are supported, e.g.,
"./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".
- `layer_config (dict)`: Configuration for weight quantization (default is None), mainly for mixed bits
or mixed precision.
- `device`: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.
</details>
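Putting a few of these hyperparameters together, here is a minimal sketch with a combined calibration dataset and an explicit tuning budget. The values and the local JSON path are illustrative, not a recommended recipe.
```python
from auto_round import AutoRound

# Illustrative values only: a combined calibration dataset (a hypothetical local JSON
# file plus the train split of NeelNanda/pile-10k) and an explicit tuning budget.
ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="W4A16",
    dataset="./tmp.json,NeelNanda/pile-10k:train",
    nsamples=256,
    seqlen=1024,
    batch_size=4,
    gradient_accumulate_steps=2,
    low_gpu_mem_usage=True,
)
ar.quantize_and_save(output_dir="./tmp_autoround", format="auto_round")
```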
### API Usage for VLMs
If you encounter issues during quantization, try setting `iters=0` (to enable RTN mode) and `group_size=32` for better results.
<details>
<summary>Click to expand</summary>
**This feature is experimental and may be subject to changes**.
By default, AutoRoundMLLM quantizes only the text module of VLMs and uses `NeelNanda/pile-10k` for calibration. To
quantize the entire model, enable `quant_nontext_module` by setting it to True, though support for this feature
is limited. For more information, please refer to the AutoRoundMLLM [readme](./auto_round/mllm/README.md).
```python
from auto_round import AutoRoundMLLM
# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRoundMLLM(model_name_or_path, scheme="W4A16")
output_dir = "./tmp_autoround"
ar.quantize_and_save(output_dir)
```
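If quantization runs into issues, the fallback suggested above can be written as follows (the keyword arguments mirror the standard AutoRound hyperparameters):
```python
from auto_round import AutoRoundMLLM

# Fallback sketch from the tip above: RTN mode (iters=0) with a smaller group size.
ar = AutoRoundMLLM("Qwen/Qwen2.5-VL-7B-Instruct", scheme="W4A16", iters=0, group_size=32)
ar.quantize_and_save("./tmp_autoround")
```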
</details>
## Model Inference
### vLLM (CPU/Intel GPU/CUDA)
Please note that support for MoE models and vision-language models is currently limited.
```python
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Transformers (CPU/Intel GPU/Gaudi/CUDA)
AutoRound supports 10+ backends and automatically selects the best available one based on the installed libraries, prompting the user to
install additional libraries when a better backend is found.
**Please avoid manually moving the quantized model to a different device** (e.g., `model.to('cpu')`) during inference, as
this may cause unexpected exceptions.
Support for the Gaudi device is limited.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
## Acknowledgement
Special thanks to open-source libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLlamaV2 for providing the low-precision CUDA kernels that AutoRound leverages.
## 🌟 Support Us
If you find AutoRound helpful, please ⭐ star the repo and share it with your community!