<div align="center">
AutoRound
===========================
<h3> Advanced Quantization Algorithm for LLMs</h3>
[![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.3.1-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/auto-round/blob/main/LICENSE)
---
<div align="left">
AutoRound is an advanced quantization algorithm for low-bit LLM inference, tailored for a wide range
of models. AutoRound uses signed gradient descent to fine-tune the rounding values and min-max values of weights in just 200
steps. It competes impressively against recent methods without introducing any additional inference overhead, while keeping
the tuning cost low. The image below presents an overview of AutoRound. Check out our paper
on [arxiv](https://arxiv.org/pdf/2309.05516) for more
details and visit [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard) for
more accuracy data and recipes across various models.
<div align="center">
![](docs/imgs/autoround_overview.png)
<div align="left">
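To make the tuning idea concrete, here is a minimal pure-Python sketch of signed gradient descent on rounding offsets for a single toy linear layer. It is an illustration under stated assumptions, not AutoRound's implementation: the helper names (`quantize_dequantize`, `output_errors`, `tune_rounding`) are hypothetical, min-max tuning is omitted, the gradient uses a straight-through estimator, and the step size of `1.0 / iters` follows the documented default for `lr`.

```python
import random


def quantize_dequantize(weights, offsets, bits=4):
    """Symmetric quantize-dequantize with a per-weight rounding offset."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w, v in zip(weights, offsets):
        q = max(qmin, min(qmax, round(w / scale + v)))
        dequantized.append(q * scale)
    return dequantized, scale


def output_errors(weights, dequantized, samples):
    """Per-sample difference between the quantized and float outputs x . w."""
    return [sum(x_i * (d - w) for x_i, d, w in zip(x, dequantized, weights))
            for x in samples]


def tune_rounding(weights, samples, bits=4, iters=200):
    """Tune rounding offsets with sign-only updates, keeping the best result."""
    lr = 1.0 / iters  # assumption: mirrors the documented default for `lr`
    offsets = [0.0] * len(weights)
    dequantized, scale = quantize_dequantize(weights, offsets, bits)
    best_loss = sum(e * e for e in output_errors(weights, dequantized, samples))
    best_offsets = list(offsets)
    for _ in range(iters):
        errors = output_errors(weights, dequantized, samples)
        # Straight-through estimator: treat d(dequantized_i)/d(offset_i) as `scale`.
        grads = [2.0 * scale * sum(e * x[i] for e, x in zip(errors, samples))
                 for i in range(len(weights))]
        # Signed gradient descent: only the sign of each gradient is used.
        offsets = [max(-0.5, min(0.5, v - lr * ((g > 0) - (g < 0))))
                   for v, g in zip(offsets, grads)]
        dequantized, scale = quantize_dequantize(weights, offsets, bits)
        loss = sum(e * e for e in output_errors(weights, dequantized, samples))
        if loss < best_loss:
            best_loss, best_offsets = loss, list(offsets)
    return best_offsets, best_loss


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(16)]
samples = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
rtn, _ = quantize_dequantize(weights, [0.0] * len(weights))
rtn_loss = sum(e * e for e in output_errors(weights, rtn, samples))
tuned_offsets, tuned_loss = tune_rounding(weights, samples)
print(f"round-to-nearest loss: {rtn_loss:.4f}, tuned loss: {tuned_loss:.4f}")
```

Because the loop keeps the best offsets seen so far, the tuned loss can never be worse than plain round-to-nearest on the calibration samples. Since only the dequantized weights change, this kind of tuning adds no extra work at inference time.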
## What's New
* [2024/10] Important update: We now support full-range symmetric quantization and have made it the default
configuration. This approach is typically better than or comparable to asymmetric quantization and significantly
outperforms other symmetric variants, especially at low bit-widths like 2-bit. In addition, you no longer need to
compile from source to run the AutoRound format.
* [2024/09] AutoRound format supports several LVM models. Check out the
examples [Qwen2-Vl](./examples/multimodal-modeling/Qwen-VL), [Phi-3-vision](./examples/multimodal-modeling/Phi-3-vision), [Llava](./examples/multimodal-modeling/Llava).
* [2024/08] AutoRound format supports Intel Gaudi2 devices. Please refer
to [Intel/Qwen2-7B-int4-inc](https://huggingface.co/Intel/Qwen2-7B-int4-inc).
* [2024/08] AutoRound introduces several experimental features, including fast tuning of norm/bias parameters (for 2-bit
and W4A4), activation quantization, and the mx_fp data type.
* [2024/07] Important change: the default value of nsamples has been changed from 512 to 128 to reduce memory
usage, which may cause a slight accuracy drop in some scenarios.
## Installation
### Build from Source
```bash
pip install -vvv --no-build-isolation -e .
```
### Install from PyPI
```bash
pip install auto-round
```
## Model Quantization
### API Usage (Gaudi2/CPU/GPU)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
from auto_round import AutoRound
bits, group_size = 4, 128
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size)
## the best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size)
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size)
autoround.quantize()
output_dir = "./tmp_autoround"
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq'(default in version<=0.3.0), 'auto_awq'
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
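For a rough feel of what `bits=4, group_size=128` means for storage, the following sketch estimates the average number of bits stored per quantized weight. It is a back-of-the-envelope calculation, not something AutoRound reports: the function names are hypothetical, one scale per group is assumed to be stored in 16 bits (matching the documented `scale_dtype` default of "float16"), and format-specific packing and unquantized layers are ignored.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_point_bits=0):
    """Average storage per weight: payload bits plus per-group metadata."""
    return bits + (scale_bits + zero_point_bits) / group_size


def quantized_size_gib(num_weights, **kwargs):
    """Approximate size in GiB of `num_weights` quantized weights."""
    return num_weights * effective_bits_per_weight(**kwargs) / 8 / 2**30


# W4G128 with one 16-bit scale per group and no stored zero point.
print(effective_bits_per_weight(bits=4, group_size=128))  # 4.125
# Same setting, assuming a 4-bit zero point is also stored per group.
print(effective_bits_per_weight(bits=4, group_size=128, zero_point_bits=4))  # 4.15625
print(f"{quantized_size_gib(7_000_000_000):.2f} GiB for 7e9 quantized weights")
```

Smaller groups raise the per-weight overhead because each group carries its own scale, which is one reason `group_size` trades accuracy against model size.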
<details>
<summary>Detailed Hyperparameters</summary>
- `model`: The PyTorch model to be quantized.
- `tokenizer`: An optional tokenizer for processing input data. If `None`, a dataset must be provided.
- `bits (int)`: Number of bits for quantization (default is 4).
- `group_size (int)`: Size of the quantization group (default is 128).
- `sym (bool)`: Whether to use symmetric quantization (default is True).
- `enable_quanted_input (bool)`: Whether to use the output of the previous quantized block as the input for the current
block for tuning (default is True).
- `enable_minmax_tuning (bool)`: Whether to enable weight min-max tuning (default is True).
- `iters (int)`: Number of tuning iterations (default is 200).
- `lr (float)`: The learning rate for rounding values (default is None, in which case it is set to 1.0/iters automatically).
- `minmax_lr (float)`: The learning rate for min-max tuning (default is None, in which case it is set to lr automatically).
- `nsamples (int)`: Number of samples for tuning (default is 128).
- `seqlen (int)`: Data length of the sequence for tuning (default is 2048).
- `batch_size (int)`: Batch size for training (default is 8).
- `scale_dtype (str)`: The data type of the quantization scale (default is "float16"). Different kernels support
different choices.
- `amp (bool)`: Whether to use automatic mixed precision (default is True).
- `nblocks (int)`: Packing several blocks as one for tuning together (default is 1).
- `gradient_accumulate_steps (int)`: Number of gradient accumulation steps (default is 1).
- `low_gpu_mem_usage (bool)`: Whether to save GPU memory at the cost of ~20% more tuning time (default is False).
- `dataset (Union[str, list, tuple, torch.utils.data.DataLoader])`: The dataset name for tuning (default is
"NeelNanda/pile-10k"). Local JSON files and combinations of datasets are supported, e.g.
"./tmp.json,NeelNanda/pile-10k:train, mbpp:train+validation+test".
- `layer_config (dict)`: Configuration for weight quantization (default is None), mainly for mixed bits
or mixed precision.
- `device`: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.
</details>
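The hyperparameter defaults above can be sketched in a few lines. The first helper restates the documented rule that `lr` falls back to `1.0/iters` and `minmax_lr` falls back to `lr`. The second builds a dictionary that could be passed as `layer_config` for mixed bits. Both helper names are hypothetical, the layer names are made-up OPT-style examples, and the per-layer keys `bits` and `group_size` are an assumption that they mirror the top-level arguments; check the library before relying on them.

```python
def resolve_learning_rates(iters=200, lr=None, minmax_lr=None):
    """Apply the documented fallbacks: lr -> 1.0/iters, minmax_lr -> lr."""
    if lr is None:
        lr = 1.0 / iters
    if minmax_lr is None:
        minmax_lr = lr
    return lr, minmax_lr


def build_layer_config(layer_names, overrides):
    """Give every layer whose name contains a pattern its own settings."""
    config = {}
    for name in layer_names:
        for pattern, settings in overrides.items():
            if pattern in name:
                config[name] = dict(settings)
    return config


print(resolve_learning_rates(iters=200))  # (0.005, 0.005)

# Hypothetical layer names; real names come from model.named_modules().
layer_names = [
    "model.decoder.layers.0.self_attn.q_proj",
    "model.decoder.layers.0.fc1",
    "model.decoder.layers.1.fc1",
]
# Assumed key names: keep `fc1` layers at 8 bits with a smaller group size.
layer_config = build_layer_config(layer_names, {"fc1": {"bits": 8, "group_size": 32}})
print(layer_config)
# The result would then be passed along, for example:
# autoround = AutoRound(model, tokenizer, bits=4, group_size=128, layer_config=layer_config)
```

Layers not listed in the dictionary would keep the global `bits` and `group_size` settings.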
### Basic Usage (version > 0.3.0)
Run `auto_round -h` in a terminal for the full list of supported arguments.
The command is also available as `auto-round`.
```bash
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format auto_round \
--disable_eval \
--output_dir ./tmp_autoround
```
We provide two additional recipes: one for best accuracy and one for fast running speed with low memory usage. Details are below.
<details>
<summary>Other Recipes</summary>
```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 512 \
--iters 1000 \
--low_gpu_mem_usage \
--disable_eval
```
```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto_round --model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--nsamples 128 \
--iters 200 \
--seqlen 512 \
--batch_size 4 \
--disable_eval
```
</details>
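When driving these recipes from a script, it can help to build the command line programmatically. The sketch below assembles the three command variants shown above from a small table. The `RECIPES` table and `build_command` helper are hypothetical conveniences, not part of AutoRound; only flags that appear in the commands above are used.

```python
import shlex

# Extra options per recipe, copied from the commands above.
RECIPES = {
    "default": {},
    "best_accuracy": {"nsamples": 512, "iters": 1000, "low_gpu_mem_usage": True},
    "fast": {"nsamples": 128, "iters": 200, "seqlen": 512, "batch_size": 4},
}


def build_command(model, recipe="default", bits=4, group_size=128, output_dir=None):
    """Return the auto_round invocation as an argument list."""
    options = {"model": model, "bits": bits, "group_size": group_size}
    options.update(RECIPES[recipe])
    options["disable_eval"] = True
    if output_dir is not None:
        options["output_dir"] = output_dir
    args = ["auto_round"]
    for key, value in options.items():
        if value is True:
            args.append(f"--{key}")  # boolean switches take no value
        else:
            args.extend([f"--{key}", str(value)])
    return args


print(shlex.join(build_command("facebook/opt-125m", "fast")))
print(shlex.join(build_command("facebook/opt-125m", "best_accuracy")))
```

The returned list can be handed to `subprocess.run(args, check=True)` to launch tuning without going through a shell.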
#### Formats
**AutoRound Format**: This format is well-suited for CPU and HPU devices, 2-bit quantization, and mixed-precision
inference. [2,4] bits are supported. It also benefits
from the Marlin kernel, which can boost inference performance notably. However, it has not yet gained widespread
community adoption. For CUDA support, you will need to
install from the source.
**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the
community. [2,3,4,8] bits are supported; for 3 bits, run `pip install auto-gptq` before quantization. It also benefits
from the Marlin kernel, which can boost inference performance notably. However, **the
asymmetric kernel has issues** that can cause considerable accuracy drops, particularly with 2-bit quantization and small
models.
Additionally, symmetric quantization tends to perform poorly at 2-bit precision.
**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted
within the community. Only 4-bit quantization is supported. It features
specialized layer fusion tailored for Llama models.
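The bit-width support described for the three formats can be captured in a small lookup table, which is handy for validating a configuration before starting a long tuning run. This is a sketch built only from the descriptions above: the table and helper names are hypothetical, and the format keys reuse the `format` values shown in the API example.

```python
# Supported bit-widths per export format, as listed in the descriptions above.
SUPPORTED_BITS = {
    "auto_round": {2, 4},
    "auto_gptq": {2, 3, 4, 8},
    "auto_awq": {4},
}


def check_format(fmt, bits):
    """Raise ValueError if the format is unknown or does not list this bit-width."""
    if fmt not in SUPPORTED_BITS:
        raise ValueError(f"unknown format: {fmt!r}")
    if bits not in SUPPORTED_BITS[fmt]:
        allowed = sorted(SUPPORTED_BITS[fmt])
        raise ValueError(f"{fmt} lists {allowed} bits, got {bits}")
    return fmt


def formats_for(bits):
    """Return the formats that list the given bit-width, in alphabetical order."""
    return sorted(fmt for fmt, allowed in SUPPORTED_BITS.items() if bits in allowed)


print(formats_for(4))  # ['auto_awq', 'auto_gptq', 'auto_round']
print(formats_for(3))  # ['auto_gptq']
print(check_format("auto_round", 2))  # auto_round
```

The table covers bit-widths only. Device support and the symmetric versus asymmetric caveats described above still need to be considered separately.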
## Model Inference
Please run the quantization code first.
### AutoGPTQ/AutoAWQ format
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
### AutoRound format
**CPU**: `pip install intel-extension-for-transformers`
**HPU**: A docker image with the Gaudi Software Stack is recommended. More details can be found
in the [Gaudi Guide](https://docs.habana.ai/en/latest/).
**CUDA**: For symmetric quantization, `pip install auto-gptq` (tuning needs auto-round 0.3.0+). For asymmetric quantization, install auto-round from source.
#### CPU/HPU/CUDA on 0.3.0+
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig
device = "auto"  # or "cpu", "hpu", "cuda"
quantization_config = AutoRoundConfig(
backend=device
)
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
device_map=device, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
#### CPU/HPU/CUDA on 0.3.0
**CUDA**: You need to install auto-round from source.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round.auto_quantizer import AutoHfQuantizer ## must import
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
<br>
<details>
<summary>Evaluation</summary>
```bash
## version > 0.3.0
auto_round --model saved_quantized_model \
--eval \
--task lambada_openai \
--eval_bs 1
```
</details>
## Support List
AutoRound supports nearly all major large language models.
Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a
different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot
release most of the models ourselves.
| Model | Supported |
|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| meta-llama/Meta-Llama-3.1-70B-Instruct | [recipe](https://huggingface.co/Intel/Meta-Llama-3.1-70B-Instruct-int4-inc) |
| meta-llama/Meta-Llama-3.1-8B-Instruct | [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-asym), [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-sym), [recipe](https://huggingface.co/Intel/Meta-Llama-3.1-8B-Instruct-int4-inc) |
| meta-llama/Meta-Llama-3.1-8B | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-autoround-gptq-4bit-sym) |
| Qwen/Qwen-VL | [accuracy](./examples/multimodal-modeling/Qwen-VL/README.md), [recipe](./examples/multimodal-modeling/Qwen-VL/run_autoround.sh) |
| Qwen/Qwen2-7B | [model-autoround-int4](https://huggingface.co/Intel/Qwen2-7B-int4-inc) |
| Qwen/Qwen2-57B-A14B-Instruct | [model-autoround-int4](https://huggingface.co/Intel/Qwen2-57B-A14B-Instruct-int4-inc) |
| 01-ai/Yi-1.5-9B | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-4bit-gptq-autoround) |
| 01-ai/Yi-1.5-9B-Chat | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-Chat-4bit-gptq-autoround) |
| Intel/neural-chat-7b-v3-3 | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-3-int4-inc) |
| Intel/neural-chat-7b-v3-1 | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-1-int4-inc) |
| TinyLlama-1.1B-intermediate | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse) |
| mistralai/Mistral-7B-v0.1 | [model-autogptq-lmhead-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead), [model-autogptq-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc) |
| google/gemma-2b | [model-autogptq-int4](https://huggingface.co/Intel/gemma-2b-int4-inc) |
| tiiuae/falcon-7b | [model-autogptq-int4-G64](https://huggingface.co/Intel/falcon-7b-int4-inc) |
| sapienzanlp/modello-italia-9b | [model-fbaldassarri-autogptq-int4*](https://huggingface.co/fbaldassarri/modello-italia-9b-autoround-w4g128-cpu) |
| microsoft/phi-2 | [model-autogptq-sym-int4](https://huggingface.co/Intel/phi-2-int4-inc) |
| microsoft/Phi-3.5-mini-instruct | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit) |
| microsoft/Phi-3-vision-128k-instruct | [recipe](./examples/multimodal-modeling/Phi-3-vision/run_autoround.sh) |
| mistralai/Mistral-7B-Instruct-v0.2 | [accuracy](./docs/Mistral-7B-Instruct-v0.2-acc.md), [recipe](./examples/language-modeling/scripts/Mistral-7B-Instruct-v0.2.sh), [example](./examples/language-modeling/) |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | [accuracy](./docs/Mixtral-8x7B-Instruct-v0.1-acc.md), [recipe](./examples/language-modeling/scripts/Mixtral-8x7B-Instruct-v0.1.sh), [example](./examples/language-modeling/) |
| mistralai/Mixtral-8x7B-v0.1 | [accuracy](./docs/Mixtral-8x7B-v0.1-acc.md), [recipe](./examples/language-modeling/scripts/Mixtral-8x7B-v0.1.sh), [example](./examples/language-modeling/) |
| meta-llama/Meta-Llama-3-8B-Instruct | [accuracy](./docs/Meta-Llama-3-8B-Instruct-acc.md), [recipe](./examples/language-modeling/scripts/Meta-Llama-3-8B-Instruct.sh), [example](./examples/language-modeling/) |
| google/gemma-7b | [accuracy](./docs/gemma-7b-acc.md), [recipe](./examples/language-modeling/scripts/gemma-7b.sh), [example](./examples/language-modeling/) |
| meta-llama/Llama-2-7b-chat-hf | [accuracy](./docs/Llama-2-7b-chat-hf-acc.md), [recipe](./examples/language-modeling/scripts/Llama-2-7b-chat-hf.sh), [example](./examples/language-modeling/) |
| Qwen/Qwen1.5-7B-Chat | [accuracy](./docs/Qwen1.5-7B-Chat-acc.md), [sym recipe](./examples/language-modeling/scripts/Qwen1.5-7B-Chat-sym.sh), [asym recipe](./examples/language-modeling/scripts/Qwen1.5-7B-Chat-asym.sh), [example](./examples/language-modeling/) |
| baichuan-inc/Baichuan2-7B-Chat | [accuracy](./docs/baichuan2-7b-chat-acc.md), [recipe](./examples/language-modeling/scripts/baichuan2-7b-chat.sh), [example](./examples/language-modeling/) |
| 01-ai/Yi-6B-Chat | [accuracy](./docs/Yi-6B-Chat-acc.md), [recipe](./examples/language-modeling/scripts/Yi-6B-Chat.sh), [example](./examples/language-modeling/) |
| facebook/opt-2.7b | [accuracy](./docs/opt-2.7b-acc.md), [recipe](./examples/language-modeling/scripts/opt-2.7b.sh), [example](./examples/language-modeling/) |
| bigscience/bloom-3b | [accuracy](./docs/bloom-3B-acc.md), [recipe](./examples/language-modeling/scripts/bloom-3b.sh), [example](./examples/language-modeling/) |
| EleutherAI/gpt-j-6b | [accuracy](./docs/gpt-j-6B-acc.md), [recipe](./examples/language-modeling/scripts/gpt-j-6b.sh), [example](./examples/language-modeling/) |
## Integration
AutoRound has been integrated into multiple repositories.
[Intel Neural Compressor](https://github.com/intel/neural-compressor)
[ModelCloud/GPTQModel](https://github.com/ModelCloud/GPTQModel)
[pytorch/ao](https://github.com/pytorch/ao)
## Reference
If you find AutoRound useful for your research, please cite our paper:
```bibtex
@article{cheng2023optimize,
title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
journal={arXiv preprint arXiv:2309.05516},
year={2023}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/intel/auto-round",
"name": "auto-round",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7.0",
"maintainer_email": null,
"keywords": "quantization, auto-around, LLM, SignRound",
"author": "Intel AIPT Team",
"author_email": "wenhua.cheng@intel.com, weiwei1.zhang@intel.com",
"download_url": "https://files.pythonhosted.org/packages/a5/6e/20585aab4240bf6cfdbb8ba116f32cbf1ce7d1518bd83e7545e416aa8929/auto_round-0.3.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Repository of AutoRound: Advanced Weight-Only Quantization Algorithm for LLMs",
"version": "0.3.1",
"project_urls": {
"Homepage": "https://github.com/intel/auto-round"
},
"split_keywords": [
"quantization",
" auto-around",
" llm",
" signround"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9d59be4098aaedd7c682270af93e5d875b15112bc244258b5728d18783ad424c",
"md5": "f7d685d2a014d868175f7104e7f222b8",
"sha256": "ac0d447cddbeee5ad5a646eb12c75ab117c2e42d81e201a7a11dbc0e2162f173"
},
"downloads": -1,
"filename": "auto_round-0.3.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f7d685d2a014d868175f7104e7f222b8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7.0",
"size": 158278,
"upload_time": "2024-10-17T08:35:17",
"upload_time_iso_8601": "2024-10-17T08:35:17.589307Z",
"url": "https://files.pythonhosted.org/packages/9d/59/be4098aaedd7c682270af93e5d875b15112bc244258b5728d18783ad424c/auto_round-0.3.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a56e20585aab4240bf6cfdbb8ba116f32cbf1ce7d1518bd83e7545e416aa8929",
"md5": "5a057dae2837742d6c325a799ccb6225",
"sha256": "2e7d5b525b75d5ddebb96d38f7378c40d010c284f4d3d26b0eb3c1c08e6c6a7f"
},
"downloads": -1,
"filename": "auto_round-0.3.1.tar.gz",
"has_sig": false,
"md5_digest": "5a057dae2837742d6c325a799ccb6225",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7.0",
"size": 129998,
"upload_time": "2024-10-17T08:35:19",
"upload_time_iso_8601": "2024-10-17T08:35:19.733666Z",
"url": "https://files.pythonhosted.org/packages/a5/6e/20585aab4240bf6cfdbb8ba116f32cbf1ce7d1518bd83e7545e416aa8929/auto_round-0.3.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-17 08:35:19",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "intel",
"github_project": "auto-round",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "accelerate",
"specs": []
},
{
"name": "datasets",
"specs": []
},
{
"name": "py-cpuinfo",
"specs": []
},
{
"name": "sentencepiece",
"specs": []
},
{
"name": "torch",
"specs": []
},
{
"name": "transformers",
"specs": []
},
{
"name": "triton",
"specs": []
},
{
"name": "numpy",
"specs": [
[
"<",
"2.0"
]
]
},
{
"name": "threadpoolctl",
"specs": []
},
{
"name": "lm-eval",
"specs": [
[
"==",
"0.4.4"
]
]
},
{
"name": "tqdm",
"specs": []
},
{
"name": "packaging",
"specs": []
},
{
"name": "auto-gptq",
"specs": [
[
">=",
"0.7.1"
]
]
}
],
"lcname": "auto-round"
}