auto-around

Name: auto-around
Version: 0.0
Home page: https://github.com/intel/auto-round
Summary: Repository of AutoRound: Advanced Weight-Only Quantization Algorithm for LLMs
Upload time: 2024-01-30 09:40:20
Author: Intel AIPT Team
Requires Python: >=3.7.0
License: Apache 2.0
Keywords: quantization, auto-around, LLM, SignRound
<div align="center">

AutoRound
===========================
<h3> Advanced Weight-Only Quantization Algorithm for LLMs</h3>

[![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.1-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/auto-round/blob/main/LICENSE)
---
<div align="left">

AutoRound is an advanced weight-only quantization algorithm based on SignRound. It is tailored to a wide range of models and consistently delivers noticeable improvements, often significantly outperforming SignRound, at the cost of additional tuning time for quantization.

## Prerequisites
- Python 3.9 or higher

## Installation
### Build from Source
```bash
pip install -r requirements.txt
python setup.py install
```
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(
            model_name, low_cpu_mem_usage=True, torch_dtype="auto", trust_remote_code=True
        )
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
bits, group_size, scheme = 4, 128, "asym"
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, scheme=scheme)
autoround.quantize()

```

<details>
  <summary>Detailed Hyperparameters</summary>

- `model`: The PyTorch model to be quantized.
            
- `tokenizer`: An optional tokenizer for processing input data. If none is provided, a dataloader must be supplied.
  
- `bits (int)`: Number of bits for quantization (default is 4).
  
- `group_size (int)`: Size of the quantization group (default is 128).

- `scheme (str)`: The quantization scheme (symmetric/asymmetric) to be used (default is "asym").
  
- `use_quant_input (bool)`: Whether to use the output of the previous quantized block as the input for the current block (default is True).
  
- `enable_minmax_tuning (bool)`: Whether to enable weight min-max tuning (default is True).
  
- `iters (int)`: Number of tuning iterations (default is 200).
  
- `lr (float)`: The learning rate for the rounding values (default is None; it is set to 1.0/iters automatically).
  
- `minmax_lr (float)`: The learning rate for min-max tuning (default is None; it is set to `lr` automatically).
  
- `n_samples (int)`: Number of samples for tuning (default is 512).
  
- `seqlen (int)`: Sequence length of the tuning data (default is 2048).
  
- `bs (int)`: Batch size for training (default is 8).
  
- `amp (bool)`: Whether to use automatic mixed precision (default is True).
  
- `n_blocks (int)`: Number of blocks packed together and tuned as one unit (default is 1).
  
- `gradient_accumulate_steps (int)`: Number of gradient accumulation steps (default is 1).
  
- `low_gpu_mem_usage (bool)`: Whether to save GPU memory at the cost of slightly longer tuning time (default is True).
  
- `dataset_name (str)`: The default dataset name for tuning (default is "NeelNanda/pile-10k").
  
- `dataset_split (str)`: The split of the dataset to be used for tuning (default is "train").
  
- `dataloader`: The dataloader for tuning data.
  
- `weight_config (dict)`: Configuration for weight quantization (default is an empty dictionary), mainly for mixed bits or mixed precision.
  
- `device`: The device to be used for tuning (default is "cuda:0").
  
</details>
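
For reference, these hyperparameters correspond to keyword arguments of `AutoRound`. Below is a minimal sketch with illustrative values (not tuned recommendations), assuming the keyword names match the list above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Illustrative values only; every keyword below is documented in the list above.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                              # number of bits for quantization
    group_size=128,                      # quantization group size
    scheme="asym",                       # asymmetric quantization
    n_samples=512,                       # number of calibration samples
    seqlen=2048,                         # calibration sequence length
    bs=8,                                # training batch size
    dataset_name="NeelNanda/pile-10k",   # tuning dataset
    low_gpu_mem_usage=True,              # trade a little tuning time for GPU memory
    amp=True,                            # automatic mixed precision
    device="cuda:0",
)
autoround.quantize()
```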

## Validated Models
For the wikitext2/ptb-new/c4-new perplexity (ppl), we follow the GPTQ evaluation code and set the sequence length to 2048. For the lm-eval wikitext ppl, we adopt lm-eval. The quantization configuration is W4G128 (4-bit weights, group size 128).

<table border="1">
  <tr>
    <th>Model</th>
    <th>Method </th>
    <th>Acc AVG.</th>
    <th>MMLU</th>
    <th>Lamb.</th>
    <th>Hella.</th>
    <th>Wino.</th>
    <th>Piqa</th>
    <th>Truth.</th>
    <th>Open.</th>
    <th>Boolq</th>
    <th>RTE</th>
    <th>ARC-e</th>
    <th>ARC-c.</th>
    <th>wikitext2 ppl</th>
    <th>ptb_new ppl</th>
    <th>c4_new ppl</th>
    <th>lm_eval wikitext ppl</th>
   
  </tr>

  <tr>
    <td rowspan="3">Intel/neural-chat-7b-v3 </td>
    <th>FP16</th>
    <td>67.92</td> <!-- acc avg -->
    <td>61.13</td> <!-- MMLU -->
    <td>73.03</td> <!-- Lambada_openai -->
    <td>66.39</td> <!-- Hellaswag -->
    <td>76.40</td> <!-- Winogrande -->
    <td>81.01</td> <!-- Piqa -->
    <td>47.37</td> <!-- Truthfulqa -->
    <td>38.8</td> <!-- Openbookqa -->
    <td>86.97</td> <!-- Boolq -->
    <td>75.81</td> <!-- RTE -->
    <td>82.66</td> <!-- Arc easy -->
    <td>57.51</td> <!-- Arc Challenge -->
    <td>6.00</td> <!-- wikitext2 ppl -->
    <td>48.96</td> <!-- ptb_new ppl -->
    <td>9.65</td> <!-- c4_new ppl -->
    <td>-</td> <!-- lm-eval wikitext ppl -->
  </tr>

  <tr>
    <th>Ours</th>
    <td>66.90</td> <!-- acc avg -->
    <td>60.56</td> <!-- MMLU -->
    <td>72.19</td> <!-- Lambada_openai -->
    <td>65.28</td> <!-- Hellaswag -->
    <td>75.37</td> <!-- Winogrande -->
    <td>81.18</td> <!-- Piqa -->
    <td>46.76</td> <!-- Truthfulqa -->
    <td>36.0</td> <!-- Openbookqa -->
    <td>86.91</td> <!-- Boolq -->
    <td>73.29</td> <!-- RTE -->
    <td>81.73</td> <!-- Arc easy -->
    <td>56.66</td> <!-- Arc Challenge -->
    <td>6.21</td> <!-- wikitext2 ppl -->
    <td>59.78</td> <!-- ptb_new ppl -->
    <td>10.01</td> <!-- c4_new ppl -->
    <td>-</td> <!-- lm-eval wikitext ppl -->
  </tr>

  <tr>
    <th>Ours iters1K, disable use_quant_input, minmax_lr 0.002</th>
    <td>67.70</td> <!-- acc avg -->
    <td>60.57</td> <!-- MMLU -->
    <td>73.74</td> <!-- Lambada_openai -->
    <td>65.62</td> <!-- Hellaswag -->
    <td>77.43</td> <!-- Winogrande -->
    <td>80.85</td> <!-- Piqa -->
    <td>47.61</td> <!-- Truthfulqa -->
    <td>36.8</td> <!-- Openbookqa -->
    <td>86.94</td> <!-- Boolq -->
    <td>75.09</td> <!-- RTE -->
    <td>82.66</td> <!-- Arc easy -->
    <td>57.34</td> <!-- Arc Challenge -->
    <td>6.17</td> <!-- wikitext2 ppl -->
    <td>59.12</td> <!-- ptb_new ppl -->
    <td>9.83</td> <!-- c4_new ppl -->
    <td>-</td> <!-- lm-eval wikitext ppl -->
  </tr>


  <tr>
    <td rowspan="3">mistralai/Mixtral-8x7B-v0.1 </td>
    <th>BF16</th>
   <td>67.16</td>
    <td>69.83</td>
    <td>78.44</td>
    <td>64.89</td>
    <td>76.40</td>
    <td>82.43</td>
    <td>34.15</td>
    <td>35.40</td>
    <td>84.98</td>
    <td>71.12</td>
    <td>84.22</td>
    <td>56.91</td>
    <td>3.84</td>
    <td>19.22</td>
    <td>7.41</td>
    <td>-</td>
 
  </tr>
  <tr>
    <th>Ours</th>
    <td>65.98</td>
    <td>68.90</td>
    <td>78.11</td>
    <td>64.31</td>
    <td>74.27</td>
    <td>82.10</td>
    <td>30.97</td>
    <td>34.20</td>
    <td>84.57</td>
    <td>67.87</td>
    <td>83.96</td>
    <td>56.57</td>
    <td>4.08</td>
    <td>354</td>
    <td>7.56</td>
    <td>-</td>
  </tr>
  <tr>
    <th>Ours iters1K, disable use_quant_input</th>
    <td>66.78</td>
    <td>68.68</td>
    <td>78.61</td>
    <td>64.40</td>
    <td>76.56</td>
    <td>81.99</td>
    <td>32.56</td>
    <td>34.80</td>
    <td>85.96</td>
    <td>70.76</td>
    <td>83.96</td>
    <td>56.31</td>
    <td>3.99</td>
    <td>17.65</td>
    <td>7.52</td>
    <td>-</td>
 
  </tr>
  <tr>
    <td rowspan="2">microsoft/phi-2 </td>
    <th>FP16</th>
    <td>61.80</td>
    <td>56.40</td>
    <td>62.78</td>
    <td>55.83</td>
    <td>75.77</td>
    <td>78.67</td>
    <td>31.21</td>
    <td>40.40</td>
    <td>83.36</td>
    <td>62.45</td>
    <td>80.05</td>
    <td>52.90</td>
    <td>9.71</td>
    <td>18.16</td>
    <td>14.12</td>
    <td>11.05</td>

  </tr>
  <tr>
    <th>AutoRound</th>
    <td>61.67</td>
    <td>54.57</td>
    <td>61.32</td>
    <td>55.04</td>
    <td>76.48</td>
    <td>78.89</td>
    <td>29.74</td>
    <td>40.60</td>
    <td>83.24</td>
    <td>66.43</td>
    <td>79.76</td>
    <td>52.30</td>
    <td>9.98</td>
    <td>18.67</td>
    <td>14.39</td>
    <td>11.37</td>

  </tr>
</table>


We provide a comparative analysis with other methods in our accuracy data section ([link](docs/README.md)). Notably, our approach outperforms GPTQ in 30 of 32 settings and AWQ in 27 of 32 settings across LLaMA-V1/LLaMA-V2/Mistral-7B at W4G-1, W4G128, W3G128, and W2G128, with comparable tuning costs.
### Models passed smoke test
LaMini-GPT-124M; QWEN1-8B; OPT-125M; Bloom-560m; falcon-7b; gpt-leo-125m; stablelm-base-alpha-3b; dolly-v2-3b; mpt-7b; gpt-j-6b; chatglm2-6b


## Tips
1. Consider increasing the number of tuning steps to achieve better results, albeit with increased tuning time; a hypothetical configuration is sketched at the end of this section.

2. Leverage AutoGPTQ to evaluate the model on GPU:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(
            model_name, low_cpu_mem_usage=True, torch_dtype="auto", trust_remote_code=True
        )
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, scheme="asym")
autoround.quantize()

# Export to an AutoGPTQ-compatible format.
# Please install auto-gptq first: https://github.com/AutoGPTQ/
output_dir = "/path/to/quantized_model"
autoround.export(output_dir, target="auto_gptq", use_triton=True)
# Then follow the auto-gptq instructions to load the model and run inference.
```
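
For tip 1, the longer-tuning recipe reported in the validated-models table ("iters1K, disable use_quant_input, minmax_lr 0.002") maps onto the documented hyperparameters roughly as follows; this is a sketch, assuming `model` and `tokenizer` are loaded as in the Usage section:

```python
from auto_round import AutoRound

# Sketch of the longer-tuning recipe from the table above: 1000 iterations,
# use_quant_input disabled, and a dedicated min-max learning rate of 0.002.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    scheme="asym",
    iters=1000,             # more tuning iterations (default is 200)
    use_quant_input=False,  # do not feed the previous quantized block's output
    minmax_lr=0.002,        # learning rate for min-max tuning
)
autoround.quantize()
```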

## Known Issues
* Random issues in tuning Qwen models
* ChatGlm-V1 is not supported
  
### Examples
Enter the `examples` folder and install lm-eval to run the evaluation:
```bash
pip install -r requirements.txt
```

- **Default Settings:**
```bash
CUDA_VISIBLE_DEVICES=0 python3 main.py --model_name facebook/opt-125m --amp --bits 4 --group_size -1 --enable_minmax_tuning --use_quant_input
```
- **Reduced GPU Memory Usage and Adjusted Training Batch Size:**
```bash
CUDA_VISIBLE_DEVICES=0 python3 main.py --model_name facebook/opt-125m --amp --bits 4 --group_size -1 --low_gpu_mem_usage --train_bs 1 --gradient_accumulate_steps 8
```
- **Utilizing the AdamW Optimizer:**
Include the flag `--adam`. Note that AdamW is less effective than sign gradient descent in many of the scenarios we tested.

- **Running the Original SignRound:**
```bash
CUDA_VISIBLE_DEVICES=0 python3 main.py --model_name facebook/opt-125m --amp --bits 4 --group_size -1 --iters 400 --lr 0.0025 --minmax_lr 0.0025
```
 `--enable_minmax_tuning` is strongly recommended.

- The required transformers version varies across models. The transformers versions used to run each model in our experiments are listed below for reference.

    | Model | Transformers version |
    |  :----: | :----: |
    | EleutherAI/gpt-j-6b | 4.28/4.30/4.34/4.36 |
    | huggyllama/llama-7b | 4.28/4.30/4.34/4.36 |
    | meta-llama/Llama-2-7b-hf | 4.30/4.34/4.36 |
    | facebook/opt-6.7b | 4.28/4.30/4.34/4.36 |
    | tiiuae/falcon-7b | 4.28/4.30/4.34/4.36 |
    | mosaicml/mpt-7b | 4.28/4.30/4.34/4.36 |
    | bigscience/bloom-7b1 | 4.28/4.30/4.34/4.36 |
    | baichuan-inc/Baichuan-7B | 4.28/4.30 |
    | Qwen/Qwen-7B | 4.28/4.30/4.34/4.36 |
    | THUDM/chatglm3-6b | 4.34/4.36 |
    | mistralai/Mistral-7B-v0.1 | 4.34/4.36 |
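
If a model misbehaves with your installed transformers version, pinning transformers to one of the versions listed above is a reasonable first step, for example (the exact pin is your choice):

```bash
pip install "transformers==4.36.*"
```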
    

## Reference
If you find SignRound useful for your research, please cite our paper:
```bibtex
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```



            
