gptq-triton

Name: gptq-triton
Version: 0.0.3
Home page: https://github.com/fpgaminer/GPTQ-triton
Summary: Fast GPTQ kernels written in Triton
Upload time: 2023-04-20 03:29:50
Author: fpgaminer
Requires Python: >=3.6
License: Apache License 2.0
Keywords: gptq, triton, torch, cuda, gpu, quantization, quantize, quantized, inference, deep learning, machine learning
# GPTQ-triton

This is my attempt at implementing a Triton kernel for GPTQ inference.  This code is based on the [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) codebase, which is itself based on the [GPTQ](https://github.com/IST-DASLab/gptq) codebase.

```
@article{frantar-gptq,
  title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers}, 
  author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
  year={2022},
  journal={arXiv preprint arXiv:2210.17323}
}
```

## Installation

`pip install .`


## Motivation

As of today (2023-03-27), the CUDA kernels in the aforementioned codebases do not scale well with context length, running up to 10x slower when the context is large compared to the equivalent FP16 model.  To solve this, I'm implementing the inference kernel in Triton, which should allow for much better scaling.

The implementation is based around the matmul tutorial from the Triton documentation.  The main difference is decoding the quantized weights before performing each sub-block of the matrix multiplication.
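To make the idea concrete, here is a heavily simplified sketch in the spirit of that tutorial (not the kernel shipped in this package): the inner K-loop dequantizes each weight sub-block with `w = (w - z - 1) * s` right before the `tl.dot`.  It assumes one 4-bit code per int8 element and a single per-column fp32 scale/zero (the groupsize -1 case); the real kernel additionally packs eight codes per int32, and all names below are made up for illustration.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def quant_matmul_kernel(
    a_ptr, qw_ptr, scales_ptr, zeros_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak, stride_wk, stride_wn, stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # One program computes a BLOCK_M x BLOCK_N tile of C = A @ dequant(QW).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # groupsize == -1: a single fp32 scale and zero per output column.
    scale = tl.load(scales_ptr + offs_n)
    zero = tl.load(zeros_ptr + offs_n)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + (k + offs_k)[None, :] * stride_ak)
        qw = tl.load(qw_ptr + (k + offs_k)[:, None] * stride_wk + offs_n[None, :] * stride_wn)
        # Decode this weight sub-block right before the sub-block matmul.
        w = (qw.to(tl.float32) - zero[None, :] - 1.0) * scale[None, :]
        acc += tl.dot(a, w.to(tl.float16))

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))


def quant_matmul(a, qweight, scales, zeros, block=64):
    # a: (M, K) fp16; qweight: (K, N) int8 holding 4-bit codes; scales/zeros: (N,) fp32.
    # Boundary masks are omitted, so M, N and K must be multiples of `block`.
    M, K = a.shape
    _, N = qweight.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, block), triton.cdiv(N, block))
    quant_matmul_kernel[grid](
        a, qweight, scales, zeros, c, M, N, K,
        a.stride(0), a.stride(1), qweight.stride(0), qweight.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=block, BLOCK_N=block, BLOCK_K=block,
    )
    return c
```

Because the decode happens per sub-block inside the K-loop, the quantized weights never need to be materialized in full precision in global memory.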

Fusion of the FF layers and of the QKV matrices is also applied.


## Performance

This benchmark was run on a 3090 using the `benchmark_generate.py` script.

![Triton benchmark graph](TritonBench.png)


## Accuracy (PPL)

The following results were obtained using the `ppl.py` script with a stride of 512 and a context length of 2048.
For the 4-bit CUDA results, a custom version of `ppl.py` was used, as the current script is dedicated to the Triton kernel conventions.
The it/s numbers are from a 3090.


| [LLaMA-7B](https://arxiv.org/abs/2302.13971)       | Bits | group-size | memory(MiB) | it/s | Wikitext2 |  PTB  |  C4  | 
| -------------------------------------------------- | ---- | ---------- | ----------- | ---- | --------- | ----- | ---- |
| FP16                                               |  16  |      -     |    17373    | 1.64 |    5.04   |  7.85 | 6.99 |
| GPTQ CUDA                                          |   4  |     -1     |     8805    | 0.11 |    5.44   |  8.24 |   -  |
| GPTQ Triton                                        |   4  |     -1     |     8099    | 1.63 |    5.44   |  8.24 | 7.48 |


| [LLaMA-13B](https://arxiv.org/abs/2302.13971)      | Bits | group-size | memory(MiB) | it/s | Wikitext2 |  PTB  |  C4  |
| -------------------------------------------------- | ---- | ---------- | ----------- | ---- | --------- | ----- | ---- |
| FP16                                               |  16  |      -     |    31633    |   -  |    4.52   |  7.19 | 6.66 |
| GPTQ Triton                                        |   4  |     -1     |    13241    | 0.89 |    4.74   |  7.49 | 7.00 |


| [LLaMA-30B](https://arxiv.org/abs/2302.13971)      | Bits | group-size | memory(MiB) | it/s | Wikitext2 |  PTB  |  C4  |
| -------------------------------------------------- | ---- | ---------- | ----------- | ---- | --------- | ----- | ---- |
| FP16                                               |  16  |      -     |    72491    |   -  |    3.61   |  6.50 | 6.07 |
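For reference, the strided evaluation described above looks roughly like the following sketch (illustrative only, using the standard Hugging Face pattern on an FP16 model; this is not the actual `ppl.py`, and the model path is a placeholder):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/model"   # placeholder
context, stride = 2048, 512

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids   # (1, T)

nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
    end = min(begin + context, input_ids.size(1))
    trg_len = end - prev_end                      # tokens scored in this window
    ids = input_ids[:, begin:end].cuda()
    targets = ids.clone()
    targets[:, :-trg_len] = -100                  # don't re-score the overlapping prefix
    with torch.no_grad():
        loss = model(ids, labels=targets).loss    # mean NLL over the scored tokens
    nlls.append(loss.float() * trg_len)
    prev_end = end
    if end == input_ids.size(1):
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```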


## Requirements

See `setup.cfg`, but note that a nightly `transformers` build is preferred right now; v4.28.1 might work.  A known working `transformers` commit is `28f26c107b4a1c5c7e32ed4d9575622da0627a40`.


## Quantizing a model

The `quantize.py` script is used to quantize a HuggingFace model.  Example usage:

`./quantize.py --model <Path to a HF FP16 model> --dataset c4 --wbits 4 --groupsize -1 --act-order --true-sequential --save <Path to the output folder>`

Arguments:

* `--model`: Path to a HF FP16 model
* `--dataset`: Dataset to use for calibration.  Can be `wikitext-2`, `ptb`, `ptb-new` or `c4`.
* `--seed`: Seed for sampling the calibration data.
* `--nsamples`: Number of calibration data samples.
* `--percdamp`: Percent of the average Hessian diagonal to use for dampening (default 0.01).
* `--wbits`: Number of bits to use for quantization.
* `--groupsize`: Groupsize to use for quantization; default (-1) uses full row.
* `--save`: Save quantized result to this folder.
* `--safetensors`: Save using the safetensors format.
* `--act-order`: Use activation order quantization.
* `--true-sequential`: Use true sequential quantization.

**NOTE:** The Triton kernel is currently only implemented for 4-bit quantization and groupsize -1.

### Explanation of `groupsize`

The GPTQ quantization algorithm is applied to `nn.Linear`, `nn.Conv2d`, and `transformers.Conv1D` layers.  (NOTE: `quantize.py` currently only supports LLaMA-like models, so only `nn.Linear` layers are quantized, and `lm_head` is skipped.)  Each matrix is quantized into a quantized weight matrix, quantized zeros, and float16 scales (the bias is not quantized).  During matmul, the weights are decoded using the formula `w = (w - z - 1) * s`.

Scales and zeros are per-outfeature, so when there is no grouping, the scales and zeros are `1xOutfeatures`.  That means each row of the matrix (i.e. along the infeatures dimension) is quantized using the same scalar scale and zero.  When grouping is used, each row is split into groups of `groupsize` values, and each group is quantized using its own scalar scale and zero.  The scales and zeros are then `(Infeatures//groupsize)xOutfeatures`.
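As a concrete reference for those shapes, here is a hypothetical dequantization sketch (the shipped code additionally packs eight 4-bit codes per int32, which is omitted here):

```python
import torch

infeatures, outfeatures, groupsize = 4096, 4096, 128   # example sizes

qweight = torch.randint(0, 16, (infeatures, outfeatures))                        # 4-bit codes, unpacked
scales  = torch.rand(infeatures // groupsize, outfeatures)                       # one scale per group per column
zeros   = torch.randint(0, 16, (infeatures // groupsize, outfeatures)).float()   # one zero per group per column

# Expand each group's scale/zero down its rows, then apply w = (w - z - 1) * s.
s = scales.repeat_interleave(groupsize, dim=0)    # (infeatures, outfeatures)
z = zeros.repeat_interleave(groupsize, dim=0)
w = (qweight.float() - z - 1.0) * s               # dequantized weight

# With groupsize == -1 there is a single group: scales and zeros are (1, outfeatures)
# and simply broadcast over all rows instead.
```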

Groupsize provides a tradeoff: lower groupsizes give the quantization more granularity and thus less accuracy loss, but reduce the memory savings offered by quantization.

### Explanation of `nsamples` and `dataset`
`nsamples` and `dataset` affect the calibration data.  This input data is fed through the network during quantization to calibrate the algorithm.  See the GPTQ paper for more detail.

### Explanation of `true-sequential`
Models are quantized sequentially, one "layer" at a time.  For example, LLaMA 7B has 32 layers, starting after the input embedding and followed by the head.  Each of these layers contains many different Linear modules that can be quantized.  Without the `true-sequential` flag, these Linear modules are quantized in an arbitrary order.  With the `true-sequential` flag, they are quantized in the order they would be encountered during a forward pass.  This can provide an accuracy boost.

### Explanation of `act-order`

I don't know.  Looking at the code (in `gptq.py`), it seems to re-order the matrix before quantization based on the `argsort` of the estimated `H`.  The order in which the columns of the matrix are quantized might have an impact on final accuracy.  `act-order` was introduced by the GPTQ authors to improve accuracy when quantizing "small" models like LLaMA 7B.
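For reference, the reordering step amounts to something like the following sketch (hypothetical stand-in tensors; see `gptq.py` for the actual quantization loop):

```python
import torch

H = torch.rand(8, 8); H = H @ H.T        # stand-in for the estimated Hessian (in x in)
W = torch.rand(16, 8)                    # stand-in weight matrix (out x in)

perm = torch.argsort(torch.diag(H), descending=True)   # most "active" columns first
W_perm = W[:, perm]                      # quantize the columns in this order
H_perm = H[perm][:, perm]                # H is permuted consistently

inv = torch.argsort(perm)
W_back = W_perm[:, inv]                  # restore the original column order afterwards
```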


## Files

* `benchmark_generate.py` - A script for benchmarking generation speed at different prompt lengths and generation lengths.

* `Benchmark.ipynb` - A notebook for benchmarking the Triton kernel against the CUDA kernel and FP16.

* `quantize.py` - A script for quantizing a model.

* `generate.py` - An example script for generating text from a model.  Example usage: `./generate.py --model <Path to your quantized model> --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 0.6 --top-p 0.6 --repetition-penalty 1.1`

* `ppl.py` - A script for calculating the perplexity of a model against wikitext2, PTB, and C4.  This is useful for verifying correctness of the Triton kernel, comparing it to the CUDA kernel and the original FP16 model.

* `Verify.ipynb` - A notebook for verifying the correctness of the Triton kernel.

            
