quanto

Name	quanto
Version	0.2.0
home_page	https://github.com/huggingface/quanto
Summary	A quantization toolkit for pytorch.
upload_time	2024-05-24 10:49:08
maintainer	None
docs_url	None
author	David Corvoysier
requires_python	>=3.8.0
license	Apache-2.0
keywords	torch quantization

# Quanto

**IMPORTANT**:

After having gathered feedback from our partners and the community, we have decided that `quanto` would not continue as a standalone project but would rather be merged into the [optimum](https://huggingface.co/docs/optimum/en/index) project.
External contributions to quanto will be suspended until the merge is complete.

**DISCLAIMER**: This package is still beta. Expect breaking changes in API and serialization.

🤗 Quanto is a python quantization toolkit that provides several features that are either not supported or limited by the base [pytorch quantization tools](https://pytorch.org/docs/stable/quantization.html):

- all features are available in eager mode (works with non-traceable models),
- quantized models can be placed on any device (including CUDA and MPS),
- automatically inserts quantization and dequantization stubs,
- automatically inserts quantized functional operations,
- automatically inserts quantized modules (see below the list of supported modules),
- provides a seamless workflow from a float model to a dynamic to a static quantized model,
- serialization compatible with pytorch `weights_only` and 🤗 `safetensors`,
- accelerated matrix multiplications on CUDA devices (int8-int8, fp16-int4),
- supports int2, int4, int8 and float8 weights,
- supports int8 and float8 activations.

Features yet to be implemented:

- dynamic activations smoothing,
- kernels for all mixed matrix multiplications on all devices,
- compatibility with [torch compiler](https://pytorch.org/docs/stable/torch.compiler.html) (aka dynamo).

## Quantized modules

Thanks to a seamless propagation mechanism through quantized tensors, only a few modules working as quantized
tensors insertion points are actually required.

The following modules can be quantized:

- [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) (QLinear).
  Weights are always quantized, and biases are not quantized. Inputs and outputs can be quantized.
- [Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) (QConv2D).
  Weights are always quantized, and biases are not quantized. Inputs and outputs can be quantized.
- [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html).
  Weights and biases are __not__ quantized. Outputs can be quantized.

## Limitations and design choices

### Tensors

At the heart of quanto is a Tensor subclass that corresponds to:
- the projection of a source Tensor into the optimal range for a given destination type,
- the mapping of projected values to the destination type.

For floating-point destination types, the mapping is done by the native pytorch cast (i.e. `Tensor.to()`).

For integer destination types, the mapping is a simple rounding operation (i.e. `torch.round()`).

The goal of the projection is to increase the accuracy of the conversion by minimizing the number of:
- saturated values (i.e. mapped to the destination type min/max),
- zeroed values (because they are below the smallest number that can be represented by the destination type).

The projection is symmetric, i.e. it uses a scale but no zero-point. This makes quantized Tensors
compatible with many operations.
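
The projection and mapping described above can be sketched in plain Python for an `int8` destination type. This is only an illustration under simple assumptions: the scale is derived from the maximum absolute value, and `quantize_symmetric` and `dequantize` are hypothetical helper names, not quanto's API.

```python
def quantize_symmetric(values, bits=8):
    """Project floats onto a signed integer grid using one scale and no zero-point."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Mapping: round to the nearest integer, then saturate to [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


values = [-1.0, -0.3, 0.0, 0.1, 1.0]
quantized, scale = quantize_symmetric(values)
restored = dequantize(quantized, scale)

print(quantized)  # [-127, -38, 0, 13, 127]
print(restored)   # close to the original values, within half a quantization step
```

Because there is no zero-point, a float zero always maps to the integer zero, and multiplying two quantized values only requires multiplying their scales.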

One of the benefits of using a lower-bitwidth representation is that you can take advantage of accelerated operations
for the destination type, which are typically faster than their higher-precision equivalents.

The current implementation, however, falls back to `float32` for many operations because of a lack of dedicated kernels
(only `int8` matrix multiplication is available).

Note: integer operations cannot be performed in `float16` as a fallback because this format represents integers poorly
(not every integer above 2048 is exactly representable) and will likely overflow in intermediate calculations (its largest finite value is 65504).
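
The limits of `float16` for integer arithmetic can be checked with the standard library alone. The sketch below round-trips values through IEEE 754 half precision using the `e` format of the `struct` module; `to_float16` is a hypothetical helper written for this illustration.

```python
import struct


def to_float16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# float16 has an 11-bit significand: above 2048, odd integers are not representable.
print(to_float16(2048.0))  # 2048.0
print(to_float16(2049.0))  # 2048.0

# A dot product of five saturated int8 pairs already exceeds the float16 maximum (65504).
accumulator = 5 * 127 * 127  # 80645
try:
    to_float16(float(accumulator))
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```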

Quanto does not support the conversion of a Tensor using mixed destination types.

### Modules

Quanto provides a generic mechanism to replace torch modules by quanto modules that are able to process quanto tensors.

Quanto modules dynamically convert their weights until a model is frozen, which slows down inference a bit but is
required if the model needs to be tuned.

Biases are not converted: to preserve the accuracy of a typical `addmm` operation, they would have to be converted with a
scale equal to the product of the input and weight scales. That scale is extremely small, so a very high bitwidth is
required to avoid clipping. Typically, with `int8` inputs and weights, biases would need to be quantized
with at least `12` bits, i.e. in `int16`. Since most biases are `float16` today, quantizing them is a waste of time.
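
A short numeric sketch makes the bitwidth argument concrete. The scales and the bias value below are arbitrary assumptions chosen for illustration, and `required_signed_bits` is a hypothetical helper.

```python
def required_signed_bits(value):
    """Number of bits that is enough to store a signed integer (magnitude plus sign)."""
    return abs(value).bit_length() + 1


# Assumed example scales: int8 inputs covering [-1, 1] and int8 weights covering [-0.5, 0.5].
input_scale = 1.0 / 127
weight_scale = 0.5 / 127

# To be added to the integer accumulator, the bias must use the product of both scales.
bias_scale = input_scale * weight_scale

bias = 0.1  # an arbitrary float bias value
quantized_bias = round(bias / bias_scale)

print(quantized_bias)                        # 3226
print(required_signed_bits(quantized_bias))  # 13
```

Even a modest bias of `0.1` lands far outside the `int8` range once it is expressed with the combined scale.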

Activations are quantized at runtime using static scales (by default corresponding to the range `[-1, 1]`). The model needs to be calibrated to evaluate the best activation scales, which are updated using a momentum.
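
One common way to update a scale with a momentum is an exponential moving average over calibration batches. The sketch below assumes that update rule and an `int8` target; the exact rule used by quanto is not described here, and `update_scale` is a hypothetical helper.

```python
def update_scale(current_scale, batch, momentum=0.9, qmax=127):
    """Blend the current scale with the scale observed on a new batch."""
    batch_scale = max(abs(v) for v in batch) / qmax
    return momentum * current_scale + (1.0 - momentum) * batch_scale


scale = 1.0 / 127  # default scale, corresponding to the range [-1, 1]
for batch in ([0.5, -2.0, 1.0], [3.0, -1.0, 0.2], [-2.5, 0.1, 2.0]):
    scale = update_scale(scale, batch)

# Range covered by the calibrated scale after three batches.
print(scale * 127)
```

A high momentum makes the scale move slowly, so a single unusual batch does not dominate the calibrated range.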

## Performance

In a nutshell:

- accuracy: models quantized with `int8`/`float8` weights and `float8` activations are very close to the `16-bit` models,
- latency: all models are at least `2x` slower than the `16-bit` models due to the lack of optimized kernels (for now),
- device memory: approximately divided by the ratio of float bits to integer bits (e.g. by 2 when going from `16-bit` floats to `int8`).
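
The memory ratio can be estimated with simple arithmetic. The sketch below counts weights only and ignores scales, activations and any modules left unquantized; the parameter count is a round figure assumed for a "7B" model, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory taken by the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


num_parameters = 7_000_000_000  # round figure assumed for a "7B" model

for name, bits in (("float16", 16), ("int8", 8), ("int4", 4), ("int2", 2)):
    print(f"{name}: {weight_memory_gb(num_parameters, bits):.2f} GB")
```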

The section below is just an example. Please refer to the `bench` folder for detailed results per use case and model.

### NousResearch/Llama-2-7b-hf

<div class="row"><center>
  <div class="column">
    <img src="https://github.com/huggingface/quanto/blob/main/bench/generation/charts/NousResearch-Llama-2-7b-hf_Perplexity.png" alt="NousResearch/Llama-2-7b-hf WikiText perplexity">
  </div>
 </center>
</div>

## Installation

Quanto is available as a pip package.

```sh
pip install quanto
```

## Quantization workflow

Quanto does not make a clear distinction between dynamic and static quantization: models are always dynamically quantized,
but their weights can later be "frozen" to integer values.

A typical quantization workflow would consist of the following steps:

**1. Quantize**

The first step converts a standard float model into a dynamically quantized model.

```python
from quanto import qint8, quantize
quantize(model, weights=qint8, activations=qint8)
```

At this stage, only the inference of the model is modified to dynamically quantize the weights.

**2. Calibrate (optional if activations are not quantized)**

Quanto supports a calibration mode that records the activation ranges while representative samples are passed through the quantized model.

```python
with calibration(momentum=0.9):
    model(samples)
```

This automatically activates the quantization of the activations in the quantized modules.


**3. Tune, aka Quantization-Aware-Training (optional)**

If the performance of the model degrades too much, one can tune it for a few epochs to recover the float model performance.

```python
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()
```

**4. Freeze integer weights**

When freezing a model, its float weights are replaced by quantized integer weights.

```python
from quanto import freeze
freeze(model)
```
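
The difference between a dynamically quantized module and a frozen one can be pictured with a toy, pure-Python layer. `ToyQuantizedLinear` is a hypothetical class written for this illustration, not a quanto module: it re-quantizes its float weights on every call until `freeze()` stores the integer values.

```python
class ToyQuantizedLinear:
    """Single-output linear layer that quantizes its weights on every call until frozen."""

    def __init__(self, weights):
        self.weights = list(weights)  # float weights, still tunable
        self.frozen = False
        self.scale = None

    def _quantize(self):
        scale = max(abs(w) for w in self.weights) / 127
        return [round(w / scale) for w in self.weights], scale

    def freeze(self):
        # Replace the float weights by their integer values and keep the scale.
        self.weights, self.scale = self._quantize()
        self.frozen = True

    def __call__(self, inputs):
        if self.frozen:
            qweights, scale = self.weights, self.scale
        else:
            qweights, scale = self._quantize()  # dynamic conversion on each call
        return sum(q * x for q, x in zip(qweights, inputs)) * scale


layer = ToyQuantizedLinear([0.5, -0.2, 0.1])
before = layer([1.0, 2.0, 3.0])
layer.freeze()
after = layer([1.0, 2.0, 3.0])

print(before == after)  # True
print(layer.weights)    # [127, -51, 25]
```

Freezing does not change the outputs. It only removes the per-call conversion and the float copy of the weights, which is why it has to happen after any tuning.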

Please refer to the [examples](https://github.com/huggingface/quanto/tree/main/examples) for instantiations of that workflow.

## Per-axis versus per-tensor

Activations are always quantized per-tensor because most linear algebra operations in a model graph are not compatible with per-axis inputs: you simply cannot add numbers that are not expressed in the same base ("you cannot add apples and oranges").

Weights involved in matrix multiplications are, on the contrary, always quantized along their first axis, because all output features are evaluated independently from one another.
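
The effect of quantizing weights along their first axis can be sketched as follows, with one scale per row (output feature) compared to a single scale for the whole matrix. The helper names and weight values are assumptions made for this illustration.

```python
def quantize_per_axis(matrix, qmax=127):
    """Quantize each row (output feature) of a weight matrix with its own scale."""
    scales = [max(abs(w) for w in row) / qmax for row in matrix]
    quantized = [[round(w / s) for w in row] for row, s in zip(matrix, scales)]
    return quantized, scales


def quantize_per_tensor(matrix, qmax=127):
    """Quantize the whole matrix with a single scale."""
    scale = max(abs(w) for row in matrix for w in row) / qmax
    return [[round(w / scale) for w in row] for row in matrix], scale


# Two output features with very different magnitudes (arbitrary values).
weights = [[2.0, -0.6], [0.02, -0.006]]

per_axis, scales = quantize_per_axis(weights)
per_tensor, scale = quantize_per_tensor(weights)

print(per_axis)    # [[127, -38], [127, -38]]
print(per_tensor)  # [[127, -38], [1, 0]]
```

With a single scale, the second output feature is almost entirely lost. With one scale per row, both features keep the full `int8` resolution.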

The outputs of a quantized matrix multiplication are always dequantized, even if activations are quantized, because:

- the resulting integer values are expressed with a much higher bitwidth (typically `int32`) than the activation bitwidth (typically `int8`),
- they might be combined with a `float` bias.
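
Both points can be seen on a single output feature computed in plain Python. This is a sketch with arbitrary values and a hypothetical `quantize` helper: the integer products are accumulated in a wide integer, then dequantized with the product of the input and weight scales before the float bias is added.

```python
def quantize(values, qmax=127):
    """Symmetric per-tensor quantization: integers plus one float scale."""
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale


inputs = [0.9, -0.4, 0.7]
weights = [0.3, 0.8, -0.6]
bias = 0.05  # kept in float

q_inputs, input_scale = quantize(inputs)
q_weights, weight_scale = quantize(weights)

# A product of two int8 values needs up to 15 bits, so sums go into a wider integer.
accumulator = sum(qi * qw for qi, qw in zip(q_inputs, q_weights))

# Dequantize with the product of both scales, then add the float bias.
output = accumulator * input_scale * weight_scale + bias
reference = sum(i * w for i, w in zip(inputs, weights)) + bias

print(accumulator)  # -10421, far outside the int8 range
print(output, reference)
```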

Quantizing activations per-tensor to `int8` can lead to serious quantization errors if the corresponding tensors contain large outlier values. Typically, this will lead to quantized tensors with most values set to zero (except the outliers).
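
The outlier problem is easy to reproduce with a per-tensor `int8` scale derived from the maximum absolute value. The activation values below are arbitrary, and `quantize_int8` is a hypothetical helper.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization based on the maximum absolute value."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale


# Mostly small activations plus a single large outlier (arbitrary values).
activations = [0.02, -0.15, 0.2, -0.08, 60.0]
quantized, scale = quantize_int8(activations)

print(quantized)  # [0, 0, 0, 0, 127]
```

The outlier forces a quantization step of roughly `0.47`, so every other value rounds to zero.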

A possible solution to work around that issue is to 'smooth' the activations statically as illustrated by [SmoothQuant](https://github.com/mit-han-lab/smoothquant). You can find a script to smooth some model architectures under [external/smoothquant](external/smoothquant).

A better option is to represent activations using `float8`.
