# Optimum Quanto
🤗 Optimum Quanto is a pytorch quantization backend for [optimum](https://huggingface.co/docs/optimum/en/index).
It has been designed with versatility and simplicity in mind:
- all features are available in eager mode (works with non-traceable models),
- quantized models can be placed on any device (including CUDA and MPS),
- automatically inserts quantization and dequantization stubs,
- automatically inserts quantized functional operations,
- automatically inserts quantized modules (see below the list of supported modules),
- provides a seamless workflow from a float model to a dynamically quantized model, and then to a statically quantized model,
- serialization compatible with pytorch `weights_only` and 🤗 `safetensors`,
- accelerated matrix multiplications on CUDA devices (int8-int8, fp16-int4, bf16-int8, bf16-int4),
- supports int2, int4, int8 and float8 weights,
- supports int8 and float8 activations.
Features yet to be implemented:
- dynamic activations smoothing,
- kernels for all mixed matrix multiplications on all devices,
- compatibility with [torch compiler](https://pytorch.org/docs/stable/torch.compiler.html) (aka dynamo).
## Performance
In a nutshell:
- accuracy: models quantized with `int8`/`float8` weights and `float8` activations are very close to the full-precision models,
- latency: whenever optimized kernels are available, the inference latency of a quantized model is comparable to that of the full-precision model when only the model weights are quantized,
- device memory: approximately divided by the ratio of float bits to integer bits (for example, by 4 when going from 16-bit floats to `int4`).
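As a rough illustration of the memory rule above, the following sketch estimates the weight storage of a model at two bitwidths. The helper name `weight_memory_gib` and the 8-billion-parameter figure are illustrative assumptions; the estimate ignores scales, zero-points and any layer left unquantized.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and zero-points."""
    return num_params * bits_per_weight / 8 / 2**30

num_params = 8_000_000_000  # illustrative parameter count, not a measured value
bf16 = weight_memory_gib(num_params, 16)
int4 = weight_memory_gib(num_params, 4)
print(f"bf16: {bf16:.1f} GiB, int4: {int4:.1f} GiB, ratio: {bf16 / int4:.0f}x")
```

In practice the saving is somewhat smaller, because the quantization scales and any excluded modules still have to be stored.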
The section below is only an example. Please refer to the `bench` folder for detailed results per use case or model.
### meta-llama/Meta-Llama-3.1-8B
<div class="row"><center>
<div class="column">
<img src="https://github.com/huggingface/optimum-quanto/blob/main/bench/generation/charts/meta-llama-Meta-Llama-3.1-8B_bf16_Perplexity.png" alt="meta-llama/Meta-Llama-3.1-8B WikiText perplexity">
</div>
</center>
</div>
<div class="row"><center>
<div class="column">
<img src="https://github.com/huggingface/optimum-quanto/blob/main/bench/generation/charts/meta-llama-Meta-Llama-3.1-8B_bf16_Latency__ms_.png" alt="meta-llama/Meta-Llama-3.1-8B Latency">
</div>
</center>
</div>
## Installation
Optimum Quanto is available as a pip package.
```sh
pip install optimum-quanto
```
## Quantization workflow for Hugging Face models
`optimum-quanto` provides helper classes to quantize, save and reload Hugging Face quantized models.
### LLM models
The first step is to quantize the model:
```python
from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint4
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B')
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint4, exclude='lm_head')
```
Note: the quantized weights of the model will be frozen. If you want to keep them unfrozen in order to train them, you need to use `optimum.quanto.quantize` directly.
The quantized model can be saved using `save_pretrained`:
```python
qmodel.save_pretrained('./Llama-3-8B-quantized')
```
It can later be reloaded using `from_pretrained`:
```python
from optimum.quanto import QuantizedModelForCausalLM
qmodel = QuantizedModelForCausalLM.from_pretrained('Llama-3-8B-quantized')
```
### Diffusers models
You can quantize any of the submodels inside a diffusers pipeline and seamlessly include them later in another pipeline.
Here we quantize the `transformer` of a `PixArt` pipeline.
```python
from diffusers import PixArtTransformer2DModel
from optimum.quanto import QuantizedPixArtTransformer2DModel, qfloat8
model = PixArtTransformer2DModel.from_pretrained("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="transformer")
qmodel = QuantizedPixArtTransformer2DModel.quantize(model, weights=qfloat8)
qmodel.save_pretrained("./pixart-sigma-fp8")
```
Later, we can reload the quantized model and recreate the pipeline:
```python
import torch
from diffusers import PixArtSigmaPipeline
from optimum.quanto import QuantizedPixArtTransformer2DModel

# Reload the quantized transformer and move it to the GPU
transformer = QuantizedPixArtTransformer2DModel.from_pretrained("./pixart-sigma-fp8")
transformer.to(device="cuda")
# Build the pipeline without its transformer, then plug in the quantized one
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    transformer=None,
    torch_dtype=torch.float16,
).to("cuda")
pipe.transformer = transformer
```
## Quantization workflow for vanilla pytorch models (low-level API)
One thing to keep in mind when using the low-level quanto API is that, by default, model
weights are dynamically quantized: an explicit call must be made to 'freeze' the quantized weights.
A typical quantization workflow would consist of the following steps:
**1. Quantize**
The first step converts a standard float model into a dynamically quantized model.
```python
from optimum.quanto import quantize, qint8
quantize(model, weights=qint8, activations=qint8)
```
At this stage, only the inference of the model is modified to dynamically quantize the weights.
**2. Calibrate (optional if activations are not quantized)**
Quanto supports a calibration mode that records the activation ranges while representative samples are passed through the quantized model.
```python
from optimum.quanto import Calibration
with Calibration(momentum=0.9):
    model(samples)
```
This automatically activates the quantization of the activations in the quantized modules.
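To give an intuition for the `momentum` argument, here is a minimal sketch of a momentum-based running estimate of an activation scale. It assumes an exponential moving average of the per-batch absolute maximum, which is a common way to smooth such statistics; it is not a description of Quanto's actual implementation, and `update_scale` is a hypothetical helper.

```python
def update_scale(running, batch, momentum=0.9, qmax=127):
    """Blend the previous scale with the scale implied by the current batch."""
    batch_scale = max(abs(v) for v in batch) / qmax
    if running is None:  # first batch: nothing to average with yet
        return batch_scale
    return momentum * running + (1.0 - momentum) * batch_scale

scale = None
# Three made-up batches of activations; the second contains a larger value.
for batch in ([0.5, -1.0, 0.25], [2.0, -0.5, 1.0], [0.1, 0.3, -0.2]):
    scale = update_scale(scale, batch)
print(scale)
```

With a high momentum, a single unusual batch only nudges the estimate, which is why representative samples matter more than any individual one.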
**3. Tune, aka Quantization-Aware-Training (optional)**
If the performance of the model degrades too much, one can tune it for a few epochs to recover the float model performance.
```python
import torch
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()
```
**4. Freeze integer weights**
When freezing a model, its float weights are replaced by quantized integer weights.
```python
from optimum.quanto import freeze
freeze(model)
```
**5. Serialize quantized model**
The weights of a quantized model can be serialized to a `state_dict` and saved to a file.
Both `pickle` and `safetensors` (recommended) are supported.
```python
from safetensors.torch import save_file
save_file(model.state_dict(), 'model.safetensors')
```
In order to be able to reload these weights, you also need to store the quantized
model quantization map.
```python
import json
from optimum.quanto import quantization_map
with open('quantization_map.json', 'w') as f:
    json.dump(quantization_map(model), f)
```
**6. Reload a quantized model**
A serialized quantized model can be reloaded from a `state_dict` and a `quantization_map` using the `requantize` helper.
Note that you first need to instantiate an empty model.
```python
import json

import torch
from safetensors.torch import load_file
from optimum.quanto import requantize

state_dict = load_file('model.safetensors')
with open('quantization_map.json', 'r') as f:
    quantization_map = json.load(f)

# Create an empty model from your modeling code and requantize it
with torch.device('meta'):
    new_model = ...
requantize(new_model, state_dict, quantization_map, device=torch.device('cuda'))
```
Please refer to the [examples](https://github.com/huggingface/quanto/tree/main/examples) for instantiations of that workflow.
## Design overview
### Tensors
At the heart of quanto is a Tensor subclass that corresponds to:
- the projection of a source Tensor into the optimal range for a given destination type,
- the mapping of projected values to the destination type.
For floating-point destination types, the mapping is done by the native pytorch cast (i.e. `Tensor.to()`).
For integer destination types, the mapping is a simple rounding operation (i.e. `torch.round()`).
The goal of the projection is to increase the accuracy of the conversion by minimizing the number of:
- saturated values (i.e. mapped to the destination type min/max),
- zeroed values (because they are below the smallest number that can be represented by the destination type)
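The projection and mapping steps can be sketched in plain Python for a symmetric `int8` destination type. The helper names `quantize_symmetric` and `dequantize` are hypothetical, and the code only illustrates the arithmetic; it is not Quanto's Tensor subclass.

```python
def quantize_symmetric(values, bits=8):
    """Project values onto the signed integer range, then round them.

    Assumes at least one non-zero value, otherwise the scale would be zero.
    """
    qmax = 2 ** (bits - 1) - 1                   # 127 for int8
    scale = max(abs(v) for v in values) / qmax   # projection: largest value maps to qmax
    # Mapping: round to the nearest integer and saturate at the range limits.
    data = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return data, scale

def dequantize(data, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in data]

values = [0.02, -0.5, 1.27, -1.0]
data, scale = quantize_symmetric(values)
restored = dequantize(data, scale)
print(data, restored)
```

Choosing the scale from the largest absolute value avoids saturation entirely, at the cost of coarser steps for small values.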
The projection is symmetric per-tensor or per-channel for `int8` and `float8`, and group-wise affine (with a shift or 'zero-point') for lower bitwidth.
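For lower bitwidths, a group-wise affine projection could look like the sketch below. The group size of 4, the unsigned `[0, 3]` range for 2-bit values and the helper names `quantize_affine_groups` and `dequantize_affine` are illustrative assumptions, not Quanto's internal layout.

```python
def quantize_affine_groups(values, bits=2, group_size=4):
    """Quantize each group with its own scale and zero-point (shift)."""
    qmax = 2 ** bits - 1                     # unsigned range [0, qmax]
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0      # avoid a zero scale for constant groups
        zero_point = round(-lo / scale)      # integer code that represents 0.0
        data = [max(0, min(qmax, round(v / scale) + zero_point)) for v in group]
        groups.append((data, scale, zero_point))
    return groups

def dequantize_affine(data, scale, zero_point):
    """Undo the shift, then rescale."""
    return [(q - zero_point) * scale for q in data]

# Two groups with very different ranges share the same four 2-bit codes.
weights = [0.0, 0.1, 0.2, 0.3, 10.0, 11.0, 12.0, 13.0]
groups = quantize_affine_groups(weights)
for data, scale, zero_point in groups:
    print(data, dequantize_affine(data, scale, zero_point))
```

The shift lets each group spend its few available codes on the range its values actually occupy, instead of a range centered on zero.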
One of the benefits of using a lower-bitwidth representation is that you can take advantage of accelerated operations
for the destination type, which are typically faster than their higher-precision equivalents.
Quanto does not support the conversion of a Tensor using mixed destination types.
### Modules
Quanto provides a generic mechanism to replace `torch` modules by `optimum-quanto` modules that are able to process quanto tensors.
`optimum-quanto` modules dynamically convert their weights until a model is frozen, which slows down inference a bit but is
required if the model needs to be tuned.
Weights are usually quantized per-channel along the first dimension (output features).
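Per-channel quantization along the first dimension means that each row of an `(out_features, in_features)` weight matrix gets its own scale. A minimal sketch with nested lists, where `quantize_per_channel` is a hypothetical helper rather than a Quanto function:

```python
def quantize_per_channel(weight, bits=8):
    """Quantize each output row of a 2D weight with its own symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    scales = [max(abs(w) for w in row) / qmax for row in weight]
    data = [[round(w / s) for w in row] for row, s in zip(weight, scales)]
    return data, scales

# Two output features with very different magnitudes (made-up values).
weight = [[0.005, -0.02, 0.015],
          [2.0, -8.0, 6.0]]
data, scales = quantize_per_channel(weight)
print(data, scales)
```

Both rows use the full integer range here; with a single per-tensor scale, the first row would be rounded to zeros.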
Biases are not converted to preserve the accuracy of a typical `addmm` operation.
Explanation: to be consistent with the unquantized arithmetic operations, biases would need to be quantized with a scale that
is equal to the product of the input and weight scales, which leads to a ridiculously small scale, and conversely
requires a very high bitwidth to avoid clipping. Typically, with `int8` inputs and weights, biases would need to be quantized
with at least `12` bits, i.e. in `int16`. Since most biases are today `float16`, this is a waste of time.
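The following arithmetic sketch illustrates the argument. The scales are illustrative assumptions (inputs in `[-1, 1]` and weights in `[-0.5, 0.5]`, both quantized to `int8`), and the bias value is made up.

```python
import math

input_scale = 1.0 / 127     # assumed int8 scale for inputs in [-1, 1]
weight_scale = 0.5 / 127    # assumed int8 scale for weights in [-0.5, 0.5]
bias_scale = input_scale * weight_scale  # scale of the integer accumulator

bias = 0.05                 # illustrative float bias value
bias_q = round(bias / bias_scale)
# Magnitude bits plus one sign bit.
bits_needed = math.ceil(math.log2(abs(bias_q) + 1)) + 1
print(bias_scale, bias_q, bits_needed)
```

Even this small bias value maps to an integer far outside the `int8` range, so it would have to be stored in a wider integer type.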
Activations are dynamically quantized per-tensor using static scales (defaults to the range `[-1, 1]`).
To preserve accuracy, the model needs to be calibrated to evaluate the best activation scales (using a momentum).
The following modules can be quantized:
- [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) (QLinear).
Weights are always quantized, and biases are not quantized. Inputs and outputs can be quantized.
- [Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) (QConv2d).
Weights are always quantized, and biases are not quantized. Inputs and outputs can be quantized.
- [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) (QLayerNorm).
Weights and biases are __not__ quantized. Outputs can be quantized.
## Pitfalls to avoid when quantizing activations
Activations are always quantized per-tensor because most linear algebra operations in a model graph are not compatible
with per-axis inputs: you simply cannot add numbers that are not expressed in the same base ("you cannot add apples and oranges").
Weights involved in matrix multiplications are, on the contrary, always quantized along their first axis, because all output features
are evaluated independently from one another.
The outputs of a quantized matrix multiplication are always dequantized, even if activations are quantized, because:
- the resulting accumulated values are expressed with a much higher bitwidth (typically `int32` or `float32`) than the activation bitwidth (typically `int8` or `float8`),
- they might be combined with a `float` bias.
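A minimal sketch of that dequantization step for a single output feature, assuming `int8` inputs and weights with one scale each. All values and names are illustrative; real kernels operate on whole tensors.

```python
# Quantized int8 operands and their (assumed) scales.
x_q, x_scale = [100, -50, 25], 0.01     # represents [1.0, -0.5, 0.25]
w_q, w_scale = [64, 127, -32], 0.005    # represents [0.32, 0.635, -0.16]
bias = 0.1                              # biases stay in float

# Integer accumulation: partial sums exceed the int8 range, hence wider accumulators.
acc = sum(x * w for x, w in zip(x_q, w_q))

# Dequantize with the product of the two scales, then add the float bias.
output = acc * (x_scale * w_scale) + bias
print(acc, output)
```

Requantizing `output` to `int8` would need a new scale, which is why it is left to the next quantized module to decide.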
Quantizing activations per-tensor to `int8` can lead to serious quantization errors if the corresponding tensors contain large outlier values.
Typically, this will lead to quantized tensors with most values set to zero (except the outliers).
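The effect of an outlier can be reproduced with a few lines of plain Python. The activation values below are made up for illustration, and `quantize_int8` is a hypothetical helper.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: one scale for all values."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

activations = [0.02, -0.05, 0.1, -0.08, 0.03, 60.0]  # one large outlier
data, scale = quantize_int8(activations)
zeros = sum(1 for q in data if q == 0)
print(data, f"{zeros}/{len(data)} values collapsed to zero")
```

The outlier dictates the scale, so every other value falls below half a quantization step and is rounded to zero.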
A possible solution to work around that issue is to 'smooth' the activations statically as illustrated by [SmoothQuant](https://github.com/mit-han-lab/smoothquant).
You can find a script to smooth some model architectures under [external/smoothquant](external/smoothquant).
A better option is to represent activations using `float8`.