compressed-tensors-nightly

Name: compressed-tensors-nightly
Version: 0.8.0.20241114
Home page: https://github.com/neuralmagic/compressed-tensors
Summary: Library for utilization of compressed safetensors of neural network models
Upload time: 2024-11-14 01:07:42
Maintainer: None
Docs URL: None
Author: Neuralmagic, Inc.
Requires Python: None
License: Apache 2.0
Requirements: No requirements were recorded.
# compressed-tensors

The `compressed-tensors` library extends the [safetensors](https://github.com/huggingface/safetensors) format, providing a versatile and efficient way to store and manage compressed tensor data. This library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations like GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.

## Why `compressed-tensors`?

As model compression becomes increasingly important for efficient deployment of LLMs, the landscape of quantization and compression techniques has become increasingly fragmented.
Each method often comes with its own storage format and loading procedures, making it challenging to work with multiple techniques or switch between them.
`compressed-tensors` addresses this by providing a single, extensible format that can represent a wide variety of compression schemes. 

* **Unified Checkpoint Format**: Supports various compression schemes in a single, consistent format.
* **Wide Compatibility**: Works with popular quantization methods like GPTQ, SmoothQuant, and FP8. See [llm-compressor](https://github.com/vllm-project/llm-compressor).
* **Flexible Quantization Support**: 
  * Weight-only quantization (e.g., W4A16, W8A16, WnA16)
  * Activation quantization (e.g., W8A8)
  * KV cache quantization
  * Non-uniform schemes (different layers can be quantized in different ways!)
* **Sparsity Support**: Handles both unstructured and semi-structured (e.g., 2:4) sparsity patterns.
* **Open-Source Integration**: Designed to work seamlessly with Hugging Face models and PyTorch.

This allows developers and researchers to easily experiment with composing different quantization methods, simplify model deployment pipelines, and reduce the overhead of supporting multiple compression formats in inference engines.
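
To make the "non-uniform schemes" bullet above concrete, the sketch below shows one way a mixed, weight-only scheme could be declared with the library's config objects. It is illustrative only: the class and field names (`QuantizationConfig`, `QuantizationScheme`, `QuantizationArgs`, `config_groups`, `ignore`) follow recent `compressed-tensors` releases, and the `re:`-prefixed target names are hypothetical Llama-style module names, so adapt them to your model and installed version.

```python
# A hedged sketch: express a non-uniform, weight-only quantization scheme
# with compressed-tensors config objects (names as in recent releases).
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationConfig,
    QuantizationScheme,
)

config = QuantizationConfig(
    config_groups={
        # 4-bit grouped weights for the attention projections
        # (hypothetical regex targets for a Llama-style model)
        "group_0": QuantizationScheme(
            targets=["re:.*q_proj", "re:.*k_proj", "re:.*v_proj", "re:.*o_proj"],
            weights=QuantizationArgs(
                num_bits=4, type="int", symmetric=True,
                strategy="group", group_size=128,
            ),
        ),
        # 8-bit channel-wise weights for the MLP projections
        "group_1": QuantizationScheme(
            targets=["re:.*gate_proj", "re:.*up_proj", "re:.*down_proj"],
            weights=QuantizationArgs(
                num_bits=8, type="int", symmetric=True, strategy="channel",
            ),
        ),
    },
    # modules listed here are left unquantized
    ignore=["lm_head"],
)
```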

## Installation

### From [PyPI](https://pypi.org/project/compressed-tensors)

Stable release:
```bash
pip install compressed-tensors
```

Nightly release:
```bash
pip install compressed-tensors-nightly
```

### From Source

```bash
git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .
```
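
As a quick sanity check after any of the installs above, importing the package and printing its version should work (this assumes the package exposes a `__version__` attribute, as recent releases do):

```python
# minimal install check; __version__ is assumed to exist in the installed release
import compressed_tensors

print(compressed_tensors.__version__)
```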

## Getting started

### Saving/Loading Compressed Tensors (Bitmask Compression)

The `save_compressed` function applies the compression specified by its `compression_format` argument before writing tensors to disk.
The `load_compressed` function reverses the process, turning the compressed weights on disk back into decompressed tensors in device memory.

```python
from compressed_tensors import save_compressed, load_compressed, BitmaskConfig
from torch import Tensor
from typing import Dict

# the example BitmaskConfig compression format efficiently compresses
# tensors with a large number of zero entries
compression_config = BitmaskConfig()

tensors: Dict[str, Tensor] = {"tensor_1": Tensor(
    [[0.0, 0.0, 0.0], 
     [1.0, 1.0, 1.0]]
)}
# compress tensors using BitmaskConfig compression format (save them efficiently on disk)
save_compressed(tensors, "model.safetensors", compression_format=compression_config.format)

# decompress tensors (load_compressed returns a generator for memory efficiency)
decompressed_tensors = {}
for tensor_name, tensor in load_compressed("model.safetensors", compression_config=compression_config):
    decompressed_tensors[tensor_name] = tensor
```
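
Bitmask compression stores the nonzero values together with a bitmask of their positions, so the round trip above should be lossless. A small sanity-check sketch, continuing the snippet:

```python
import torch

# the decompressed tensor should match the original exactly (lossless round trip)
assert torch.equal(decompressed_tensors["tensor_1"], tensors["tensor_1"])
```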

## Saving/Loading Compressed Models (Bitmask Compression)

We can apply bitmask compression to a whole model. For a more detailed example, see the `examples` directory.
```python
from compressed_tensors import save_compressed_model, load_compressed, BitmaskConfig
from transformers import AutoModelForCausalLM

model_name = "neuralmagic/llama2.c-stories110M-pruned50"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

original_state_dict = model.state_dict()

compression_config = BitmaskConfig()

# save compressed model weights
save_compressed_model(model, "compressed_model.safetensors", compression_format=compression_config.format)

# load compressed model weights (`dict` turns generator into a dictionary)
state_dict = dict(load_compressed("compressed_model.safetensors", compression_config))
```
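
Continuing the snippet, the otherwise unused `original_state_dict` can serve as a sanity check on the round trip. This is a hedged sketch; it assumes `load_compressed` yields the same parameter names that `model.state_dict()` uses:

```python
import torch

# sketch: verify the compressed -> decompressed round trip is lossless,
# assuming parameter names line up between the two state dicts
for name, original_tensor in original_state_dict.items():
    assert torch.equal(original_tensor, state_dict[name])
```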

For a more in-depth tutorial on bitmask compression, refer to the [notebook](https://github.com/neuralmagic/compressed-tensors/blob/d707c5b84bc3fef164aebdcd97cb6eaa571982f8/examples/bitmask_compression.ipynb).


## Saving a Compressed Model with PTQ

We can use `compressed-tensors` to run basic post-training quantization (PTQ) and save the quantized model compressed on disk:

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, DefaultDataCollator

from compressed_tensors.compressors import ModelCompressor
from compressed_tensors.quantization import (
    QuantizationConfig,
    QuantizationStatus,
    apply_quantization_config,
    compress_quantized_weights,
    freeze_module_quantization,
)

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype="auto")

config = QuantizationConfig.parse_file("./examples/bit_packing/int4_config.json")
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

dataset = load_dataset("ptb_text_only")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

# drop the raw text column so only tokenized model inputs reach the collator
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["sentence"])
data_loader = DataLoader(tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator())

with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), desc="Running calibration"):
        sample = {key: value.to(device) for key, value in sample.items()}
        _ = model(**sample)

        if idx >= 512:
            break

model.apply(freeze_module_quantization)
model.apply(compress_quantized_weights)

output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compressor = ModelCompressor(quantization_config=config)
compressed_state_dict = compressor.compress(model)
model.save_pretrained(output_dir, state_dict=compressed_state_dict)
```
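
Depending on the version, the compressor can also record its compression metadata in the checkpoint's `config.json`, and saving the tokenizer alongside the weights makes the output directory self-contained. A hedged sketch, assuming `ModelCompressor.update_config` is available in your release:

```python
# sketch: write the compression config into config.json so loaders can
# recognize the checkpoint format (update_config assumed available),
# and save the tokenizer next to the compressed weights
compressor.update_config(output_dir)
tokenizer.save_pretrained(output_dir)
```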

For a more in-depth tutorial on quantization compression, refer to the [notebook](./examples/quantize_and_pack_int4.ipynb).

            
