###################################################################################
cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication
###################################################################################
**NVIDIA cuSPARSELt** is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a structured sparse matrix with a 50% sparsity ratio:
.. math::

    D = \mathrm{Activation}(\alpha \, op(A) \cdot op(B) + \beta \, op(C) + bias)

where :math:`op(A)` and :math:`op(B)` denote either the matrix itself (non-transpose) or its transpose, and :math:`\alpha, \beta` are scalars or vectors.
The *cuSPARSELt APIs* allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.
**Download:** `developer.nvidia.com/cusparselt/downloads <https://developer.nvidia.com/cusparselt/downloads>`_
**Provide Feedback:** `Math-Libs-Feedback@nvidia.com <mailto:Math-Libs-Feedback@nvidia.com?subject=cuSPARSELt-Feedback>`_
**Examples**:
`cuSPARSELt Example 1 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul>`_,
`cuSPARSELt Example 2 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul_advanced>`_
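In outline, both samples follow the same call sequence: initialize the library, describe the sparse and dense operands, build a matmul plan, prune and compress the sparse matrix, and execute. The following is a minimal sketch of that flow for an FP16 problem with FP32 accumulation, modeled on the samples linked above; error checking, host/device data setup, and buffer cleanup are omitted, and the device pointers passed in are assumed to be valid allocations.

.. code-block:: cpp

    #include <cuda_runtime.h>
    #include <cuda_fp16.h>
    #include <cusparseLt.h>

    // Sketch: D = alpha * op(A) * op(B) + beta * op(C), with A 2:4 structured sparse.
    void spmma_sketch(int64_t m, int64_t n, int64_t k,
                      __half* dA, __half* dB, __half* dC, __half* dD,
                      cudaStream_t stream) {
        cusparseLtHandle_t handle;
        cusparseLtInit(&handle);

        // A is structured (50%) sparse; B and C/D are dense, all row-major FP16.
        cusparseLtMatDescriptor_t matA, matB, matC;
        cusparseLtStructuredDescriptorInit(&handle, &matA, m, k, k, 16, CUDA_R_16F,
                                           CUSPARSE_ORDER_ROW,
                                           CUSPARSELT_SPARSITY_50_PERCENT);
        cusparseLtDenseDescriptorInit(&handle, &matB, k, n, n, 16, CUDA_R_16F,
                                      CUSPARSE_ORDER_ROW);
        cusparseLtDenseDescriptorInit(&handle, &matC, m, n, n, 16, CUDA_R_16F,
                                      CUSPARSE_ORDER_ROW);

        // Operation descriptor, algorithm selection, and plan.
        cusparseLtMatmulDescriptor_t matmul;
        cusparseLtMatmulDescriptorInit(&handle, &matmul,
                                       CUSPARSE_OPERATION_NON_TRANSPOSE,
                                       CUSPARSE_OPERATION_NON_TRANSPOSE,
                                       &matA, &matB, &matC, &matC,
                                       CUSPARSE_COMPUTE_32F);
        cusparseLtMatmulAlgSelection_t alg;
        cusparseLtMatmulAlgSelectionInit(&handle, &alg, &matmul,
                                         CUSPARSELT_MATMUL_ALG_DEFAULT);
        cusparseLtMatmulPlan_t plan;
        cusparseLtMatmulPlanInit(&handle, &plan, &matmul, &alg);

        // Prune A in place to the 2:4 pattern, then compress it.
        cusparseLtSpMMAPrune(&handle, &matmul, dA, dA,
                             CUSPARSELT_PRUNE_SPMMA_TILE, stream);
        size_t compressedSize, compressedBufferSize;
        cusparseLtSpMMACompressedSize(&handle, &plan,
                                      &compressedSize, &compressedBufferSize);
        void *dA_compressed, *dA_compressedBuffer;
        cudaMalloc(&dA_compressed, compressedSize);
        cudaMalloc(&dA_compressedBuffer, compressedBufferSize);
        cusparseLtSpMMACompress(&handle, &plan, dA, dA_compressed,
                                dA_compressedBuffer, stream);

        // Workspace query and execution.
        size_t workspaceSize;
        cusparseLtMatmulGetWorkspace(&handle, &plan, &workspaceSize);
        void* dWorkspace;
        cudaMalloc(&dWorkspace, workspaceSize);
        float alpha = 1.0f, beta = 0.0f;
        cusparseLtMatmul(&handle, &plan, &alpha, dA_compressed, dB, &beta, dC, dD,
                         dWorkspace, &stream, 1);

        cusparseLtMatmulPlanDestroy(&plan);
        cusparseLtDestroy(&handle);
    }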
**Blog posts**:
- `Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt <https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/>`_
- `Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines <https://developer.nvidia.com/blog/structured-sparsity-in-the-nvidia-ampere-architecture-and-applications-in-search-engines/>`__
- `Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture <https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31552/>`__
================================================================================
Key Features
================================================================================
* *NVIDIA Sparse MMA tensor core* support
* Mixed-precision computation support:
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| Input A/B    | Input C        | Output D        | Compute     | Block scaled                    | Supported SM archs |
+==============+================+=================+=============+=================================+====================+
| `FP32`       | `FP32`         | `FP32`          | `FP32`      | No                              | `8.0, 8.6, 8.7`    |
+--------------+----------------+-----------------+-------------+                                 | `9.0, 10.0, 10.1`  |
| `BF16`       | `BF16`         | `BF16`          | `FP32`      |                                 | `11.0, 12.0, 12.1` |
+--------------+----------------+-----------------+-------------+                                 |                    |
| `FP16`       | `FP16`         | `FP16`          | `FP32`      |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `FP16`       | `FP16`         | `FP16`          | `FP16`      | No                              | `9.0`              |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `INT8`       | `INT8`         | `INT8`          | `INT32`     | No                              | `8.0, 8.6, 8.7`    |
|              +----------------+-----------------+             |                                 | `9.0, 10.0, 10.1`  |
|              | `INT32`        | `INT32`         |             |                                 | `11.0, 12.0, 12.1` |
|              +----------------+-----------------+             |                                 |                    |
|              | `FP16`         | `FP16`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `BF16`         | `BF16`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E4M3`       | `FP16`         | `E4M3`          | `FP32`      | No                              | `9.0, 10.0, 10.1`  |
|              +----------------+-----------------+             |                                 | `11.0, 12.0, 12.1` |
|              | `BF16`         | `E4M3`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `FP16`         | `FP16`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `BF16`         | `BF16`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E5M2`       | `FP16`         | `E5M2`          | `FP32`      | No                              | `9.0, 10.0, 10.1`  |
|              +----------------+-----------------+             |                                 | `11.0, 12.0, 12.1` |
|              | `BF16`         | `E5M2`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `FP16`         | `FP16`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `BF16`         | `BF16`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E4M3`       | `FP16`         | `E4M3`          | `FP32`      | A/B/D_OUT_SCALE =               | `10.0, 10.1, 11.0` |
|              +----------------+-----------------+             | `VEC64_UE8M0`,                  | `12.0, 12.1`       |
|              | `BF16`         | `E4M3`          |             | D_SCALE = `32F`                 |                    |
|              +----------------+-----------------+             +---------------------------------+                    |
|              | `FP16`         | `FP16`          |             | A/B_SCALE = `VEC64_UE8M0`       |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `BF16`         | `BF16`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E2M1`       | `FP16`         | `E2M1`          | `FP32`      | A/B/D_SCALE = `VEC32_UE4M3`,    | `10.0, 10.1, 11.0` |
|              +----------------+-----------------+             | D_SCALE = `32F`                 | `12.0, 12.1`       |
|              | `BF16`         | `E2M1`          |             |                                 |                    |
|              +----------------+-----------------+             +---------------------------------+                    |
|              | `FP16`         | `FP16`          |             | A/B_SCALE = `VEC32_UE4M3`       |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `BF16`         | `BF16`          |             |                                 |                    |
|              +----------------+-----------------+             |                                 |                    |
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
* Matrix pruning and compression functionalities
* Activation functions, bias vector, and output scaling
* Batched computation (multiple matrices in a single run)
* GEMM Split-K mode
* Auto-tuning functionality (see `cusparseLtMatmulSearch()` and the sketch after this list)
* NVTX ranging and logging functionalities
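As an illustration of the epilogue and Split-K hooks, the fragment below extends the plan setup from the sketch above: it fuses a ReLU activation and a bias vector into the epilogue and requests a Split-K configuration. The attribute and function names are the library's own; the wrapper function and the `dBias` buffer are hypothetical, and the descriptors are assumed to be those of the earlier sketch. Note that these attributes must be set before `cusparseLtMatmulPlanInit()`.

.. code-block:: cpp

    #include <cuda_fp16.h>
    #include <cusparseLt.h>

    // Hypothetical helper: call between cusparseLtMatmulDescriptorInit() /
    // cusparseLtMatmulAlgSelectionInit() and cusparseLtMatmulPlanInit().
    void configure_epilogue_and_splitk(cusparseLtHandle_t* handle,
                                       cusparseLtMatmulDescriptor_t* matmul,
                                       cusparseLtMatmulAlgSelection_t* alg,
                                       __half* dBias /* device bias, length m */) {
        // Epilogue: fuse a ReLU activation and a bias vector into the matmul.
        int enable = 1;
        cusparseLtMatmulDescSetAttribute(handle, matmul,
                                         CUSPARSELT_MATMUL_ACTIVATION_RELU,
                                         &enable, sizeof(enable));
        cusparseLtMatmulDescSetAttribute(handle, matmul,
                                         CUSPARSELT_MATMUL_BIAS_POINTER,
                                         &dBias, sizeof(dBias));

        // Split-K: partition the K dimension across thread blocks,
        // which can help when M and N are small relative to K.
        int splitK = 4;
        cusparseLtMatmulAlgSetAttribute(handle, alg,
                                        CUSPARSELT_MATMUL_SPLIT_K,
                                        &splitK, sizeof(splitK));
    }

For auto-tuning, `cusparseLtMatmulSearch()` takes the same argument list as `cusparseLtMatmul()`: once the plan is built and the sparse operand compressed, a single search call times the candidate kernels and stores the fastest configuration in the plan, so it can simply replace the first `cusparseLtMatmul()` call in the sketch above; subsequent calls reuse the selected kernel.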
================================================================================
Support
================================================================================
* *Supported SM Architectures*: `SM 8.0`, `SM 8.6`, `SM 8.7`, `SM 8.9`, `SM 9.0`, `SM 10.0`, `SM 10.1` (CUDA Toolkit 12), `SM 11.0` (CUDA Toolkit 13), `SM 12.0`, `SM 12.1`
* *Supported CPU architectures and operating systems*:
+------------+--------------------+
| OS         | CPU archs          |
+============+====================+
| `Windows`  | `x86_64`           |
+------------+--------------------+
| `Linux`    | `x86_64`, `Arm64`  |
+------------+--------------------+
================================================================================
Documentation
================================================================================
Please refer to https://docs.nvidia.com/cuda/cusparselt/index.html for the cuSPARSELt documentation.
================================================================================
Installation
================================================================================
The cuSPARSELt wheel can be installed as follows:

.. code-block:: bash

    pip install nvidia-cusparselt-cuXX

where XX is the CUDA major version (for CUDA 12, for example, the package is `nvidia-cusparselt-cu12`).
================================================================================
Raw data
================================================================================
{
"_id": null,
"home_page": "https://developer.nvidia.com/cusparselt",
"name": "nvidia-cusparselt-cu12",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "cuda, nvidia, machine learning, high-performance computing",
"author": "NVIDIA Corporation",
"author_email": "cuda_installer@nvidia.com",
"download_url": null,
"platform": null,
"description": "###################################################################################\ncuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication\n###################################################################################\n\n**NVIDIA cuSPARSELt** is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a structured sparse matrix with 50\\% sparsity ratio:\n\n.. math::\n\n D = Activation(\\alpha op(A) \\cdot op(B) + \\beta op(C) + bias)\n\nwhere :math:`op(A)/op(B)` refers to in-place operations such as transpose/non-transpose, and :math:`alpha, beta` are scalars or vectors.\n\nThe *cuSPARSELt APIs* allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.\n\n**Download:** `developer.nvidia.com/cusparselt/downloads <https://developer.nvidia.com/cusparselt/downloads>`_\n\n**Provide Feedback:** `Math-Libs-Feedback@nvidia.com <mailto:Math-Libs-Feedback@nvidia.com?subject=cuSPARSELt-Feedback>`_\n\n**Examples**:\n`cuSPARSELt Example 1 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul>`_,\n`cuSPARSELt Example 2 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul_advanced>`_\n\n**Blog post**:\n\n- `Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt <https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/>`_\n- `Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines <https://developer.nvidia.com/blog/structured-sparsity-in-the-nvidia-ampere-architecture-and-applications-in-search-engines/>`__\n- `Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture <https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31552/>`__\n\n================================================================================\nKey Features\n================================================================================\n\n* *NVIDIA Sparse MMA tensor core* support\n* Mixed-precision computation support:\n\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | Input A/B | Input C | Output D | Compute | Block scaled | Support SM arch |\n +==============+================+=================+=============+=================================+====================+\n | `FP32` | `FP32` | `FP32` | `FP32` | No | |\n +--------------+----------------+-----------------+-------------+ + |\n | `BF16` | `BF16` | `BF16` | `FP32` | | `8.0, 8.6, 8.7` |\n +--------------+----------------+-----------------+-------------+ + `9.0, 10.0, 10.1` |\n | `FP16` | `FP16` | `FP16` | `FP32` | | `11.0, 12.0, 12.1` |\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | `FP16` | `FP16` | `FP16` | `FP16` | No | `9.0` |\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | `INT8` | `INT8` | `INT8` | `INT32` | No | |\n + +----------------+-----------------+ + + `8.0, 8.6, 8.7` +\n | | `INT32` | `INT32` | | | `9.0, 10.0, 10.1` |\n + +----------------+-----------------+ + + `11.0, 12.0, 12.1` +\n | | `FP16` | `FP16` | | | |\n + +----------------+-----------------+ + + +\n | | `BF16` | `BF16` | | | |\n 
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | `INT8` | `INT8` | `INT8` | `INT32` | No | |\n + +----------------+-----------------+ + + `8.0, 8.6, 8.7` +\n | | `INT32` | `INT32` | | | `9.0, 10.0, 10.1` |\n + +----------------+-----------------+ + + `11.0, 12.0, 12.1` +\n | | `FP16` | `FP16` | | | |\n + +----------------+-----------------+ + + +\n | | `BF16` | `BF16` | | | |\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | `E4M3` | `FP16` | `E4M3` | `FP32` | No | `9.0, 10.0, 10.1` |\n + +----------------+-----------------+ + + `11.0, 12.0, 12.1` +\n | | `BF16` | `E4M3` | | | |\n + +----------------+-----------------+ + + +\n | | `FP16` | `FP16` | | | |\n + +----------------+-----------------+ + + +\n | | `BF16` | `BF16` | | | |\n + +----------------+-----------------+ + + +\n | | `FP32` | `FP32` | | | |\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | `E5M2` | `FP16` | `E5M2` | `FP32` | No | `9.0, 10.0, 10.1` |\n + +----------------+-----------------+ + + `11.0, 12.0, 12.1` +\n | | `BF16` | `E5M2` | | | |\n + +----------------+-----------------+ + + +\n | | `FP16` | `FP16` | | | |\n + +----------------+-----------------+ + + +\n | | `BF16` | `BF16` | | | |\n + +----------------+-----------------+ + + +\n | | `FP32` | `FP32` | | | |\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | `E4M3` | `FP16` | `E4M3` | `FP32` | A/B/D_OUT_SCALE = `VEC64_UE8M0` | `10.0, 10.1, 11.0` |\n + +----------------+-----------------+ + + `12.0, 12.1` +\n | | `BF16` | `E4M3` | | D_SCALE = `32F` | |\n + +----------------+-----------------+ +---------------------------------+ +\n | | `FP16` | `FP16` | | A/B_SCALE = `VEC64_UE8M0` | |\n + +----------------+-----------------+ + + +\n | | `BF16` | `BF16` | | | |\n + +----------------+-----------------+ + + +\n | | `FP32` | `FP32` | | | |\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n | `E2M1` | `FP16` | `E2M1` | `FP32` | A/B/D_SCALE = `VEC32_UE4M3` | `10.0, 10.1, 11.0` |\n + +----------------+-----------------+ + + `12.0, 12.1` +\n | | `BF16` | `E2M1` | | D_SCALE = `32F` | |\n + +----------------+-----------------+ +---------------------------------+ +\n | | `FP16` | `FP16` | | A/B_SCALE = `VEC32_UE4M3` | |\n + +----------------+-----------------+ + + +\n | | `BF16` | `BF16` | | | |\n + +----------------+-----------------+ + + +\n | | `FP32` | `FP32` | | | |\n +--------------+----------------+-----------------+-------------+---------------------------------+--------------------+\n\n* Matrix pruning and compression functionalities\n* Activation functions, bias vector, and output scaling\n* Batched computation (multiple matrices in a single run)\n* GEMM Split-K mode\n* Auto-tuning functionality (see `cusparseLtMatmulSearch()`)\n* NVTX ranging and Logging functionalities\n\n================================================================================\nSupport\n================================================================================\n\n* *Supported SM Architectures*: `SM 8.0`, `SM 8.6`, `SM 8.7`, `SM 8.9`, `SM 9.0`, `SM 10.0`, `SM 10.1` (for CTK 12), `SM 11.0` (for CTK 13), `SM 12.0`, `SM 12.1`\n* *Supported CPU architectures and operating 
systems*:\n\n+------------+--------------------+\n| OS | CPU archs |\n+============+====================+\n| `Windows` | `x86_64` |\n+------------+--------------------+\n| `Linux` | `x86_64`, `Arm64` |\n+------------+--------------------+\n\n================================================================================\nDocumentation\n================================================================================\n\nPlease refer to https://docs.nvidia.com/cuda/cusparselt/index.html for the cuSPARSELt documentation.\n\n================================================================================\nInstallation\n================================================================================\n\nThe cuSPARSELt wheel can be installed as follows:\n\n.. code-block:: bash\n\n pip install nvidia-cusparselt-cuXX\n\nwhere XX is the CUDA major version.\n",
"bugtrack_url": null,
"license": "NVIDIA Proprietary Software",
"summary": "NVIDIA cuSPARSELt",
"version": "0.8.1",
"project_urls": {
"Homepage": "https://developer.nvidia.com/cusparselt"
},
"split_keywords": [
"cuda",
" nvidia",
" machine learning",
" high-performance computing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fdf8a809966c96e824b92df09ee3b7032442f5e975d873d7dadfef818d527f48",
"md5": "21432ed00954d8546800e46b302075ed",
"sha256": "5c72f727722f74762380e5f8755557c788b26d8fdcc49df1641c1b08e16d256c"
},
"downloads": -1,
"filename": "nvidia_cusparselt_cu12-0.8.1-py3-none-manylinux2014_aarch64.whl",
"has_sig": false,
"md5_digest": "21432ed00954d8546800e46b302075ed",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 235985605,
"upload_time": "2025-09-05T18:46:39",
"upload_time_iso_8601": "2025-09-05T18:46:39.601790Z",
"url": "https://files.pythonhosted.org/packages/fd/f8/a809966c96e824b92df09ee3b7032442f5e975d873d7dadfef818d527f48/nvidia_cusparselt_cu12-0.8.1-py3-none-manylinux2014_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bb14e46964290aa587cb9fb7df20efdc60528ddd00d291ccffec47617fb06ca3",
"md5": "4c02b8c4e7d06d2cbfeb9305dab5c522",
"sha256": "cd1b1dc9e1ad31ea3353c1f985e2bd6f9e7ae0e797d7e6ce879d7b2ace5e80e8"
},
"downloads": -1,
"filename": "nvidia_cusparselt_cu12-0.8.1-py3-none-manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "4c02b8c4e7d06d2cbfeb9305dab5c522",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 239274390,
"upload_time": "2025-09-05T18:47:44",
"upload_time_iso_8601": "2025-09-05T18:47:44.816777Z",
"url": "https://files.pythonhosted.org/packages/bb/14/e46964290aa587cb9fb7df20efdc60528ddd00d291ccffec47617fb06ca3/nvidia_cusparselt_cu12-0.8.1-py3-none-manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "64f59eefe50ee49fda0657aaa061a56600a519dbc1c772d0df701f80e676c818",
"md5": "d5f8cd23ed53e5cce28cb8a9e58d7709",
"sha256": "2607ec058d53967c9caf0b7a3904ced34bbceaf7944cf9fef6d7f4ec6dab5e3a"
},
"downloads": -1,
"filename": "nvidia_cusparselt_cu12-0.8.1-py3-none-win_amd64.whl",
"has_sig": false,
"md5_digest": "d5f8cd23ed53e5cce28cb8a9e58d7709",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 225678999,
"upload_time": "2025-09-05T18:48:25",
"upload_time_iso_8601": "2025-09-05T18:48:25.074252Z",
"url": "https://files.pythonhosted.org/packages/64/f5/9eefe50ee49fda0657aaa061a56600a519dbc1c772d0df701f80e676c818/nvidia_cusparselt_cu12-0.8.1-py3-none-win_amd64.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-05 18:46:39",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "nvidia-cusparselt-cu12"
}