Name | pyfastkron |
Version | 1.0.1 |
home_page | None |
Summary | A library for efficient matrix and kronecker product matrix multiplication on parallel hardware |
upload_time | 2024-12-17 17:46:25 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT License Copyright (c) 2024 Abhinav Jangda Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | kronecker product, cuda, gpu, kronecker matrix multiplication |
VCS |
 |
bugtrack_url |
|
requirements | numpy, torch, torchvision |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# FastKron
FastKron is a fast library for computing *Generalized Matrix Kronecker-Matrix Multiplication (GeMKM)* and *Generalized Kronecker-Matrix Matrix Multiplication (GeKMM)* on NVIDIA GPUs and X86 CPUs.
FastKron contains specialized algorithms and implementations of GeMKM and GeKMM rather than using existing linear algebra operations.
FastKron avoids extra transposes and adds more optimizations including fusion of multiple kernels.
Therefore, FastKron performs orders of magnitude better than baseline GPyTorch, NVIDIA cuTensor, and HPTT.
FastKron provides a C++ library and a Python library for NumPy and PyTorch autograd functions.
FastKron provides fast implementations for the float and double data types, while the NumPy/PyTorch functions use the shuffle algorithm for other types.
For more details, see [Fast Kronecker Matrix-Matrix Multiplication on GPUs](https://dl.acm.org/doi/abs/10.1145/3627535.3638489).
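To make the two operations concrete, here is a minimal NumPy sketch of the shuffle algorithm that the fallback path is described as using. It assumes GeMKM means `X @ (F1 ⊗ ... ⊗ FN)` and GeKMM means `(F1 ⊗ ... ⊗ FN) @ X`, as the names suggest. The function names `gemkm_shuffle` and `gekmm_shuffle` are hypothetical and are not part of the FastKron or PyFastKron API; this is a reference implementation for illustration, not FastKron's optimized kernels.

```python
import numpy as np


def gemkm_shuffle(x, factors):
    """Compute x @ kron(factors[0], ..., factors[-1]) without forming the Kronecker product."""
    m = x.shape[0]
    y = x
    # Walk the factors from last to first. Each step contracts the
    # fastest-varying column index with one factor, then rotates the
    # newly produced index to the front of the column ordering.
    for f in reversed(factors):
        p, q = f.shape
        y = y.reshape(m, -1, p) @ f              # (m, k, p) @ (p, q) -> (m, k, q)
        y = y.transpose(0, 2, 1).reshape(m, -1)  # reorder columns as (q, k)
    return y


def gekmm_shuffle(factors, x):
    """Compute kron(factors...) @ x using (A kron B)^T == A^T kron B^T."""
    return gemkm_shuffle(x.T, [f.T for f in factors]).T


# Small illustrative shapes: F1 is 2x4, F2 is 3x2, so kron(F1, F2) is 6x8.
rng = np.random.default_rng(0)
f1, f2 = rng.standard_normal((2, 4)), rng.standard_normal((3, 2))
x = rng.standard_normal((5, 2 * 3))
y = gemkm_shuffle(x, [f1, f2])  # same values as x @ np.kron(f1, f2)
```

The point of the shuffle formulation is that the full Kronecker matrix is never materialized; each factor is applied with a small batched matrix multiplication followed by a transpose.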
# Performance
We compare FastKron's GeMKM and GeKMM with the existing shuffle algorithm in GPyTorch based on PyTorch 2.5.1.
The tables below show the range of speedups on different hardware and data types.
### GeMKM
| Hardware | Float | Double |
|----------|----------|--------|
| AMD 64-Core CPU with AVX| 9.3-45x| 5.8-21x|
| AMD 64-Core CPU with AVX512| 9.7-38x| 6.3-21x|
| NVIDIA A100 80 GB| 1.5-9.5x| 1.1-9.5x|
| NVIDIA V100 16 GB| 2.5-10x| 1.9-11x|
### GeKMM
| Hardware | Float | Double |
|----------|----------|--------|
| AMD 64-Core CPU with AVX| 2.7-13.7x| 1.5-7x|
| AMD 64-Core CPU with AVX512| 2.2-14x| 2-7x|
| NVIDIA A100 80 GB|1.3-4.6x |0.9-4.5x |
| NVIDIA V100 16 GB| 1.4-6.4x|2-7.8x |
For more information, see [documents/performance.md](https://github.com/abhijangda/FastKron/blob/main/documents/performance.md).
# Hardware and OS Support
| | Linux | WSL2 | Windows | Mac |
|----------|----------|----------|-------|-----|
| x86 | ✅ | ✅ | 🐍 | 🐍 |
| ARM | 🐍 | 🐍 | 🐍 | 🐍 |
| AVX256 | ✅ | ✅ | 🐍 | 🐍 |
| AVX512 | ✅ |✅ | 🐍 | 🐍|
| SM50+ CUDA cores |✅ | ✅ | 🐍 | 🐍 |
| SM80+ Tensor cores | ❌ | ❌ | 🐍 | 🐍 |
| AMD ROCm | 🐍 | 🐍 | 🐍 | 🐍 |
✅ FastKron supports optimized implementations for AVX256 and AVX512 CPUs and NVIDIA GPUs.\
❌ Tensor cores for double are not supported.\
🐍 Supported in the Python module. x86 CPUs older than GLIBC x86-64-v2, ARM CPUs, AMD GPUs, Windows, and Mac OS are not supported in the C++ API, but PyFastKron *falls back* to the shuffle algorithm in NumPy or PyTorch.
The future roadmap, in order of priority, is: Windows, SM80+ double Tensor cores, AMD GPUs, ARM CPUs.
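The support table can be read as a simple rule: optimized kernels on x86 Linux (including WSL2), the shuffle fallback elsewhere. The sketch below encodes that rule with the standard `platform` module. The helper `expected_backend` and its return strings are hypothetical and not part of PyFastKron. It also ignores the x86-64-v2 requirement and GPU detection, so treat it as a rough guide only.

```python
import platform


def expected_backend(system=None, machine=None):
    """Guess which code path the support table implies for a platform.

    Hypothetical helper for illustration; it is not part of PyFastKron.
    """
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    is_x86_64 = machine in {"x86_64", "amd64"}
    # WSL2 reports itself as Linux, so it is covered by the same check.
    if system == "Linux" and is_x86_64:
        return "native"
    return "shuffle-fallback"


# With no arguments the helper inspects the current machine.
backend = expected_backend()
```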
# Example
The directory `example/` includes examples of using FastKron's CUDA and x86 backends from both C++ and Python.
Before using an example, follow the instructions below to build FastKron.
# Installation
PyFastKron can be installed using pip.
```
pip install pyfastkron
```
PyFastKron's CUDA backend is built with CUDA 12.3 but is compatible with CUDA 11.8 and above.
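After installing, it can be useful to confirm that the distribution is present and that the local CUDA version meets the stated 11.8 minimum. The following is a small sketch using only the standard library. The helpers `installed_version` and `cuda_version_ok` are hypothetical names, and the CUDA version string is assumed to come from elsewhere (for example `torch.version.cuda` on a PyTorch CUDA build).

```python
from importlib import metadata


def installed_version(dist_name="pyfastkron"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def cuda_version_ok(version, minimum=(11, 8)):
    """Check a 'major.minor[.patch]' CUDA version string against a minimum."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= minimum


# None here means pyfastkron is not installed in this environment.
version = installed_version()
```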
# Build
Build the C++ library, libFastKron.so, to use with C++ programs or the Python library, PyFastKron, to use with PyTorch or Numpy programs.
### Required Pre-requisites
On Ubuntu:
```
sudo apt update && sudo apt install gcc linux-headers-$(uname -r) make g++ git python3-dev wget unzip python3-pip build-essential devscripts debhelper fakeroot intel-mkl cmake
```
### CUDA Pre-requisite
Install CUDA 11+ from https://developer.nvidia.com/cuda/ .
### Clone repository
Clone the repository with its submodules using
```
git clone --recurse-submodules https://github.com/abhijangda/fastkron.git
```
If the repository is already cloned and you only need to fetch the submodules, use
```
git submodule update --init --recursive
```
### libFastKron
Build FastKron as a C++ library using the commands below:
```
mkdir build/
cd build/
cmake ..
make -j
```
To install, run
```
make install
```
By default, both the x86 and CUDA backends are built. Use the CMake option `-DENABLE_CUDA=OFF` to disable the CUDA backend or `-DENABLE_X86=OFF` to disable the x86 backend.
Run X86 CPU tests using
```
make run-x86-tests
```
Run CUDA tests using
```
make run-cuda-tests
```
### PyFastKron
Install PyFastKron using pip
```
pip install .
```
Run tests using
```
pytest
```
# Documentation
C++ API: [documents/cpp-api.md](https://github.com/abhijangda/FastKron/blob/main/documents/cpp-api.md)\
Python API: [documents/python-api.md](https://github.com/abhijangda/FastKron/blob/main/documents/python-api.md)\
Kernel Tuning: [documents/autotuning.md](https://github.com/abhijangda/FastKron/blob/main/documents/autotuning.md)\
Performance: [documents/performance.md](https://github.com/abhijangda/FastKron/blob/main/documents/performance.md)\
Multi-GPU: [documents/multigpu.md](https://github.com/abhijangda/FastKron/blob/main/documents/multigpu.md)\
Contributing: [documents/contributing.md](https://github.com/abhijangda/FastKron/blob/main/documents/contributing.md)
# Citation
```
@inproceedings{10.1145/3627535.3638489,
author = {Jangda, Abhinav and Yadav, Mohit},
title = {Fast Kronecker Matrix-Matrix Multiplication on GPUs},
year = {2024},
isbn = {9798400704352},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3627535.3638489},
doi = {10.1145/3627535.3638489},
booktitle = {Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming},
pages = {390–403},
numpages = {14},
keywords = {graphics processing units, CUDA, kronecker product, linear algebra},
location = {Edinburgh, United Kingdom},
series = {PPoPP '24}
}
```
Raw data
{
"_id": null,
"home_page": null,
"name": "pyfastkron",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Abhinav Jangda <abhijangda@gmail.com>",
"keywords": "kronecker product, cuda, gpu, kronecker matrix multiplication",
"author": null,
"author_email": "Abhinav Jangda <abhijangda@gmail.com>",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024 Abhinav Jangda Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "A library for efficient matrix and kronecker product matrix multiplication on parallel hardware",
"version": "1.0.1",
"project_urls": {
"Documentation": "https://github.com/abhijangda/fastkron",
"Homepage": "https://github.com/abhijangda/fastkron",
"Repository": "https://github.com/abhijangda/fastkron"
},
"split_keywords": [
"kronecker product",
" cuda",
" gpu",
" kronecker matrix multiplication"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "99684b5fbadca0665e3bd4391223a9d73c334aa26262da7aca45dffe368c3d67",
"md5": "7140f398b1eba2a23272ca27a09f7189",
"sha256": "1e5060aab65919caea3afdf1c0896e44ee9bd428bb2dcb67f5d1a2e5215932ec"
},
"downloads": -1,
"filename": "pyfastkron-1.0.1-cp310-cp310-manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "7140f398b1eba2a23272ca27a09f7189",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.9",
"size": 68616412,
"upload_time": "2024-12-17T17:46:25",
"upload_time_iso_8601": "2024-12-17T17:46:25.770215Z",
"url": "https://files.pythonhosted.org/packages/99/68/4b5fbadca0665e3bd4391223a9d73c334aa26262da7aca45dffe368c3d67/pyfastkron-1.0.1-cp310-cp310-manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c7fa30958f94821b6c14eb9514ef378e7916e83f6966b796160354dba5e2b9bb",
"md5": "96e7caf39337ef9a5010f0297660c31e",
"sha256": "056a8ce195c0d25c7797d9da4f6a09007cdfd23136f046ce2f31ce827ebddd1c"
},
"downloads": -1,
"filename": "pyfastkron-1.0.1-cp311-cp311-manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "96e7caf39337ef9a5010f0297660c31e",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.9",
"size": 68638316,
"upload_time": "2024-12-17T17:46:49",
"upload_time_iso_8601": "2024-12-17T17:46:49.522210Z",
"url": "https://files.pythonhosted.org/packages/c7/fa/30958f94821b6c14eb9514ef378e7916e83f6966b796160354dba5e2b9bb/pyfastkron-1.0.1-cp311-cp311-manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "13a74e2561c0aeb5bbb279def9738b116f9b4115d4f2f665d474c3b1e25dd74a",
"md5": "60fa0e6852c93785cfab8939ccaa4b73",
"sha256": "48d383d533a9c237a6909c60682ad2f31425cbd010b5fa744ec2630678704461"
},
"downloads": -1,
"filename": "pyfastkron-1.0.1-cp312-cp312-manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "60fa0e6852c93785cfab8939ccaa4b73",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.9",
"size": 68637992,
"upload_time": "2024-12-17T17:47:09",
"upload_time_iso_8601": "2024-12-17T17:47:09.405943Z",
"url": "https://files.pythonhosted.org/packages/13/a7/4e2561c0aeb5bbb279def9738b116f9b4115d4f2f665d474c3b1e25dd74a/pyfastkron-1.0.1-cp312-cp312-manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "32de8cacaa3658968265c6df68dfe65946401673b150c93fbafe3c3f8fce0afc",
"md5": "737037eccb81241e70a44e1fdcf1dbfe",
"sha256": "db3b24f45c28b92c84da942ed7212b6bb532cb0cb9283ac6408377e366248451"
},
"downloads": -1,
"filename": "pyfastkron-1.0.1-cp39-cp39-manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "737037eccb81241e70a44e1fdcf1dbfe",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.9",
"size": 68640046,
"upload_time": "2024-12-17T17:45:21",
"upload_time_iso_8601": "2024-12-17T17:45:21.731579Z",
"url": "https://files.pythonhosted.org/packages/32/de/8cacaa3658968265c6df68dfe65946401673b150c93fbafe3c3f8fce0afc/pyfastkron-1.0.1-cp39-cp39-manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "681f4225209f2356f81095a50a3a8718b2ecd5a14b8c8bc1669dbba3c143c792",
"md5": "fc01000d68aeba64c7853f44072c4e51",
"sha256": "600f33c84967e12106e7e2b25f583422bf4a1a1f8dc887b5e8df54fa9bba2082"
},
"downloads": -1,
"filename": "pyfastkron-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fc01000d68aeba64c7853f44072c4e51",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 17757,
"upload_time": "2024-12-17T17:47:24",
"upload_time_iso_8601": "2024-12-17T17:47:24.186623Z",
"url": "https://files.pythonhosted.org/packages/68/1f/4225209f2356f81095a50a3a8718b2ecd5a14b8c8bc1669dbba3c143c792/pyfastkron-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-17 17:46:25",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "abhijangda",
"github_project": "fastkron",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "numpy",
"specs": [
[
">=",
"1.0"
]
]
},
{
"name": "torch",
"specs": [
[
">=",
"1.10"
]
]
},
{
"name": "torchvision",
"specs": [
[
">=",
"0.1"
]
]
}
],
"lcname": "pyfastkron"
}
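The `digests` entries in the raw data above can be used to check a downloaded wheel before installing it. Below is a minimal sketch with `hashlib`. The helper names `sha256_of` and `wheel_matches` are hypothetical, and the file path is whatever location you downloaded the wheel to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def wheel_matches(path, expected_sha256):
    """Compare a file's SHA-256 digest with the value listed in the metadata."""
    return sha256_of(path) == expected_sha256.lower()
```

For example, `wheel_matches(path, digest)` would be called with the local path of a downloaded wheel and the `sha256` string listed for that file in the `urls` array.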