# da4ml: Distributed Arithmetic for Machine Learning
This project performs Constant Matrix-Vector Multiplication (CMVM) with Distributed Arithmetic (DA) for Machine Learning (ML) on Field Programmable Gate Arrays (FPGAs).
CMVM optimization is done through greedy Common Subexpression Elimination (CSE) of two-term subexpressions, optionally with Delay Constraints (DC). The optimization runs in jitted Python (Numba), and the list of optimized operations is emitted as traced Python code.
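As a rough illustration of what two-term CSE buys (this is not da4ml's internal representation), consider a constant kernel of shape `(3, 2)` with columns `[1, 1, 0]` and `[1, 1, 2]`: the shared subexpression `x0 + x1` only needs to be computed once.

```python
# Hedged illustration of two-term CSE on a tiny constant kernel.
# Kernel columns are [1, 1, 0] and [1, 1, 2]; not da4ml's actual code.

def cmvm_naive(x0: int, x1: int, x2: int) -> tuple[int, int]:
    y0 = x0 + x1              # 1 adder
    y1 = x0 + x1 + (x2 << 1)  # 2 adders; x0 + x1 is recomputed
    return y0, y1

def cmvm_cse(x0: int, x1: int, x2: int) -> tuple[int, int]:
    t = x0 + x1               # shared two-term subexpression, computed once
    y0 = t
    y1 = t + (x2 << 1)        # 1 adder; one adder saved overall
    return y0, y1
```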
At the moment, the project only generates Vitis HLS C++ code for the FPGA implementation of the optimized CMVM kernel. HDL code generation is planned for the future. Currently, the major use of this repository is through the `distributed_arithmetic` strategy in the [`hls4ml`](https://github.com/fastmachinelearning/hls4ml/) project.
## Installation
The project is available on PyPI and can be installed with pip:
```bash
pip install da4ml
```
Note that `numba>=0.60.0` is required for the project to work, and the project does not work with `python<3.10`. If the project fails to compile, try upgrading `numba` and `llvmlite` to the latest versions.
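For example, to upgrade both packages:

```bash
pip install -U numba llvmlite
```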
## `hls4ml`
The major use of this project is through the `distributed_arithmetic` strategy in `hls4ml`:
```python
model_hls = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config={
        'Model': {
            ...
            'Strategy': 'distributed_arithmetic',
        },
        ...
    },
    ...
)
```
Currently, `Dense`, `Conv1D`, and `Conv2D` layers are supported for both the `io_parallel` and `io_stream` dataflows. However, note that distributed arithmetic implies `reuse_factor=1`, as the whole kernel is implemented in combinational logic.
### Notice
Currently, only the `da4ml-v2` branch of `hls4ml` supports the `distributed_arithmetic` strategy. The `da4ml-v2` branch is not yet merged into the `main` branch of `hls4ml`, so you need to install it from the GitHub repository.
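For example, the branch can be installed directly from GitHub with pip's standard VCS-URL syntax:

```bash
pip install "git+https://github.com/fastmachinelearning/hls4ml@da4ml-v2"
```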
## Direct Usage
If you want to use it directly, you can use the `da4ml.api.fn_from_kernel` function, which creates a Python function from a 2D kernel `float[n_in, n_out]`, along with its corresponding code. The function signature is:
```python
def fn_from_kernel(
    kernel: np.ndarray,
    signs: list[bool],
    bits: list[int],
    int_bits: list[int],
    symmetrics: list[bool],
    depths: list[int] | None = None,
    n_beams: int = 1,
    dc: int | None = None,
    n_inp_max: int = -1,
    n_out_max: int = -1,
    codegen_backend: PyCodegenBackend = PyCodegenBackend(),
    signed_balanced_reduction: bool = True,
) -> tuple[Callable[[list[T]], list[T]], str]:
"""Compile a CMVM operation, with the constant kernel, into a function with only accumulation/subtraction/shift operations.
Parameters
----------
kernel : np.ndarray
The kernel to compile. Must be of shape (n_inp, n_out).
signs : list[bool]
If the input is signed. Must be of length n_inp.
bits : list[int]
The bitwidth of the inputs. Must be of length n_inp.
int_bits : list[int]
The number of integer bits in the inputs (incl. sign bit!). Must be of length n_inp.
symmetrics : list[bool]
If the input is symmetricly quantized. Must be of length n_inp.
depths : list[int]|None, optional
The depth associated with each input. Must be of length n_inp. Defaults to [0]*n_inp.
n_beams : int, optional
Number of beams to use in beam search. Defaults to 1. (Currently disabled!)
dc : int | None, optional
Delay constraint. Not implemented yet. Defaults to None.
n_inp_max : int, optional
Number of inputs to process in one block. Defaults to -1 (no limit). Decrease to improve performance, but result will be less optimal.
n_out_max : int, optional
Number of outputs to process in one block. Defaults to -1 (no limit). Decrease to improve performance, but result will be less optimal.
codegen_backend : PyCodegenBackend, optional
The codegen backend to be used. Defaults to PyCodegenBackend().
signed_balanced_reduction : bool, optional
If the reduction tree should isolate the plus and minus terms. Set to False to improve latency. Defaults to True.
Returns
-------
tuple[Callable[[list[T]], list[T]], str]
fn : Callable[[list[T]], list[T]]
The compiled python function. It takes a list of inputs and returns a list of outputs with only accumulation/subtraction/powers of 2 operations.
fn_str : str
The code of the compiled function, depending on the codegen_backend used.
"""
```
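A minimal usage sketch based on the signature above (the argument values are illustrative only, and the printed output depends on the codegen backend):

```python
# Hedged usage sketch of da4ml.api.fn_from_kernel; values are illustrative.
import numpy as np
from da4ml.api import fn_from_kernel

# Constant kernel of shape (n_inp, n_out) = (3, 2)
kernel = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 2.0]])
n_inp = kernel.shape[0]

fn, fn_str = fn_from_kernel(
    kernel,
    signs=[True] * n_inp,        # signed inputs
    bits=[8] * n_inp,            # 8-bit inputs
    int_bits=[4] * n_inp,        # 4 integer bits (incl. sign bit)
    symmetrics=[False] * n_inp,  # not symmetrically quantized
)

print(fn([1, 2, 3]))  # evaluate the compiled shift-add function on one input vector
print(fn_str)         # generated code, as produced by the default PyCodegenBackend
```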