# PyCudaHLL
This is a GPU-accelerated implementation of HyperLogLog using the CuPy library. It was created for the class "Algorithmic Techniques for Taming Big Data" at the department of Computing and Data Science at Boston University.
## Using the Code
To use this code, you can either get the library from PyPI or build it from source.
### Get from PyPI (Recommended)
- Install using pip: `pip install pycudahll`
- In your code, import the library: `from pycudahll.CudaHLL import CudaHLL`
### Building from Source
- Clone the repository
- Install dependencies: `poetry install`
- In your code, import the library: `from pycudahll.CudaHLL import CudaHLL`
- See `test.py` for examples. (Note: `test.py` is most likely broken, but it should still give you an idea of how to use the library.)
## API
The main class of the library is CudaHLL. It can be imported in your code with:
```python
from pycudahll.CudaHLL import CudaHLL
```
The `CudaHLL` module also provides a helper function, `hashDataGPUHLL`, which hashes data for use with the main class:
```python
from pycudahll.CudaHLL import hashDataGPUHLL
```
A short example of how to use the library is as follows:
```python
from pycudahll.CudaHLL import CudaHLL, hashDataGPUHLL

# Read the comma-separated items from disk
with open('data.csv', 'r') as file:
    data = file.read().split(',')

# Hash the items so they can be added to the sketch
hashedData = hashDataGPUHLL(data)

threads = 64
p = 14
cudaDevice = 0  # optional
roundThreads = True  # optional
hll = CudaHLL(p, threads, cudaDevice, roundThreads)

hll.add(hashedData)
print(hll.card())  # print unrounded cardinality estimate
print(len(hll))  # print rounded cardinality estimate
```
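
When choosing `p`, it can help to know what accuracy to expect. The sketch below assumes that `p` in `CudaHLL` is the usual HyperLogLog precision, so that the sketch uses `m = 2**p` registers; the README does not state this explicitly, so treat it as an assumption. Under that assumption, the textbook relative standard error is roughly `1.04 / sqrt(m)`. The helper names `hll_register_count` and `hll_standard_error` are hypothetical and are not part of the library.

```python
import math


def hll_register_count(p: int) -> int:
    """Number of registers m = 2**p (assumes p is the HLL precision)."""
    return 1 << p


def hll_standard_error(p: int) -> float:
    """Textbook HyperLogLog relative standard error, about 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(hll_register_count(p))


p = 14
m = hll_register_count(p)
err = hll_standard_error(p)
print(f"p={p}: {m} registers, standard error ~{err:.2%}")
```

A larger `p` lowers the expected error at the cost of more registers, so `p` is the main accuracy/memory trade-off to tune.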
## Test Data
The test data is the text of Shakespeare's plays, obtained from https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt. The original text is in `t8.shakespeare.txt`, and the modified text is in `shakespeare.csv`.
- Total number of items: 899300
- Exact cardinality: 34065
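
To compare a HyperLogLog estimate against figures like the ones above, an exact count can be computed on the CPU with a plain Python `set`. This is a minimal sketch that assumes the items are comma-separated, as in the API example; the helpers `load_items`, `exact_cardinality`, and `relative_error` are hypothetical names and are not part of the library.

```python
def load_items(path):
    """Read a file of comma-separated items into a list (assumed format)."""
    with open(path, 'r') as file:
        return file.read().split(',')


def exact_cardinality(items):
    """Exact number of distinct items, computed with a set."""
    return len(set(items))


def relative_error(estimate, exact):
    """Relative error of an estimate against the exact count."""
    return abs(estimate - exact) / exact


# Small in-memory sample standing in for load_items('shakespeare.csv')
sample = "to,be,or,not,to,be".split(',')
print(len(sample))                # total number of items: 6
print(exact_cardinality(sample))  # exact cardinality: 4
```

Passing the value returned by `hll.card()` as `estimate` to `relative_error` then gives a direct measure of how far the GPU estimate is from the exact count.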
Raw data
{
"_id": null,
"home_page": "https://github.com/gabemgem/PyCudaHLL",
"name": "pycudahll",
"maintainer": "",
"docs_url": null,
"requires_python": ">3.10,<3.12",
"maintainer_email": "",
"keywords": "cupy,gpu,hll,hyperloglog",
"author": "Gabe Maayan",
"author_email": "gabemgem@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/07/01/d2f78fdd3ddd61fc9d0c111a8f6d091a1d04ebf50257c7fe783cd2653946/pycudahll-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A GPU implementation of HyperLogLog",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/gabemgem/PyCudaHLL",
"Repository": "https://github.com/gabemgem/PyCudaHLL"
},
"split_keywords": [
"cupy",
"gpu",
"hll",
"hyperloglog"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "77b9522efd40eae19e4d183af618dcfc23dbf59b5dcd57a821b9d090b19d0aeb",
"md5": "127cf33dec363f2a9282a24a345a867c",
"sha256": "6a367b7c5ae2071907fda59f07014baf63039b541b97fdf6a9bd22a5f9d11456"
},
"downloads": -1,
"filename": "pycudahll-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "127cf33dec363f2a9282a24a345a867c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">3.10,<3.12",
"size": 37892,
"upload_time": "2023-05-04T01:39:24",
"upload_time_iso_8601": "2023-05-04T01:39:24.756413Z",
"url": "https://files.pythonhosted.org/packages/77/b9/522efd40eae19e4d183af618dcfc23dbf59b5dcd57a821b9d090b19d0aeb/pycudahll-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0701d2f78fdd3ddd61fc9d0c111a8f6d091a1d04ebf50257c7fe783cd2653946",
"md5": "4d9862ac16c35ca4955b312be59c8753",
"sha256": "5806b9d6557a7b816f07f1750dd21d10f816c1bddefaf990cd3f5ffe0642ffd2"
},
"downloads": -1,
"filename": "pycudahll-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "4d9862ac16c35ca4955b312be59c8753",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">3.10,<3.12",
"size": 37473,
"upload_time": "2023-05-04T01:39:26",
"upload_time_iso_8601": "2023-05-04T01:39:26.789715Z",
"url": "https://files.pythonhosted.org/packages/07/01/d2f78fdd3ddd61fc9d0c111a8f6d091a1d04ebf50257c7fe783cd2653946/pycudahll-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-04 01:39:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "gabemgem",
"github_project": "PyCudaHLL",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pycudahll"
}