[![PyPI version](https://badge.fury.io/py/torch-hrp.svg)](https://badge.fury.io/py/torch-hrp)
[![PyPi downloads](https://img.shields.io/pypi/dm/torch-hrp)](https://img.shields.io/pypi/dm/torch-hrp)
# torch-hrp
Hashed Random Projection layer for PyTorch.
## Usage
<a href="https://github.com/ulf1/torch-hrp/blob/main/demo/Hashed%20Random%20Projections.ipynb">Hashed Random Projections (HRP), binary representations, encoding/decoding for storage</a> (notebook)
### Generate an HRP layer with a new hyperplane
The random projection, i.e. the hyperplane, is initialized randomly.
Setting the initial state of the PRNG (`random_state`, default: 42) ensures reproducibility.
```py
import torch_hrp as thrp
import torch
BATCH_SIZE = 32
NUM_FEATURES = 64
OUTPUT_SIZE = 1024
# demo inputs
inputs = torch.randn(size=(BATCH_SIZE, NUM_FEATURES))
# instantiate layer
layer = thrp.HashedRandomProjection(
output_size=OUTPUT_SIZE,
input_size=NUM_FEATURES,
random_state=42 # Default: 42
)
# run it
outputs = layer(inputs)
assert outputs.shape == (BATCH_SIZE, OUTPUT_SIZE)
```
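Conceptually, hashed random projection computes the dot product of each input with a random hyperplane and keeps only the sign bit. A minimal NumPy sketch of that idea (illustrative only, not the library's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((32, 64))    # batch of input vectors
W = rng.standard_normal((64, 1024))  # random hyperplane(s)

# project onto the hyperplanes and keep only the sign:
# 1 if the dot product is non-negative, else 0
hashed = (X @ W >= 0).astype(np.uint8)

assert hashed.shape == (32, 1024)
```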
### Instantiate an HRP layer with a given hyperplane
```py
import torch_hrp as thrp
import torch
BATCH_SIZE = 32
NUM_FEATURES = 64
OUTPUT_SIZE = 1024
# demo inputs
inputs = torch.randn(size=(BATCH_SIZE, NUM_FEATURES))
# use an existing hyperplane
myhyperplane = torch.randn(size=(NUM_FEATURES, OUTPUT_SIZE))
# init layer
layer = thrp.HashedRandomProjection(hyperplane=myhyperplane)
# run it
outputs = layer(inputs)
```
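The binary outputs lend themselves to compact storage, since eight 0/1 values fit into one byte. A hedged sketch of a lossless round-trip with NumPy's `packbits`/`unpackbits` (the linked notebook covers encoding/decoding for storage in more detail):

```python
import numpy as np

# stand-in for binary HRP outputs: 0/1 integers, 1024 bits per row
bits = np.random.default_rng(0).integers(0, 2, size=(32, 1024)).astype(np.uint8)

packed = np.packbits(bits, axis=1)        # 8 bits per byte -> 128 bytes per row
restored = np.unpackbits(packed, axis=1)  # lossless round-trip

assert packed.shape == (32, 128)
assert np.array_equal(bits, restored)
```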
### Multiprocessing on GPU-Server
The `HashedRandomProjection` layer provides methods for multiprocessed inference on large numbers of examples (e.g. millions). These methods were adopted from the [SentenceTransformer code](https://github.com/UKPLab/sentence-transformers/blob/d928410803bb90f555926d145ee7ad3bd1373a83/sentence_transformers/SentenceTransformer.py#L206).
The following script can be used to figure out how many examples fit into RAM (e.g. 20 million),
and how big the chunk of examples for each process can be to fit into GPU memory (e.g. 4.5 million).
```py
import torch
import torch_hrp as thrp
model_hrp = thrp.HashedRandomProjection(
output_size=1024,
input_size=768, # the output dimension of the upstream embedding/transformer model
random_state=42
)
# Requirements: 2x GPUs with 80 GB each; approx. 200 GB RAM
if __name__ == '__main__': # multiprocessing spawning requires main
x = torch.rand(int(20e6), 768) # 20 Mio examples
pool = model_hrp.start_pool()
hashed = model_hrp.infer(x, pool, chunk_size=int(45e5)) # chunks of 4.5 Mio examples
model_hrp.stop_pool(pool)
torch.cuda.empty_cache()
```
see <a href="https://github.com/ulf1/torch-hrp/blob/main/demo/multiprocessing-on-gpu-server.py">demo/multiprocessing-on-gpu-server.py</a>
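Without a multi-GPU pool, the same memory-bounding idea can be sketched as a plain chunked loop (illustrative only; `fn` stands in for any per-chunk inference call such as the HRP layer):

```python
import numpy as np

def infer_chunked(x, fn, chunk_size):
    """Apply fn to x chunk by chunk so only one chunk is processed at a time."""
    outs = [fn(x[i:i + chunk_size]) for i in range(0, len(x), chunk_size)]
    return np.concatenate(outs, axis=0)

# toy stand-in for an HRP forward pass on a small hyperplane
W = np.ones((8, 16))
x = np.random.default_rng(1).standard_normal((10_000, 8))
hashed = infer_chunked(x, lambda c: (c @ W >= 0).astype(np.uint8), chunk_size=2_500)
assert hashed.shape == (10_000, 16)
```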
## Appendix
### Installation
The `torch-hrp` [git repo](http://github.com/ulf1/torch-hrp) is available as a [PyPI package](https://pypi.org/project/torch-hrp):
```sh
# from PyPI
pip install torch-hrp
# or directly from the git repo
pip install git+ssh://git@github.com/ulf1/torch-hrp.git
```
### Install a virtual environment (CPU)
```sh
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
```
(If your git repo is stored in a folder whose path contains whitespace, don't use the subfolder `.venv`; use an absolute path without whitespace instead.)
### Install with conda (GPU)
```sh
conda install -y pip
conda create -y --name gpu-venv-torch-hrp-dev python=3.9 pip
conda activate gpu-venv-torch-hrp-dev
conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
# install other packages
pip install -e .
# pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
```
### Python commands
* Jupyter for the examples: `jupyter lab`
* Check syntax: `flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')`
* Run Unit Tests: `PYTHONPATH=. pytest`
Publish to PyPI:
```sh
# pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
```
### Clean up
```sh
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
```
### Support
Please [open an issue](https://github.com/ulf1/torch-hrp/issues/new) for support.
### Contributing
Please contribute using [Github Flow](https://guides.github.com/introduction/flow/). Create a branch, add commits, and [open a pull request](https://github.com/ulf1/torch-hrp/compare/).
### Acknowledgements
The "Evidence" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - [433249742](https://gepris.dfg.de/gepris/projekt/433249742) (GU 798/27-1; GE 1119/11-1).
### Maintenance
- until 31 Aug 2023 (v0.1.1), the code repository was maintained within the DFG project [433249742](https://gepris.dfg.de/gepris/projekt/433249742)
- since 01 Sep 2023 (v0.2.0), the code repository has been maintained by [@ulf1](https://github.com/ulf1).
### Citation
Please cite the arXiv Preprint when using this software for any purpose.
```
@misc{hamster2023rediscovering,
title={Rediscovering Hashed Random Projections for Efficient Quantization of Contextualized Sentence Embeddings},
author={Ulf A. Hamster and Ji-Ung Lee and Alexander Geyken and Iryna Gurevych},
year={2023},
eprint={2304.02481},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```