Name | subpair JSON |
Version |
0.1.1
JSON |
| download |
home_page | |
Summary | Fast pairwise cosine distance calculation and numba accelerated evolutionary matrix subset extraction |
upload_time | 2022-12-06 22:06:15 |
maintainer | |
docs_url | None |
author | |
requires_python | >=3.9 |
license | MIT |
keywords |
numpy
numba
evolution
pairwise-cosine
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
<p align="center">
<img width="300" alt="Logo" src="https://user-images.githubusercontent.com/3115640/203899211-fff1c9d8-10cd-4a84-88b5-518a591cd1e5.jpeg">
<p align="center">/sʌb.pɛɹ/</p>
</p>
# SubPair ![CI](https://github.com/lfrati/subpair/actions/workflows/test.yml/badge.svg)
> "All you need is love and _evolutionary matrix subset extraction_." - J. Lennon
Pairwise cosine distance is great to easily compare many vectors. However, you can end up with a very sizeable distance matrix. What if you would like to find a small subset of that matrix? Let's search it by evolution.
Given N elements and their (N,N) pairwise distance matrix we would like to get the subset of S elements such that the sum of elements in the corresponding (S,S) submatrix is minimal. See example below.
```
[0 1 2 3 4] indeces
i j k
│ │ │ i j k = [1, 2, 4]
0 1 6 4 1
i──1 0 3 1 7 i 0 3 7
j──6 3 0 2 3 --> j 3 0 3 --> 7 + 3 + 3 = 13 👎
4 1 2 0 1 k 7 3 0
k──1 7 3 1 0
i j k
│ │ │ i j k = [2, 3, 4]
0 1 6 4 1
1 0 3 1 7 i 0 2 3
i──6 3 0 2 3 --> j 2 0 1 --> 2 + 1 + 3 = 6 👍
j──4 1 2 0 1 k 3 1 0
k──1 7 3 1 0
```
All the possible subsets are ${N}\choose{S}$ and for N = 1024, S = 20 (like in the tests) we would have to check ${1024}\choose{20}$ $= 5.479 \times 10^{41}$ of them.
A few too many. Instead we are going to use an evolutionary approach to search for it.
# Example usage
The usage is quite straight forward since there are only a couple of functions exported `pairwise_cosine` and `extract`.
```python
>>> import matplotlib.pyplot as plt
>>> from subpair import pairwise_cosine
>>>
>>> X = np.random.rand(N, K).astype(np.float32)
>>> distances = pairwise_cosine(X) # (N,N)
>>> ...
>>> best, stats = extract(distances, P=200, S=S, K=50, M=3, O=2, its=3_000)
100%|█████████████████████████████████| 3000/3000 [00:03<00:00, 817.42it/s]
>>> plt.plot(stats["fits"]); plt.show()
```
<p align="left">
<img width="500" alt="Logo" src="https://user-images.githubusercontent.com/3115640/204059389-730df61a-4e87-4023-b7c7-038b329dc6a6.png">
<p>(We have sprinkled a few negative numbers to see if the algorithm can find them)</p>
</p>
Where the options of extract are parameters for the evolutionary algorithm:
```
distances (int, int) : N vectors of length L
P (int) : population size
S (int) : desired subset size <- determines size of output
K (int) : number of parents (P-K children)
M (int) : number of mutations
O (int) : fraction of crossovers e.g. O=2 -> 1/2, O=10 -> 1/10, (bigger=faster)
```
# Note
This repo contains both numpy and numba/CUDA versions of the pairwise cosine distance matrix calculation. But numpy is already _blazingly_ fast so the cuda version is provided mostly for inspiration. Our numpy version is very similar to sklearn's [metrics.pairwise.cosine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html) but slightly faster. Sklearn's one has some extra nicities that our simplified version does not have.
```bash
> python flops.py # On Macbook pro M1 Max
N=513 K=2304 GOPs=1
sklearn: 0.01s - 109.4 GFLOPS
numpy: 0.00s - 162.4 GFLOPS
N=1027 K=2304 GOPs=2
sklearn: 0.02s - 135.9 GFLOPS
numpy: 0.01s - 192.4 GFLOPS
N=2055 K=2304 GOPs=10
sklearn: 0.07s - 142.9 GFLOPS
numpy: 0.06s - 166.0 GFLOPS
N=4111 K=2304 GOPs=39
sklearn: 0.20s - 195.8 GFLOPS
numpy: 0.16s - 248.6 GFLOPS
N=8223 K=2304 GOPs=156
sklearn: 0.61s - 255.3 GFLOPS
numpy: 0.54s - 289.5 GFLOPS
N=16447 K=2304 GOPs=623
sklearn: 2.11s - 295.4 GFLOPS
numpy: 1.79s - 347.9 GFLOPS
```
# Todo
- [ ] Add type info to minimize.py to allow for AOT compilation.
Raw data
{
"_id": null,
"home_page": "",
"name": "subpair",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "",
"keywords": "numpy,numba,evolution,pairwise-cosine",
"author": "",
"author_email": "lfrati <lfrati.github@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/09/c6/fbb4394423f7cadddc9b78d5ef6804bd071f3635a7d5af0469c23f128a29/subpair-0.1.1.tar.gz",
"platform": null,
"description": "<p align=\"center\">\n <img width=\"300\" alt=\"Logo\" src=\"https://user-images.githubusercontent.com/3115640/203899211-fff1c9d8-10cd-4a84-88b5-518a591cd1e5.jpeg\">\n <p align=\"center\">/s\u028cb.p\u025b\u0279/</p>\n</p>\n\n# SubPair ![CI](https://github.com/lfrati/subpair/actions/workflows/test.yml/badge.svg)\n\n> \"All you need is love and _evolutionary matrix subset extraction_.\" - J. Lennon\n\nPairwise cosine distance is great to easily compare many vectors. However, you can end up with a very sizeable distance matrix. What if you would like to find a small subset of that matrix? Let's search it by evolution.\n\nGiven N elements and their (N,N) pairwise distance matrix we would like to get the subset of S elements such that the sum of elements in the corresponding (S,S) submatrix is minimal. See example below.\n\n```\n [0 1 2 3 4] indeces \n i j k \n \u2502 \u2502 \u2502 i j k = [1, 2, 4]\n 0 1 6 4 1 \ni\u2500\u25001 0 3 1 7 i 0 3 7 \nj\u2500\u25006 3 0 2 3 --> j 3 0 3 --> 7 + 3 + 3 = 13 \ud83d\udc4e\n 4 1 2 0 1 k 7 3 0\nk\u2500\u25001 7 3 1 0\n\n i j k \n \u2502 \u2502 \u2502 i j k = [2, 3, 4] \n 0 1 6 4 1 \n 1 0 3 1 7 i 0 2 3 \ni\u2500\u25006 3 0 2 3 --> j 2 0 1 --> 2 + 1 + 3 = 6 \ud83d\udc4d\nj\u2500\u25004 1 2 0 1 k 3 1 0\nk\u2500\u25001 7 3 1 0\n```\n\nAll the possible subsets are ${N}\\choose{S}$ and for N = 1024, S = 20 (like in the tests) we would have to check ${1024}\\choose{20}$ $= 5.479 \\times 10^{41}$ of them. \n\nA few too many. Instead we are going to use an evolutionary approach to search for it.\n\n# Example usage\n\nThe usage is quite straight forward since there are only a couple of functions exported `pairwise_cosine` and `extract`.\n\n```python\n>>> import matplotlib.pyplot as plt\n>>> from subpair import pairwise_cosine\n>>>\n>>> X = np.random.rand(N, K).astype(np.float32)\n>>> distances = pairwise_cosine(X) # (N,N)\n>>> ...\n>>> best, stats = extract(distances, P=200, S=S, K=50, M=3, O=2, its=3_000)\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 3000/3000 [00:03<00:00, 817.42it/s]\n>>> plt.plot(stats[\"fits\"]); plt.show()\n```\n<p align=\"left\">\n <img width=\"500\" alt=\"Logo\" src=\"https://user-images.githubusercontent.com/3115640/204059389-730df61a-4e87-4023-b7c7-038b329dc6a6.png\">\n <p>(We have sprinkled a few negative numbers to see if the algorithm can find them)</p>\n</p>\nWhere the options of extract are parameters for the evolutionary algorithm:\n\n``` \ndistances (int, int) : N vectors of length L\n P (int) : population size\n S (int) : desired subset size <- determines size of output\n K (int) : number of parents (P-K children)\n M (int) : number of mutations\n O (int) : fraction of crossovers e.g. O=2 -> 1/2, O=10 -> 1/10, (bigger=faster)\n```\n\n# Note\n\nThis repo contains both numpy and numba/CUDA versions of the pairwise cosine distance matrix calculation. But numpy is already _blazingly_ fast so the cuda version is provided mostly for inspiration. Our numpy version is very similar to sklearn's [metrics.pairwise.cosine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html) but slightly faster. Sklearn's one has some extra nicities that our simplified version does not have.\n\n```bash\n> python flops.py # On Macbook pro M1 Max\nN=513 K=2304 GOPs=1\n sklearn: 0.01s - 109.4 GFLOPS\n numpy: 0.00s - 162.4 GFLOPS\n\nN=1027 K=2304 GOPs=2\n sklearn: 0.02s - 135.9 GFLOPS\n numpy: 0.01s - 192.4 GFLOPS\n\nN=2055 K=2304 GOPs=10\n sklearn: 0.07s - 142.9 GFLOPS\n numpy: 0.06s - 166.0 GFLOPS\n\nN=4111 K=2304 GOPs=39\n sklearn: 0.20s - 195.8 GFLOPS\n numpy: 0.16s - 248.6 GFLOPS\n\nN=8223 K=2304 GOPs=156\n sklearn: 0.61s - 255.3 GFLOPS\n numpy: 0.54s - 289.5 GFLOPS\n\nN=16447 K=2304 GOPs=623\n sklearn: 2.11s - 295.4 GFLOPS\n numpy: 1.79s - 347.9 GFLOPS\n```\n\n# Todo\n- [ ] Add type info to minimize.py to allow for AOT compilation.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Fast pairwise cosine distance calculation and numba accelerated evolutionary matrix subset extraction",
"version": "0.1.1",
"split_keywords": [
"numpy",
"numba",
"evolution",
"pairwise-cosine"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "2794e4965a311286f3ea3f446d96aa23",
"sha256": "a55ccb182fa3841589d9eed3edd3a746c42f107957898e55f84bf51abdf9f8ea"
},
"downloads": -1,
"filename": "subpair-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2794e4965a311286f3ea3f446d96aa23",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 7628,
"upload_time": "2022-12-06T22:06:12",
"upload_time_iso_8601": "2022-12-06T22:06:12.676649Z",
"url": "https://files.pythonhosted.org/packages/fb/ae/dd312e3233ac0137115fe0cf0372da35b43eac8db9fd60df5f1167e66aa2/subpair-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "f5e065442b2690dc08e9a3c6ef269172",
"sha256": "e3b51a8b9423358bfac561301b96e4ef167c1c74bd65cb7dfd0a69bb737a8600"
},
"downloads": -1,
"filename": "subpair-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "f5e065442b2690dc08e9a3c6ef269172",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 7632,
"upload_time": "2022-12-06T22:06:15",
"upload_time_iso_8601": "2022-12-06T22:06:15.423712Z",
"url": "https://files.pythonhosted.org/packages/09/c6/fbb4394423f7cadddc9b78d5ef6804bd071f3635a7d5af0469c23f128a29/subpair-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-06 22:06:15",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "subpair"
}