subpair


Namesubpair JSON
Version 0.1.1 PyPI version JSON
download
home_page
SummaryFast pairwise cosine distance calculation and numba accelerated evolutionary matrix subset extraction
upload_time2022-12-06 22:06:15
maintainer
docs_urlNone
author
requires_python>=3.9
licenseMIT
keywords numpy numba evolution pairwise-cosine
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            <p align="center">
    <img width="300" alt="Logo" src="https://user-images.githubusercontent.com/3115640/203899211-fff1c9d8-10cd-4a84-88b5-518a591cd1e5.jpeg">
    <p align="center">/sʌb.pɛɹ/</p>
</p>

# SubPair  ![CI](https://github.com/lfrati/subpair/actions/workflows/test.yml/badge.svg)

> "All you need is love and _evolutionary matrix subset extraction_." - J. Lennon

Pairwise cosine distance is great to easily compare many vectors. However, you can end up with a very sizeable distance matrix. What if you would like to find a small subset of that matrix? Let's search it by evolution.

Given N elements and their (N,N) pairwise distance matrix we would like to get the subset of S elements such that the sum of elements in the corresponding (S,S) submatrix is minimal. See example below.

```
  [0  1  2  3  4] indeces 
      i  j     k    
      │  │     │          i j k   = [1, 2, 4]
   0  1  6  4  1                   
i──1  0  3  1  7       i  0 3 7     
j──6  3  0  2  3  -->  j  3 0 3  -->  7 + 3 + 3 = 13 👎
   4  1  2  0  1       k  7 3 0
k──1  7  3  1  0

         i  j  k    
         │  │  │          i j k  = [2, 3, 4]   
   0  1  6  4  1                   
   1  0  3  1  7       i  0 2 3     
i──6  3  0  2  3  -->  j  2 0 1  -->  2 + 1 + 3 = 6 👍
j──4  1  2  0  1       k  3 1 0
k──1  7  3  1  0
```

All the possible subsets are ${N}\choose{S}$ and for N = 1024, S = 20 (like in the tests) we would have to check ${1024}\choose{20}$ $= 5.479 \times 10^{41}$ of them. 

A few too many. Instead we are going to use an evolutionary approach to search for it.

# Example usage

The usage is quite straight forward since there are only a couple of functions exported `pairwise_cosine` and `extract`.

```python
>>> import matplotlib.pyplot as plt
>>> from subpair import pairwise_cosine
>>>
>>> X = np.random.rand(N, K).astype(np.float32)
>>> distances = pairwise_cosine(X) # (N,N)
>>> ...
>>> best, stats = extract(distances, P=200, S=S, K=50, M=3, O=2, its=3_000)
100%|█████████████████████████████████| 3000/3000 [00:03<00:00, 817.42it/s]
>>> plt.plot(stats["fits"]); plt.show()
```
<p align="left">
    <img width="500" alt="Logo" src="https://user-images.githubusercontent.com/3115640/204059389-730df61a-4e87-4023-b7c7-038b329dc6a6.png">
    <p>(We have sprinkled a few negative numbers to see if the algorithm can find them)</p>
</p>
Where the options of extract are parameters for the evolutionary algorithm:

``` 
distances (int, int) : N vectors of length L
        P (int)      : population size
        S (int)      : desired subset size <- determines size of output
        K (int)      : number of parents (P-K children)
        M (int)      : number of mutations
        O (int)      : fraction of crossovers e.g. O=2 -> 1/2, O=10 -> 1/10, (bigger=faster)
```

# Note

This repo contains both numpy and numba/CUDA versions of the pairwise cosine distance matrix calculation. But numpy is already _blazingly_ fast so the cuda version is provided mostly for inspiration. Our numpy version is very similar to sklearn's [metrics.pairwise.cosine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html) but slightly faster. Sklearn's one has some extra nicities that our simplified version does not have.

```bash
> python flops.py # On Macbook pro M1 Max
N=513 K=2304 GOPs=1
  sklearn: 0.01s - 109.4 GFLOPS
    numpy: 0.00s - 162.4 GFLOPS

N=1027 K=2304 GOPs=2
  sklearn: 0.02s - 135.9 GFLOPS
    numpy: 0.01s - 192.4 GFLOPS

N=2055 K=2304 GOPs=10
  sklearn: 0.07s - 142.9 GFLOPS
    numpy: 0.06s - 166.0 GFLOPS

N=4111 K=2304 GOPs=39
  sklearn: 0.20s - 195.8 GFLOPS
    numpy: 0.16s - 248.6 GFLOPS

N=8223 K=2304 GOPs=156
  sklearn: 0.61s - 255.3 GFLOPS
    numpy: 0.54s - 289.5 GFLOPS

N=16447 K=2304 GOPs=623
  sklearn: 2.11s - 295.4 GFLOPS
    numpy: 1.79s - 347.9 GFLOPS
```

# Todo
- [ ] Add type info to minimize.py to allow for AOT compilation.

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "subpair",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "numpy,numba,evolution,pairwise-cosine",
    "author": "",
    "author_email": "lfrati <lfrati.github@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/09/c6/fbb4394423f7cadddc9b78d5ef6804bd071f3635a7d5af0469c23f128a29/subpair-0.1.1.tar.gz",
    "platform": null,
    "description": "<p align=\"center\">\n    <img width=\"300\" alt=\"Logo\" src=\"https://user-images.githubusercontent.com/3115640/203899211-fff1c9d8-10cd-4a84-88b5-518a591cd1e5.jpeg\">\n    <p align=\"center\">/s\u028cb.p\u025b\u0279/</p>\n</p>\n\n# SubPair  ![CI](https://github.com/lfrati/subpair/actions/workflows/test.yml/badge.svg)\n\n> \"All you need is love and _evolutionary matrix subset extraction_.\" - J. Lennon\n\nPairwise cosine distance is great to easily compare many vectors. However, you can end up with a very sizeable distance matrix. What if you would like to find a small subset of that matrix? Let's search it by evolution.\n\nGiven N elements and their (N,N) pairwise distance matrix we would like to get the subset of S elements such that the sum of elements in the corresponding (S,S) submatrix is minimal. See example below.\n\n```\n  [0  1  2  3  4] indeces \n      i  j     k    \n      \u2502  \u2502     \u2502          i j k   = [1, 2, 4]\n   0  1  6  4  1                   \ni\u2500\u25001  0  3  1  7       i  0 3 7     \nj\u2500\u25006  3  0  2  3  -->  j  3 0 3  -->  7 + 3 + 3 = 13 \ud83d\udc4e\n   4  1  2  0  1       k  7 3 0\nk\u2500\u25001  7  3  1  0\n\n         i  j  k    \n         \u2502  \u2502  \u2502          i j k  = [2, 3, 4]   \n   0  1  6  4  1                   \n   1  0  3  1  7       i  0 2 3     \ni\u2500\u25006  3  0  2  3  -->  j  2 0 1  -->  2 + 1 + 3 = 6 \ud83d\udc4d\nj\u2500\u25004  1  2  0  1       k  3 1 0\nk\u2500\u25001  7  3  1  0\n```\n\nAll the possible subsets are ${N}\\choose{S}$ and for N = 1024, S = 20 (like in the tests) we would have to check ${1024}\\choose{20}$ $= 5.479 \\times 10^{41}$ of them. \n\nA few too many. Instead we are going to use an evolutionary approach to search for it.\n\n# Example usage\n\nThe usage is quite straight forward since there are only a couple of functions exported `pairwise_cosine` and `extract`.\n\n```python\n>>> import matplotlib.pyplot as plt\n>>> from subpair import pairwise_cosine\n>>>\n>>> X = np.random.rand(N, K).astype(np.float32)\n>>> distances = pairwise_cosine(X) # (N,N)\n>>> ...\n>>> best, stats = extract(distances, P=200, S=S, K=50, M=3, O=2, its=3_000)\n100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 3000/3000 [00:03<00:00, 817.42it/s]\n>>> plt.plot(stats[\"fits\"]); plt.show()\n```\n<p align=\"left\">\n    <img width=\"500\" alt=\"Logo\" src=\"https://user-images.githubusercontent.com/3115640/204059389-730df61a-4e87-4023-b7c7-038b329dc6a6.png\">\n    <p>(We have sprinkled a few negative numbers to see if the algorithm can find them)</p>\n</p>\nWhere the options of extract are parameters for the evolutionary algorithm:\n\n``` \ndistances (int, int) : N vectors of length L\n        P (int)      : population size\n        S (int)      : desired subset size <- determines size of output\n        K (int)      : number of parents (P-K children)\n        M (int)      : number of mutations\n        O (int)      : fraction of crossovers e.g. O=2 -> 1/2, O=10 -> 1/10, (bigger=faster)\n```\n\n# Note\n\nThis repo contains both numpy and numba/CUDA versions of the pairwise cosine distance matrix calculation. But numpy is already _blazingly_ fast so the cuda version is provided mostly for inspiration. Our numpy version is very similar to sklearn's [metrics.pairwise.cosine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html) but slightly faster. Sklearn's one has some extra nicities that our simplified version does not have.\n\n```bash\n> python flops.py # On Macbook pro M1 Max\nN=513 K=2304 GOPs=1\n  sklearn: 0.01s - 109.4 GFLOPS\n    numpy: 0.00s - 162.4 GFLOPS\n\nN=1027 K=2304 GOPs=2\n  sklearn: 0.02s - 135.9 GFLOPS\n    numpy: 0.01s - 192.4 GFLOPS\n\nN=2055 K=2304 GOPs=10\n  sklearn: 0.07s - 142.9 GFLOPS\n    numpy: 0.06s - 166.0 GFLOPS\n\nN=4111 K=2304 GOPs=39\n  sklearn: 0.20s - 195.8 GFLOPS\n    numpy: 0.16s - 248.6 GFLOPS\n\nN=8223 K=2304 GOPs=156\n  sklearn: 0.61s - 255.3 GFLOPS\n    numpy: 0.54s - 289.5 GFLOPS\n\nN=16447 K=2304 GOPs=623\n  sklearn: 2.11s - 295.4 GFLOPS\n    numpy: 1.79s - 347.9 GFLOPS\n```\n\n# Todo\n- [ ] Add type info to minimize.py to allow for AOT compilation.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Fast pairwise cosine distance calculation and numba accelerated evolutionary matrix subset extraction",
    "version": "0.1.1",
    "split_keywords": [
        "numpy",
        "numba",
        "evolution",
        "pairwise-cosine"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "2794e4965a311286f3ea3f446d96aa23",
                "sha256": "a55ccb182fa3841589d9eed3edd3a746c42f107957898e55f84bf51abdf9f8ea"
            },
            "downloads": -1,
            "filename": "subpair-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2794e4965a311286f3ea3f446d96aa23",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 7628,
            "upload_time": "2022-12-06T22:06:12",
            "upload_time_iso_8601": "2022-12-06T22:06:12.676649Z",
            "url": "https://files.pythonhosted.org/packages/fb/ae/dd312e3233ac0137115fe0cf0372da35b43eac8db9fd60df5f1167e66aa2/subpair-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "f5e065442b2690dc08e9a3c6ef269172",
                "sha256": "e3b51a8b9423358bfac561301b96e4ef167c1c74bd65cb7dfd0a69bb737a8600"
            },
            "downloads": -1,
            "filename": "subpair-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f5e065442b2690dc08e9a3c6ef269172",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 7632,
            "upload_time": "2022-12-06T22:06:15",
            "upload_time_iso_8601": "2022-12-06T22:06:15.423712Z",
            "url": "https://files.pythonhosted.org/packages/09/c6/fbb4394423f7cadddc9b78d5ef6804bd071f3635a7d5af0469c23f128a29/subpair-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-06 22:06:15",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "subpair"
}
        
Elapsed time: 0.01598s