# FastIVF
Efficient implementation of an IVF index with numpy and numba
## Installation
* Install numpy and numba from conda to use the Intel MKL libraries for linear algebra operations
* To install the package from source, run `pip install .`
* To install from PyPI, run `pip install fast-ivf`
* You may need to install tensorflow>=2.13; see `CompressedFastIVF` for details
* Code tested with python==3.11
* See the notebook [test-index](notebooks/test-index.ipynb) for index usage examples
* See the notebook [test-kmeans](notebooks/test-kmeans.ipynb) for k-means usage examples
## Features / limitations
* This is experimental code that relies heavily on numba and numpy and may contain bugs
* IVF centroids are estimated with a custom mini-batch k-means implementation
  * `MiniBatchKMeans` estimates the centroids of the standard inverted index
  * `SubvectorsMiniBatchKMeans` estimates the centroids of the product quantization index
* The k-means implementations support only l2 or cosine distances
* All indices currently support only cosine distance
## Results on custom benchmark data
* Resources restricted to `OMP_NUM_THREADS=MKL_NUM_THREADS=OPENBLAS_NUM_THREADS=12`, which in our case kept the CPU at 100% utilization for both fast-ivf and faiss
* Train vectors: ~900k internal vectors of dim=1024, normalized to unit length
* Test vectors: 40k vectors from the same source
* Hyperparameters: nprobe=10, ratio_threshold=0.5, no re-scoring for the approximated indices (for mini-batch k-means we use the repository defaults);
for CompressedFastIVF we use compression_ndim=128, which gives an 8x compression ratio (1024/128)
* We measure recall@10 as a function that checks whether `exact_i` is in `top_indices[:10]` for each test query, then
average the results over all test vectors (see the sketch below the results table)
* For faiss we used matching parameters for nlist, m, nbits, etc.
* Reported time is the average of 5 runs divided by the 40k queries, i.e. the time per single query
* Since we use numba internally, each Fast-Index is initialized with a warmup call to compile the code
* Note: CompressedFastIVF requires training a small neural network to compress embeddings to a lower dimensionality, which increases the index build time
* For both libraries, each search() call processed all 40k test vectors at once, to fully utilize vectorization
| Index | Recall@10 | Query Time (ms) | Params |
|-------------------|-----------|-----------------|------------------------------------------------------------------------------------------|
| FastIVF | 0.964 | 0.100 | `nlist=1024, nprobe=10, ratio_threshold=0.5` |
| Faiss IVF | 0.968 | 1.000 | `nlist=1024, nprobe=10` |
| FastIVFPQ | 0.802 | 0.100 | `nlist=1024, nprobe=10, ratio_threshold=0.5, pq_num_subvectors=32, pq_num_centroids=128` |
| Faiss IVFPQ | 0.864 | 0.220 | `nlist=1024, nprobe=10, m=32, nbits=7` |
| CompressedFastIVF | 0.933 | 0.050 | `nlist=1024, nprobe=10, ratio_threshold=0.5, compression_ndim=128` |
| CompressedFastIVF | 0.889 | 0.040 | `nlist=1024, nprobe=10, ratio_threshold=0.5, compression_ndim=64` |
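A minimal sketch of the recall@10 computation described above (the function and variable names are illustrative, not part of the library):

```python
import numpy as np

def recall_at_10(exact_top1: np.ndarray, approx_indices: np.ndarray) -> float:
    # exact_top1: shape (num_queries,), the exact nearest neighbor of each query
    # approx_indices: shape (num_queries, k) with k >= 10, returned by the approximate index
    hits = [exact_i in top_indices[:10] for exact_i, top_indices in zip(exact_top1, approx_indices)]
    return float(np.mean(hits))
```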
## Custom mini batch k-means implementation
An efficient mini-batch k-means implementation with numba and numpy:
```python
from fast_ivf.kmeans import MiniBatchKMeans
import numpy as np

kmeans = MiniBatchKMeans(num_centroids=16, batch_size=32, metric="l2")
data = np.random.rand(5000, 64)
kmeans.train(data)  # fit centroids on mini-batches of the data
kmeans.add(data)  # add vectors to the trained k-means
labels = kmeans.predict(data)  # nearest-centroid label for each vector
```
An efficient mini-batch k-means implementation for training product quantization centroids:
```python
from fast_ivf.kmeans import SubvectorsMiniBatchKMeans
import numpy as np

# num_subvectors=8 splits each 64-dim vector into 8 subvectors of 8 dims,
# and 16 centroids are fit per subspace
kmeans = SubvectorsMiniBatchKMeans(num_centroids=16, num_subvectors=8, batch_size=32, metric="l2")
data = np.random.rand(5000, 64)
kmeans.train(data)
kmeans.add(data)
labels = kmeans.predict(data)
```
## FastIVF
Similar to `faiss.IndexIVFFlat(faiss.IndexFlatIP(d), d, nlist, faiss.METRIC_INNER_PRODUCT)`
```python
from fast_ivf import FastIVF
from fast_ivf.core import normalize
import numpy as np

nlist = 1024
train_embeddings = normalize(np.random.rand(10000, 512).astype(np.float32))
index = FastIVF(512, nlist=nlist)
index.train(train_embeddings)

index.nprobe = 10
# greedily skip Voronoi cells whose score is smaller than 0.5 of the largest score;
# higher values lead to faster but less accurate search
index.ratio_threshold = 0.5

test_embeddings = normalize(np.random.rand(100, 512).astype(np.float32))
distances, indices = index.search(test_embeddings, k=100)
```
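For comparison, a hedged sketch of the roughly equivalent faiss setup (assumes faiss is installed and reuses `nlist`, `train_embeddings`, and `test_embeddings` from the example above):

```python
import faiss

d = 512
quantizer = faiss.IndexFlatIP(d)  # inner product == cosine for unit-normalized vectors
faiss_index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
faiss_index.train(train_embeddings)
faiss_index.add(train_embeddings)
faiss_index.nprobe = 10
distances, indices = faiss_index.search(test_embeddings, 100)
```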
## FastIVFPQ
Similar to `faiss_index = faiss.IndexIVFPQ(faiss.IndexFlatIP(d), d, nlist, m, n_bits)`
```python
from fast_ivf import FastIVFPQ

nlist = 1024
# pq_num_centroids = 2 ** n_bits
# pq_num_subvectors = m
index = FastIVFPQ(512, nlist=nlist, pq_num_centroids=64, pq_num_subvectors=32)
index.train(train_embeddings)
index.nprobe = 10
index.ratio_threshold = 0.5
distances, indices = index.search(test_embeddings, k=100)

# compute exact scores for the top 100 results; slower but more accurate
distances, indices = index.search(test_embeddings, k=100, rescore=True)

# calibrate approximate scores by fitting a linear regression to N=20 exact scores;
# if set to -1, all scores are computed exactly
index.rescore_num_samples = 20
distances, indices = index.search(test_embeddings, k=100, rescore=True)
```
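Likewise, a hedged sketch of the faiss counterpart, mapping `pq_num_subvectors` to `m` and `pq_num_centroids` to `2 ** n_bits` (so pq_num_centroids=64 corresponds to n_bits=6):

```python
import faiss

d, m, n_bits = 512, 32, 6  # 2 ** 6 = 64 centroids per subquantizer
quantizer = faiss.IndexFlatIP(d)
faiss_index = faiss.IndexIVFPQ(quantizer, d, nlist, m, n_bits)
faiss_index.train(train_embeddings)
faiss_index.add(train_embeddings)
faiss_index.nprobe = 10
distances, indices = faiss_index.search(test_embeddings, 100)
```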
## CompressedFastIVF
Trains a Keras autoencoder to compress embeddings to a lower dimensionality (requires tensorflow>=2.13):
```python
from fast_ivf import CompressedFastIVF

nlist = 1024
index = CompressedFastIVF(512, nlist=nlist, compression_ndim=128)
index.train(train_embeddings)
index.nprobe = 10
index.ratio_threshold = 0.5
distances, indices = index.search(test_embeddings, k=100)

# compute exact scores for the top 100 results; slower but more accurate
distances, indices = index.search(test_embeddings, k=100, rescore=True)
```
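The autoencoder itself is internal to the library; below is only a minimal sketch of the idea, assuming tensorflow>=2.13 is installed (the layer sizes and training settings are illustrative, not the library's actual architecture):

```python
import numpy as np
import tensorflow as tf

ndim, compression_ndim = 512, 128

# encoder maps embeddings to the bottleneck dimension used at search time
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(ndim,)),
    tf.keras.layers.Dense(compression_ndim),
])
# decoder reconstructs the original embedding from the bottleneck
decoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(compression_ndim,)),
    tf.keras.layers.Dense(ndim),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

x = np.random.rand(10000, ndim).astype(np.float32)
autoencoder.fit(x, x, epochs=5, batch_size=256, verbose=0)  # reconstruct the input
compressed = encoder.predict(x, verbose=0)  # shape (10000, 128): compressed embeddings
```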