# DiskVectorIndex - Ultra-Low Memory Vector Search on Large Datasets
Indexing large datasets (100M+ embeddings) requires a lot of memory in most vector databases: for 100M documents/embeddings, most vector databases need about **500GB of memory**, which drives your server costs accordingly high.
This repository lets you search very large datasets (100M+ documents) with just **300MB of memory**, making semantic search at this scale feasible for memory-constrained developers.
We provide various pre-built indices that can be used for semantic search and to power your RAG applications.
## Pre-Built Indices
Below you find the available pre-built indices. The index files are downloaded on the first call; their size is listed under Index Size. Most of the embeddings are memory-mapped from disk: for example, the `Cohere/trec-rag-2024-index` corpus needs 15 GB of disk but just 380 MB of memory to load the index.
| Name | Description | #Docs | Index Size (GB) | Memory Needed |
| --- | --- | :---: | :---: | :---: |
| [Cohere/trec-rag-2024-index](https://huggingface.co/datasets/Cohere/trec-rag-2024-index) | Segmented corpus for [TREC RAG 2024](https://trec-rag.github.io/annoucements/2024-corpus-finalization/) | 113,520,750 | 15GB | 380MB |
| fineweb-edu-10B-index (soon) | 10B token sample from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed on document level. | 9,267,429 | 1.4GB | 230MB |
| fineweb-edu-100B-index (soon) | 100B token sample from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed on document level. | 69,672,066 | 9.2GB | 380MB |
| fineweb-edu-350B-index (soon) | 350B token sample from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed on document level. | 160,198,578 | 21GB | 380MB |
| fineweb-edu-index (soon) | Full 1.3T token dataset [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) embedded and indexed on document level. | 324,322,256 | 42GB | 285MB |
Each index comes with its respective corpus, which is chunked into smaller parts. These chunks are downloaded on demand and reused for subsequent queries.
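The on-demand chunk reuse described above amounts to a small disk cache. The following is an illustrative sketch only (the cache path and `fetch_chunk` stand-in are assumptions, not the package's actual internals): the first query touching a chunk downloads it, later queries read the cached copy.

```python
import json
import pathlib

# Hypothetical sketch of an on-demand corpus-chunk cache.
# CACHE_DIR and fetch_chunk are illustrative names, not the real API.
CACHE_DIR = pathlib.Path("corpus_cache")

def fetch_chunk(chunk_id):
    # Placeholder for the actual HTTP download of one corpus chunk.
    return [{"doc_id": f"{chunk_id}-{i}", "text": "..."} for i in range(2)]

def get_chunk(chunk_id):
    path = CACHE_DIR / f"chunk_{chunk_id}.json"
    if path.exists():
        # Chunk was fetched by an earlier query; reuse the local copy.
        return json.loads(path.read_text())
    docs = fetch_chunk(chunk_id)
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(docs))
    return docs
```

This keeps repeat queries fast while only paying download cost for the chunks a query actually touches.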
## Getting Started
Get your free **Cohere API key** from [cohere.com](https://cohere.com) and set it as an environment variable:
```
export COHERE_API_KEY=your_api_key
```
Install the package:
```
pip install DiskVectorIndex
```
You can then search via:
```python
from DiskVectorIndex import DiskVectorIndex
index = DiskVectorIndex("Cohere/trec-rag-2024-index")
while True:
    query = input("\n\nEnter a question: ")
    docs = index.search(query, top_k=3)
    for doc in docs:
        print(doc)
        print("=========")
```
You can also load a fully downloaded index from disk via:
```python
from DiskVectorIndex import DiskVectorIndex
index = DiskVectorIndex("path/to/index")
```
## How does it work?
The Cohere embeddings have been optimized to work well in compressed vector space, as detailed in our [Cohere int8 & binary Embeddings blog post](https://cohere.com/blog/int8-binary-embeddings). The embeddings have been trained not only to work in float32, which requires a lot of memory, but also to operate well under int8, binary, and Product Quantization (PQ) compression.
The above indices use Product Quantization (PQ) to compress each embedding from its original 1024 * 4 = 4096 bytes (1024 float32 dimensions) to just 128 bytes, a 32x memory reduction.
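The arithmetic behind that 32x claim can be checked directly. This sketch assumes 1024-dimensional float32 embeddings and 128-byte PQ codes, as stated above; the corpus-scale figures use the TREC RAG 2024 document count from the table:

```python
# Memory footprint of one embedding before and after Product Quantization (PQ).
DIMS = 1024              # embedding dimensionality
BYTES_PER_FLOAT32 = 4

raw_bytes = DIMS * BYTES_PER_FLOAT32   # 4096 bytes per uncompressed vector
pq_bytes = 128                         # one byte per PQ sub-vector code
compression = raw_bytes // pq_bytes    # 32x smaller

# Scaled to the TREC RAG 2024 corpus (113,520,750 embeddings):
n_docs = 113_520_750
raw_gb = n_docs * raw_bytes / 1e9      # ~465 GB of raw float32 vectors
pq_gb = n_docs * pq_bytes / 1e9        # ~14.5 GB of PQ codes

print(compression, round(raw_gb), round(pq_gb, 1))
```

The ~465 GB raw figure matches the "about 500GB" estimate in the introduction, and the ~14.5 GB of PQ codes matches the 15 GB on-disk size of `Cohere/trec-rag-2024-index`.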
Further, we use [faiss](https://github.com/facebookresearch/faiss) with a memory-mapped IVF index: only a small fraction of the embeddings (between 32,768 and 131,072 vectors, depending on the index) must be held in memory.
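To see why an IVF (inverted file) index needs so few vectors in memory, here is a toy pure-Python sketch, not the faiss on-disk format: vectors are bucketed under their nearest coarse centroid, and a query scans only the `nprobe` closest buckets, so only the centroids must stay resident while the bucket contents can live on disk. (In faiss itself, an index can be opened memory-mapped with `faiss.read_index(path, faiss.IO_FLAG_MMAP)`.)

```python
import random

random.seed(0)

def dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy IVF: 16 coarse centroids partition the vector space into buckets.
centroids = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
buckets = {i: [] for i in range(len(centroids))}

def add(vec):
    # Assign the vector to the bucket of its nearest centroid.
    c = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
    buckets[c].append(vec)

def search(query, top_k=3, nprobe=2):
    # Scan only the nprobe buckets whose centroids are closest to the query;
    # the remaining buckets are never touched (and could stay on disk).
    probe = sorted(range(len(centroids)),
                   key=lambda i: dist(query, centroids[i]))[:nprobe]
    candidates = [v for i in probe for v in buckets[i]]
    return sorted(candidates, key=lambda v: dist(query, v))[:top_k]

for _ in range(1000):
    add([random.gauss(0, 1) for _ in range(8)])

results = search([random.gauss(0, 1) for _ in range(8)])
```

With `nprobe=2` of 16 buckets, a query inspects roughly an eighth of the stored vectors; the real indices apply the same idea at 100M+ scale, which is where the 230-380 MB memory figures come from.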
## Need Semantic Search at Scale?
At [Cohere](https://cohere.com) we have helped customers run semantic search on tens of billions of embeddings, at a fraction of the cost. Feel free to reach out to [Nils Reimers](mailto:nils@cohere.com) if you need a solution that scales.