![SVS Logo](https://raw.githubusercontent.com/Rhobota/svs/main/logos/svs.png)
# Stupid Vector Store (SVS)
[![PyPI - Version](https://img.shields.io/pypi/v/svs.svg)](https://pypi.org/project/svs)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/svs.svg)](https://pypi.org/project/svs)
![Test Status](https://github.com/Rhobota/svs/actions/workflows/test.yml/badge.svg?branch=main)
[![Downloads](https://static.pepy.tech/badge/svs)](https://pepy.tech/project/svs)
- 🤔 What is SVS?
  - Semantic search via deep-learning vector embeddings.
  - A stupid-simple library for storing and retrieving your documents.
- 💩 Why is it _stupid_?
  - Because it just uses [SQLite](https://www.sqlite.org/) and [NumPy](https://numpy.org/). Nothing fancy.
  - That is our core design choice. We want something _stupid simple_, yet _reasonably fast_.
- 🧠 Is it possibly... _smart_ in any way though?
  - Maybe.
  - **It will squeeze the most juice from your machine: 🍊**
    - Optimized SQL
    - Cache-friendly memory access
    - Fast in the places that matter 🚀
    - All with a simple Python interface
  - Supports storing arbitrary metadata with each document. 🗃️
  - Supports storing and querying (optional) parent-child relationships between documents. 👪
    - Fully hierarchical - parents can have parents, children can have children, whatever you need...
  - Supports storing an (optional) graph structure over your documents.
    - So you can do GraphRAG!
    - Batteries _not_ included:
      - This library only handles graph _storage_.
      - You have to implement your own graph _algorithms_.
  - Supports generic key/value storage, for those random things you don't know where else to put. 🤷
  - Both **sync** and **asyncio** implementations:
    - use the synchronous impl (`svs.KB`) for scripts, notebooks, etc
    - use the asyncio impl (`svs.AsyncKB`) for web-services, etc
  - 100% Python type hints!
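
Since graph _algorithms_ are left to you, here is a minimal sketch of what one might look like: a breadth-first expansion that finds every document within a few hops of a starting document, the kind of step a GraphRAG pipeline might run after retrieval. It assumes you have already loaded your edges into plain `(source_id, target_id)` pairs; the function names and the edge format are hypothetical and are _not_ part of the SVS API.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def build_adjacency(edges: Iterable[Edge]) -> Dict[Hashable, List[Hashable]]:
    """Build an undirected adjacency list from (source, target) pairs."""
    adjacency: Dict[Hashable, List[Hashable]] = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, []).append(source)
    return adjacency


def neighbors_within(
    adjacency: Dict[Hashable, List[Hashable]],
    start: Hashable,
    max_hops: int,
) -> Dict[Hashable, int]:
    """Breadth-first search: map each node within max_hops to its hop distance."""
    distances: Dict[Hashable, int] = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances
```

For example, `neighbors_within(build_adjacency(edges), some_doc_id, max_hops=2)` would give you the IDs of documents to pull in as extra context around `some_doc_id`.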
## Overview
SVS is stupid yet can handle a million documents on commodity hardware, so it's probably perfect for you.
**Should you use SVS?** SVS is designed for the use-case where:
1. you have fewer than a million documents, and
2. you don't add/remove documents very often.
If that's you, then SVS will probably be the simplest (and _stupidest_) way to manage your document vectors!
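
Why is a million documents a sensible ceiling for commodity hardware? A rough back-of-the-envelope estimate helps. The sketch below uses the 1,536-dimensional embeddings from the benchmarks in this README; the 4-bytes-per-value (float32) figure is an assumption for illustration, not a confirmed detail of how SVS stores vectors.

```python
def embedding_matrix_gib(n_docs: int, dim: int = 1536, bytes_per_value: int = 4) -> float:
    """Approximate RAM needed to hold every document vector, in GiB."""
    return n_docs * dim * bytes_per_value / 2**30


# ASSUMPTION: float32 values (4 bytes each); adjust bytes_per_value otherwise.
print(f"1,000,000 documents: {embedding_matrix_gib(1_000_000):.2f} GiB")
print(f"   10,548 documents: {embedding_matrix_gib(10_548):.3f} GiB")
```

Under those assumptions a million vectors need about 5.7 GiB, which fits in the RAM of an ordinary desktop but would start to hurt well beyond that.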
## Table of Contents
- [Installation](#installation)
- [Used By](#used-by)
- [Quickstart](#quickstart)
- [Speed & Benchmarks](#speed--benchmarks)
- [Debug Logging](#debug-logging)
- [License](#license)
## Installation
```console
pip install -U svs
```
## Used By
SVS is used in production by:
[![AutoAuto](https://raw.githubusercontent.com/Rhobota/svs/main/logos/autoauto.png)](https://www.autoauto.ai/)
## Quickstart
Here is the _simplest_ use-case; it just queries a pre-built knowledge base!
This particular example queries a knowledge base of "Dad Jokes" 🤩.
(taken from [./examples/quickstart.py](./examples/quickstart.py))
```python
import svs  # <-- pip install -U svs

import os
from dotenv import load_dotenv; load_dotenv()
assert os.environ.get('OPENAI_API_KEY'), "You must set your OPENAI_API_KEY environment variable!"

#
# The database remembers which embeddings provider (e.g. OpenAI) was used.
#
# The "Dad Jokes" database below uses OpenAI embeddings, so that's why you had
# to set your OPENAI_API_KEY above!
#
# NOTE: The first time you run this script it will download this database,
#       so expect that to take a few seconds...
#
DB_URL = 'https://github.com/Rhobota/svs/raw/main/examples/dad_jokes/dad_jokes.sqlite.gz'


def demo() -> None:
    kb = svs.KB(DB_URL)

    records = kb.retrieve('chicken', n = 10)

    for record in records:
        score = record['score']
        text = record['doc']['text']
        print(f" 😆 score={score:.4f}: {text}\n")

    kb.close()


if __name__ == '__main__':
    demo()
```
⚠️ **Want to see how that _Dad Jokes_ knowledge base was created?** See: [./examples/dad_jokes/Build Dad Jokes KB.ipynb](<./examples/dad_jokes/Build Dad Jokes KB.ipynb>)
## Speed & Benchmarks
**SQLite** and **NumPy** are fast, thus **SVS** is fast 🏎️. Our goal is to minimize the amount of work done at the Python-layer.
Also, your bottleneck will _almost certainly_ be the remote API calls to get document embeddings (e.g. calling out to OpenAI's API to get embeddings will be the _slowest_ thing), so it's likely not critical to further optimize the Python-layer bits.
The following benchmarks were performed on 2018-era commodity hardware (Intel i3-8100):
| Number of Documents | Load into SQLite | Get Embeddings for All Documents (remote API call) | Cosine Similarity + Sort + Retrieve Top-100 Documents [^3] |
|:---------------------------------- |:---------------- |:-------------------------------------------------- |:-------------------------------------------------------------- |
| 10,548 jokes [^1] | 0.07 seconds | 80 seconds | 0.5 seconds (first query) + 0.011 seconds (subsequent queries) |
| 1,000,000 synthetic documents [^2] | 8 seconds | 2 hours [^4] | 2 minutes (first query) + 0.24 seconds (subsequent queries) |
[^1]: This benchmark uses the Dad Jokes KB built in [this notebook](<./examples/dad_jokes/Build Dad Jokes KB.ipynb>).
[^2]: This benchmark is over one million synthetic documents, where those documents have an average length of 1,200 characters. See [this notebook](<./examples/One Million Documents Benchmark.ipynb>).
[^3]: This time does _not_ include the time it takes to obtain the query string's embedding from the external service (i.e. from OpenAI's API); rather, it captures the time it takes to (1) compute the cosine similarity of the query string with _all_ the documents' vectors (where embedding dimensionality is 1,536), then (2) sort the results, and then (3) retrieve the top-100 documents from the database. Note: The first query is _slow_ because it must load the vectors from disk into RAM, while subsequent queries are _fast_ since those vectors stay cached in RAM.
[^4]: This is an estimate based on the observed typical response times from OpenAI's embeddings API. For this test, we generate synthetic embeddings with dimensionality 1,536 to simulate the correct datasize and computation requirements as if we used "real" embeddings.
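
To make the "cosine similarity + sort + top-100" step concrete, here is a minimal NumPy sketch of that kind of brute-force search. It is an illustration under stated assumptions, _not_ SVS's actual implementation: the function name `top_k_cosine` is hypothetical, the vectors are assumed to already be in RAM as one 2-D array, and every vector is assumed to be non-zero.

```python
import numpy as np


def top_k_cosine(doc_vectors: np.ndarray, query: np.ndarray, k: int = 100):
    """Return (indices, scores) of the k rows most similar to query, best first."""
    # (1) Cosine similarity of the query against *all* document vectors.
    doc_norms = np.linalg.norm(doc_vectors, axis=1)
    query_norm = np.linalg.norm(query)
    scores = (doc_vectors @ query) / (doc_norms * query_norm)
    # (2) Sort by descending score and (3) keep the top k.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Small synthetic demo with the same dimensionality as the benchmarks.
rng = np.random.default_rng(0)
docs = rng.standard_normal((1_000, 1536)).astype(np.float32)
indices, scores = top_k_cosine(docs, docs[42], k=5)
print(indices, scores)
```

The whole search is one matrix-vector product plus a sort, which is why subsequent queries are quick once the vectors are cached in RAM.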
## Debug Logging
This library logs using Python's built-in `logging` module. It logs mostly at the `INFO` level, so here's a snippet of code you can put in _your_ app to see those traces:
```python
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
)
# ... now use SVS as you normally would, but you'll see extra log traces!
```
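
If turning on `INFO` globally makes your other dependencies too chatty, you can raise the level for just one logger instead. The sketch below assumes SVS's loggers are named under `svs` (the usual convention when a library calls `logging.getLogger(__name__)`); that name is an assumption, so check the `%(name)s` field in your log output to confirm it.

```python
import logging

# Keep everything else quiet...
logging.basicConfig(
    level=logging.WARNING,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
)

# ...but show INFO traces from SVS only.
# ASSUMPTION: the library's loggers live under the 'svs' name.
logging.getLogger('svs').setLevel(logging.INFO)
```

Child loggers inherit their effective level from their parent, so this one line also covers any `svs.<submodule>` loggers.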
## License
`svs` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
Raw data
{
"_id": null,
"home_page": null,
"name": "svs",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "database, embeddings, search, semantic, store, stupid, vector",
"author": null,
"author_email": "Ryan Henning <ryan@rhobota.com>",
"download_url": "https://files.pythonhosted.org/packages/b2/58/9b2aab8e2fd79710538cfd97815eb3eea042b6fcd9113bcc1e8ec3b38c93/svs-0.6.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Stupid Vector Store (SVS): a vector database for the rest of us",
"version": "0.6.1",
"project_urls": {
"Documentation": "https://github.com/Rhobota/svs#readme",
"Issues": "https://github.com/Rhobota/svs/issues",
"Source": "https://github.com/Rhobota/svs"
},
"split_keywords": [
"database",
" embeddings",
" search",
" semantic",
" store",
" stupid",
" vector"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "eec78bddcff60e1f429d386fd40e43acb3f1c413dd38dab93f90df6b3bd0d168",
"md5": "9168fe65f973ce72342f4d6b3f48be11",
"sha256": "1dc86ca3060fa16b6d6927813c3c70bd59bdbfe261b6c2821325b3ef7ba18a89"
},
"downloads": -1,
"filename": "svs-0.6.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9168fe65f973ce72342f4d6b3f48be11",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 21357,
"upload_time": "2024-11-09T01:46:07",
"upload_time_iso_8601": "2024-11-09T01:46:07.861579Z",
"url": "https://files.pythonhosted.org/packages/ee/c7/8bddcff60e1f429d386fd40e43acb3f1c413dd38dab93f90df6b3bd0d168/svs-0.6.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b2589b2aab8e2fd79710538cfd97815eb3eea042b6fcd9113bcc1e8ec3b38c93",
"md5": "d24baf38bf0543bc362260917c99d5c6",
"sha256": "d7b53bc1192cc59814677068f4e9d2d1184905e8ebd5b3f5d5fc18b562b65c60"
},
"downloads": -1,
"filename": "svs-0.6.1.tar.gz",
"has_sig": false,
"md5_digest": "d24baf38bf0543bc362260917c99d5c6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 24499774,
"upload_time": "2024-11-09T01:46:05",
"upload_time_iso_8601": "2024-11-09T01:46:05.232528Z",
"url": "https://files.pythonhosted.org/packages/b2/58/9b2aab8e2fd79710538cfd97815eb3eea042b6fcd9113bcc1e8ec3b38c93/svs-0.6.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-09 01:46:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Rhobota",
"github_project": "svs#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "svs"
}