sahomedb


Namesahomedb JSON
Version 0.4.0 PyPI version JSON
download
home_pagehttps://sahome.com/
SummaryFast embedded vector database with incremental HNSW indexing.
upload_time2024-08-18 14:24:38
maintainerNone
docs_urlNone
authorElijah Obs, Samson
requires_python>=3.8
licenseApache-2.0
keywords embedded vector database hnsw ann
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            ![SahomeDB Use Case](https://i.postimg.cc/SR0MJRFF/sahomedb.png)

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=for-the-badge)](https://opensource.org/licenses/Apache-2.0) [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg?style=for-the-badge)](/docs/code_of_conduct.md) [![Discord](https://img.shields.io/discord/1182432298382131200?logo=discord&logoColor=%23ffffff&label=Discord&style=for-the-badge)](https://discord.gg/f7qHc9CK)
[![Crates.io](https://img.shields.io/crates/d/sahomedb?style=for-the-badge&label=Crates.io&color=%23f43f5e)](https://crates.io/crates/sahomedb)


# 👋 Meet SahomeDB

SahomeDB is a SQLite-inspired **lightweight** and **easy to use** embedded vector database. It is designed to be embedded directly inside your AI application. It is written in Rust and uses [Sled](https://docs.rs/sled) as its persistence storage engine to save vector collections to the disk.

SahomeDB implements **HNSW** (Hierachical Navigable Small World) as its indexing algorithm. It is a state-of-the-art algorithm that is used by many vector databases. It is fast and scales well to large datasets.

## Why SahomeDB?

SahomeDB is very flexible for use cases related with vector search such as using RAG (Retrieval-Augmented Generation) method with an LLM to generate a context-aware output. These are some of the reasons why you might want to use SahomeDB:

⭐️ **Embedded database**: SahomeDB doesn't require you to set up a separate server and manage it. You can embed it directly into your application and use its simple API like a regular library.

⭐️ **Optional persistence**: You can choose to persist the vector collection to the disk or keep it in memory. By default, whenever you use a collection, it will be loaded to the memory to ensure that the search performance is high.

⭐️ **Incremental operations**: SahomeDB allows you to add, remove, or modify vectors from collections without having to rebuild indexes. This allows for a more flexible and efficient approach on storing your vector data.

⭐ **Flexible schema**: Along with the vectors, you can store additional metadata for each vector. This is useful for storing information about the vectors such as the original text, image URL, or any other data that you want to associate with the vectors.

# ⚙️ Quickstart with Rust

To get started with SahomeDB in Rust, you need to add `sahomedb` to your `Cargo.toml`.You can do so by running the command below which will add the latest version of SahomeDB to your project.

```bash
cargo add sahomedb
```

After that, you can use the code snippet below as a reference to get started with SahomeDB. In short, use `Collection` to store your vector records or search similar vector and use `Database` to persist a vector collection to the disk.

In short, use `Collection` to store your vector records or search similar vector and use `Database` to persist a vector collection to the disk.

```rust
use sahomedb::prelude::*;

fn main() {
    // Vector dimension must be uniform.
    let dimension = 128;

    // Replace with your own data.
    let records = Record::many_random(dimension, 100);

    let mut config = Config::default();

    // Optionally set the distance function. Default to Euclidean.
    config.distance = Distance::Cosine;

    // Create a vector collection.
    let collection = Collection::build(&config, &records).unwrap();

    // Optionally save the collection to persist it.
    let mut db = Database::new("data/test").unwrap();
    db.save_collection("vectors", &collection).unwrap();

    // Search for the nearest neighbors.
    let query = Vector::random(dimension);
    let result = collection.search(&query, 5).unwrap();
    println!("Nearest ID: {}", result[0].id);
}
```
## Dealing with Metadata

In SahomeDB, you can store additional metadata for each vector which is useful to associate the vectors with other data. The code snippet below shows how to insert the `Metadata` to the `Record` or extract it.

```rust
use sahomedb::prelude::*;

fn main() {
    // Inserting a metadata value into a record.
    let data: &str = "This is an example.";
    let vector = Vector::random(128);
    let record = Record::new(&vector, &data.into());

    // Extracting the metadata value.
    let metadata = record.data.clone();
    let data = match metadata {
        Metadata::Text(value) => value,
        _ => panic!("Data is not a text."),
    };

    println!("{}", data);
}
```

# 🐍 Quickstart with Python
SahomeDB also provides a Python binding which allows you to add it directly to your project. You can install the Python library of SahomeDB by running the command below:

```bash
pip install sahomedb
```
```python
from sahomedb.prelude import *


if __name__ == "__main__":
    # Open the database.
    db = Database("data/example")

    # Replace with your own records.
    records = Record.many_random(dimension=128, len=100)

    # Create a vector collection.
    config = Config.create_default()
    collection = Collection.from_records(config, records)

    # Optionally, persist the collection to the database.
    db.save_collection("my_collection", collection)

    # Replace with your own query.
    query = Vector.random(128)

    # Search for the nearest neighbors.
    result = collection.search(query, n=5)

    # Print the result.
    print("Nearest neighbors ID: {}".format(result[0].id))
```

If you want to learn more about using SahomeDB for real-world applications, you can check out the this Google Colab notebook which demonstrates how to use SahomeDB to build a simple image similarity search engine: [Image Search Engine with SahomeDB](https://colab.research.google.com/drive/1R2tZ0dM3-BoFPzuOdtXHQOUbBHnYukWL?usp=sharing)

# 🎯 Benchmarks

SahomeDB  uses a built-in benchmarking suite using Rust's [Criterion](https://docs.rs/criterion) crate which we use to measure the performance of the vector database.

Currently, the benchmarks are focused on the performance of the collection's vector search functionality. We are working on adding more benchmarks to measure the performance of other operations.

If you are curious and want to run the benchmarks, you can use the following command which will download the benchmarking dataset and run the benchmarks:

```bash
cargo bench
```

## Memory Usage

SahomeDB uses HNSW which is known to be a memory hog compared to other indexing algorithms. We decided to use it because of its performance even when storing large datasets of vectors with high dimension.

In the future, we might consider adding more indexing algorithms to make SahomeDB more flexible and to cater to different use cases. If you have any suggestions of which indexing algorithms we should add, please let us know.

Anyway, if you are curious about the memory usage of SahomeDB, you can use the command below to run the memory usage measurement script. You can tweak the parameters in the `examples/measure-memory.rs` file to see how the memory usage changes.

```bash
cargo run --example measure-memory
```

## Quick Results

Even though the results may vary depending on the hardware and the dataset, we want to give you a quick idea of the performance of SahomeDB. Here are some quick results from the benchmarks:

**10,000 vectors with 128 dimensions**

- Search time: 0.15 ms
- Memory usage: 6 MB

**1,000,000 vectors with 128 dimensions**

- Search time: 1.5 ms
- Memory usage: 600 MB

These results are from a machine with an Apple M2 CPU and 16 GB of RAM. The dataset used for the benchmarks is a random dataset generated by the `Record::many_random` function or SIFT datasets with additional random `usize` as its metadata.

# 🤝 Contributing

The easiest way to contribute to this project is to star this project and share it with your friends. This will help us grow the community and make the project more visible to others.

If you want to go further and contribute your expertise, we will gladly welcome your code contributions. For more information and guidance about this, please see [contributing.md](/docs/contributing.md).


If you have deep experience in the space but don't have the free time to contribute codes, we also welcome advices, suggestions, or feature requests. We are also looking for advisors to help guide the project direction and roadmap.

This project is still in the early stages of development. We are actively working on it and we expect the API and functionality to change. We do not recommend using this in production yet.

If you are interested about the project in any way, please join us on [Discord](https://discord.gg/pU2HmnCY). Help us grow the community and make SahomeDB better 😁


## Code of Conduct

We are committed to creating a welcoming community. Any participant in our project is expected to act respectfully and to follow the [Code of Conduct](/docs/code_of_conduct.md).

## Disclaimer

This project is still in the early stages of development. We are actively working on it and we expect the API and functionality to change. We do not recommend using this in production yet.

            

Raw data

            {
    "_id": null,
    "home_page": "https://sahome.com/",
    "name": "sahomedb",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "embedded, vector, database, hnsw, ann",
    "author": "Elijah Obs, Samson",
    "author_email": null,
    "download_url": null,
    "platform": null,
    "description": "![SahomeDB Use Case](https://i.postimg.cc/SR0MJRFF/sahomedb.png)\n\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=for-the-badge)](https://opensource.org/licenses/Apache-2.0) [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg?style=for-the-badge)](/docs/code_of_conduct.md) [![Discord](https://img.shields.io/discord/1182432298382131200?logo=discord&logoColor=%23ffffff&label=Discord&style=for-the-badge)](https://discord.gg/f7qHc9CK)\n[![Crates.io](https://img.shields.io/crates/d/sahomedb?style=for-the-badge&label=Crates.io&color=%23f43f5e)](https://crates.io/crates/sahomedb)\n\n\n# \ud83d\udc4b Meet SahomeDB\n\nSahomeDB is a SQLite-inspired **lightweight** and **easy to use** embedded vector database. It is designed to be embedded directly inside your AI application. It is written in Rust and uses [Sled](https://docs.rs/sled) as its persistence storage engine to save vector collections to the disk.\n\nSahomeDB implements **HNSW** (Hierachical Navigable Small World) as its indexing algorithm. It is a state-of-the-art algorithm that is used by many vector databases. It is fast and scales well to large datasets.\n\n## Why SahomeDB?\n\nSahomeDB is very flexible for use cases related with vector search such as using RAG (Retrieval-Augmented Generation) method with an LLM to generate a context-aware output. These are some of the reasons why you might want to use SahomeDB:\n\n\u2b50\ufe0f **Embedded database**: SahomeDB doesn't require you to set up a separate server and manage it. You can embed it directly into your application and use its simple API like a regular library.\n\n\u2b50\ufe0f **Optional persistence**: You can choose to persist the vector collection to the disk or keep it in memory. By default, whenever you use a collection, it will be loaded to the memory to ensure that the search performance is high.\n\n\u2b50\ufe0f **Incremental operations**: SahomeDB allows you to add, remove, or modify vectors from collections without having to rebuild indexes. This allows for a more flexible and efficient approach on storing your vector data.\n\n\u2b50 **Flexible schema**: Along with the vectors, you can store additional metadata for each vector. This is useful for storing information about the vectors such as the original text, image URL, or any other data that you want to associate with the vectors.\n\n# \u2699\ufe0f Quickstart with Rust\n\nTo get started with SahomeDB in Rust, you need to add `sahomedb` to your `Cargo.toml`.You can do so by running the command below which will add the latest version of SahomeDB to your project.\n\n```bash\ncargo add sahomedb\n```\n\nAfter that, you can use the code snippet below as a reference to get started with SahomeDB. In short, use `Collection` to store your vector records or search similar vector and use `Database` to persist a vector collection to the disk.\n\nIn short, use `Collection` to store your vector records or search similar vector and use `Database` to persist a vector collection to the disk.\n\n```rust\nuse sahomedb::prelude::*;\n\nfn main() {\n    // Vector dimension must be uniform.\n    let dimension = 128;\n\n    // Replace with your own data.\n    let records = Record::many_random(dimension, 100);\n\n    let mut config = Config::default();\n\n    // Optionally set the distance function. Default to Euclidean.\n    config.distance = Distance::Cosine;\n\n    // Create a vector collection.\n    let collection = Collection::build(&config, &records).unwrap();\n\n    // Optionally save the collection to persist it.\n    let mut db = Database::new(\"data/test\").unwrap();\n    db.save_collection(\"vectors\", &collection).unwrap();\n\n    // Search for the nearest neighbors.\n    let query = Vector::random(dimension);\n    let result = collection.search(&query, 5).unwrap();\n    println!(\"Nearest ID: {}\", result[0].id);\n}\n```\n## Dealing with Metadata\n\nIn SahomeDB, you can store additional metadata for each vector which is useful to associate the vectors with other data. The code snippet below shows how to insert the `Metadata` to the `Record` or extract it.\n\n```rust\nuse sahomedb::prelude::*;\n\nfn main() {\n    // Inserting a metadata value into a record.\n    let data: &str = \"This is an example.\";\n    let vector = Vector::random(128);\n    let record = Record::new(&vector, &data.into());\n\n    // Extracting the metadata value.\n    let metadata = record.data.clone();\n    let data = match metadata {\n        Metadata::Text(value) => value,\n        _ => panic!(\"Data is not a text.\"),\n    };\n\n    println!(\"{}\", data);\n}\n```\n\n# \ud83d\udc0d Quickstart with Python\nSahomeDB also provides a Python binding which allows you to add it directly to your project. You can install the Python library of SahomeDB by running the command below:\n\n```bash\npip install sahomedb\n```\n```python\nfrom sahomedb.prelude import *\n\n\nif __name__ == \"__main__\":\n    # Open the database.\n    db = Database(\"data/example\")\n\n    # Replace with your own records.\n    records = Record.many_random(dimension=128, len=100)\n\n    # Create a vector collection.\n    config = Config.create_default()\n    collection = Collection.from_records(config, records)\n\n    # Optionally, persist the collection to the database.\n    db.save_collection(\"my_collection\", collection)\n\n    # Replace with your own query.\n    query = Vector.random(128)\n\n    # Search for the nearest neighbors.\n    result = collection.search(query, n=5)\n\n    # Print the result.\n    print(\"Nearest neighbors ID: {}\".format(result[0].id))\n```\n\nIf you want to learn more about using SahomeDB for real-world applications, you can check out the this Google Colab notebook which demonstrates how to use SahomeDB to build a simple image similarity search engine: [Image Search Engine with SahomeDB](https://colab.research.google.com/drive/1R2tZ0dM3-BoFPzuOdtXHQOUbBHnYukWL?usp=sharing)\n\n# \ud83c\udfaf Benchmarks\n\nSahomeDB  uses a built-in benchmarking suite using Rust's [Criterion](https://docs.rs/criterion) crate which we use to measure the performance of the vector database.\n\nCurrently, the benchmarks are focused on the performance of the collection's vector search functionality. We are working on adding more benchmarks to measure the performance of other operations.\n\nIf you are curious and want to run the benchmarks, you can use the following command which will download the benchmarking dataset and run the benchmarks:\n\n```bash\ncargo bench\n```\n\n## Memory Usage\n\nSahomeDB uses HNSW which is known to be a memory hog compared to other indexing algorithms. We decided to use it because of its performance even when storing large datasets of vectors with high dimension.\n\nIn the future, we might consider adding more indexing algorithms to make SahomeDB more flexible and to cater to different use cases. If you have any suggestions of which indexing algorithms we should add, please let us know.\n\nAnyway, if you are curious about the memory usage of SahomeDB, you can use the command below to run the memory usage measurement script. You can tweak the parameters in the `examples/measure-memory.rs` file to see how the memory usage changes.\n\n```bash\ncargo run --example measure-memory\n```\n\n## Quick Results\n\nEven though the results may vary depending on the hardware and the dataset, we want to give you a quick idea of the performance of SahomeDB. Here are some quick results from the benchmarks:\n\n**10,000 vectors with 128 dimensions**\n\n- Search time: 0.15 ms\n- Memory usage: 6 MB\n\n**1,000,000 vectors with 128 dimensions**\n\n- Search time: 1.5 ms\n- Memory usage: 600 MB\n\nThese results are from a machine with an Apple M2 CPU and 16 GB of RAM. The dataset used for the benchmarks is a random dataset generated by the `Record::many_random` function or SIFT datasets with additional random `usize` as its metadata.\n\n# \ud83e\udd1d Contributing\n\nThe easiest way to contribute to this project is to star this project and share it with your friends. This will help us grow the community and make the project more visible to others.\n\nIf you want to go further and contribute your expertise, we will gladly welcome your code contributions. For more information and guidance about this, please see [contributing.md](/docs/contributing.md).\n\n\nIf you have deep experience in the space but don't have the free time to contribute codes, we also welcome advices, suggestions, or feature requests. We are also looking for advisors to help guide the project direction and roadmap.\n\nThis project is still in the early stages of development. We are actively working on it and we expect the API and functionality to change. We do not recommend using this in production yet.\n\nIf you are interested about the project in any way, please join us on [Discord](https://discord.gg/pU2HmnCY). Help us grow the community and make SahomeDB better \ud83d\ude01\n\n\n## Code of Conduct\n\nWe are committed to creating a welcoming community. Any participant in our project is expected to act respectfully and to follow the [Code of Conduct](/docs/code_of_conduct.md).\n\n## Disclaimer\n\nThis project is still in the early stages of development. We are actively working on it and we expect the API and functionality to change. We do not recommend using this in production yet.\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Fast embedded vector database with incremental HNSW indexing.",
    "version": "0.4.0",
    "project_urls": {
        "Homepage": "https://sahome.com/",
        "changelog": "https://github.com/Sahomey-Technologies/sahomedb/blob/main/docs/changelog.md",
        "issues": "https://github.com/Sahomey-Technologies/sahomedb/issues",
        "repository": "https://github.com/Sahomey-Technologies/sahomedb"
    },
    "split_keywords": [
        "embedded",
        " vector",
        " database",
        " hnsw",
        " ann"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4986ca6c3a311173250a6a1f7aa15653971d7b0c187241c65a533f8a83f0f74e",
                "md5": "d49c06c4d0ec3aa27ac56528010d9d81",
                "sha256": "c4320a3a9074a144f8b29487daa0c864c8c84b517821c4b4d0b5bb6bc33f5e09"
            },
            "downloads": -1,
            "filename": "sahomedb-0.4.0-cp310-cp310-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "d49c06c4d0ec3aa27ac56528010d9d81",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.8",
            "size": 647703,
            "upload_time": "2024-08-18T14:24:38",
            "upload_time_iso_8601": "2024-08-18T14:24:38.255078Z",
            "url": "https://files.pythonhosted.org/packages/49/86/ca6c3a311173250a6a1f7aa15653971d7b0c187241c65a533f8a83f0f74e/sahomedb-0.4.0-cp310-cp310-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-18 14:24:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Sahomey-Technologies",
    "github_project": "sahomedb",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "sahomedb"
}
        
Elapsed time: 4.54795s