vector-lake


- Name: vector-lake
- Version: 0.0.4
- Home page: https://github.com/msoedov/vector_lake
- Summary: S3 vector database for bigdata
- Upload time: 2023-08-15 10:27:57
- Maintainer: Alexander Miasoiedov
- Author: Alexander Miasoiedov
- Requires Python: >=3.9,<4.0
- License: MIT
- Keywords: vector, database, bigdata
- Requirements: none recorded

# VectorLake

VectorLake is a robust vector database designed for low maintenance and low cost, with efficient storage and approximate nearest neighbor (ANN) querying of vector data of any size distributed across S3 files.

<p>
<img alt="GitHub Contributors" src="https://img.shields.io/github/contributors/msoedov/vector_lake" />
<img alt="GitHub Last Commit" src="https://img.shields.io/github/last-commit/msoedov/vector_lake" />
<img alt="" src="https://img.shields.io/github/repo-size/msoedov/vector_lake" />
<img alt="GitHub Issues" src="https://img.shields.io/github/issues/msoedov/vector_lake" />
<img alt="GitHub Pull Requests" src="https://img.shields.io/github/issues-pr/msoedov/vector_lake" />
<img alt="Github License" src="https://img.shields.io/github/license/msoedov/vector_lake" />
</p>

## 🏷 Features

- Inspired by the article [Which Vector Database Should I Use? A Comparison Cheatsheet](https://navidre.medium.com/which-vector-database-should-i-use-a-comparison-cheatsheet-cb330e55fca)

- Trade-off Design: VectorLake is built to minimize database maintenance and cost while providing custom data partitioning strategies.

- Native Big Data Support: Specifically designed to handle large datasets, making it ideal for big data projects.

- Vector Data Handling: Capable of storing and querying high-dimensional vectors, commonly used for embedding storage in machine learning projects.

- Efficient Search: Efficient nearest neighbor search for querying similar vectors in high-dimensional spaces.

- Data Persistence: Supports data persistence on disk, network volumes, and S3, enabling long-term storage and retrieval of indexed data.

- Customizable Partitioning: Supports custom data partitioning strategies, such as grouping vectors by a partition key.

- Native support for LLM agents.

- Feature store for experimental data.
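
To make "nearest neighbor search" concrete, here is a minimal sketch of the exact (brute-force) search that an ANN index approximates. It uses only NumPy and does not call VectorLake; the function name `exact_nearest` is a hypothetical helper introduced for illustration.

```python
import numpy as np


def exact_nearest(vectors: np.ndarray, query: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k rows of `vectors` closest to `query`.

    Hypothetical helper: an exact Euclidean scan over every vector,
    not VectorLake's own search routine.
    """
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k]


rng = np.random.default_rng(0)
vectors = rng.random((100, 5))   # 100 example vectors of dimension 5
query = vectors[42] + 1e-6       # a query that sits almost on top of row 42
nearest = exact_nearest(vectors, query, k=3)
print(nearest)
```

An exact scan like this costs time proportional to the number of stored vectors per query, which is the cost an approximate index is meant to avoid on large datasets. It is still useful as a ground truth when checking the quality of approximate results.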

## 📦 Installation

To get started with VectorLake, simply install the package using pip:

```shell
pip install vector_lake
```

## ⛓️ Quick Start

```python
import numpy as np
from vector_lake import VectorLake

D = 5  # dimensionality of each vector
N = 100  # number of example vectors

db = VectorLake(location="s3://vector-lake", dimension=D, approx_shards=243)
embeddings = np.random.rand(N, D)

# Add each embedding with its metadata and source document, then write to storage
for em in embeddings:
    db.add(em, metadata={}, document="some document")
db.persist()

# Re-open the persisted store and query it
db = VectorLake(location="s3://vector-lake", dimension=D, approx_shards=243)
db.query([0.56325391, 0.1500543, 0.88579166, 0.73536349, 0.7719873])

```

### Custom feature partition

Use a custom partition to group features by a custom category:

```python
import numpy as np
from vector_lake.core.index import Partition

if __name__ == "__main__":
    db = Partition(location="s3://vector-lake", partition_key="feature", dimension=5)
    N = 100  # for example
    D = 5  # Dimensionality of each vector
    embeddings = np.random.rand(N, D)

    for em in embeddings:
        db.add(em, metadata={}, document="some document")
    db.persist()

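    # Re-open the persisted partition with the same partition key and query it
    db = Partition(location="s3://vector-lake", partition_key="feature", dimension=5)
    db.buckets  # inspect the partition's buckets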
    db.query([0.56325391, 0.1500543, 0.88579166, 0.73536349, 0.7719873])

```
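
The idea of grouping vectors by a category can be sketched independently of the library. The example below is a conceptual illustration only: it assumes the category is stored in each record's metadata under the partition key, which this README does not confirm, and `group_by_key` is a hypothetical helper, not part of `vector_lake`.

```python
from collections import defaultdict

import numpy as np


def group_by_key(records, partition_key):
    """Group (vector, metadata) pairs into one array per metadata[partition_key].

    Hypothetical helper for illustration; assumes every record's metadata
    contains `partition_key`.
    """
    buckets = defaultdict(list)
    for vector, metadata in records:
        buckets[metadata[partition_key]].append(vector)
    # Stack each bucket's vectors into a single (n, d) array
    return {name: np.stack(vs) for name, vs in buckets.items()}


records = [
    (np.array([0.1, 0.2]), {"feature": "color"}),
    (np.array([0.3, 0.4]), {"feature": "shape"}),
    (np.array([0.5, 0.6]), {"feature": "color"}),
]
buckets = group_by_key(records, "feature")
print({name: arr.shape for name, arr in buckets.items()})
```

Keeping each category in its own bucket means a query scoped to one category only has to scan that bucket's vectors.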

### Local persistent volume

```python
import numpy as np
from vector_lake import VectorLake

D = 5  # dimensionality of each vector
N = 100  # number of example vectors

# Point `location` at a local or network-mounted directory instead of S3
db = VectorLake(location="/mnt/db", dimension=D, approx_shards=243)
embeddings = np.random.rand(N, D)

for em in embeddings:
    db.add(em, metadata={}, document="some document")
db.persist()

# Re-open the persisted store and query it
db = VectorLake(location="/mnt/db", dimension=D, approx_shards=243)
db.query([0.56325391, 0.1500543, 0.88579166, 0.73536349, 0.7719873])

```

## LangChain Retrieval

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from vector_lake.langchain import VectorLakeStore

loader = TextLoader("Readme.md")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = VectorLakeStore.from_documents(documents=docs, embedding=embedding)

query = "What is Vector Lake?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

```

## Why VectorLake?

VectorLake gives you the functionality of a simple, resilient vector database with easy setup and low operational overhead. The result is a lightweight, reliable distributed vector store.

VectorLake leverages Hierarchical Navigable Small World (HNSW) for data partitioning across all vector data shards. This ensures that each modification to the system aligns with vector distance. You can learn more about the design here.
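
As a rough intuition for distance-aligned sharding, the sketch below routes each vector to the shard whose centroid is nearest and then probes only the closest shards at query time. This deliberately swaps in simple centroid-based routing in place of HNSW, which is a graph-based index; it is not VectorLake's implementation, and `assign_shards` and `shards_to_probe` are hypothetical names.

```python
import numpy as np


def assign_shards(vectors: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Return, for each vector, the index of its nearest centroid (its shard)."""
    # Pairwise distances with shape (num_vectors, num_shards)
    distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)


def shards_to_probe(query: np.ndarray, centroids: np.ndarray, n_probe: int = 2) -> np.ndarray:
    """Return the n_probe shard indices whose centroids are closest to the query."""
    distances = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(distances)[:n_probe]


rng = np.random.default_rng(0)
vectors = rng.random((100, 5))
# Assumption for the sketch: use 4 randomly chosen vectors as shard centroids
centroids = vectors[rng.choice(len(vectors), size=4, replace=False)]

shard_ids = assign_shards(vectors, centroids)
probe = shards_to_probe(vectors[0], centroids, n_probe=2)
print(shard_ids[:10], probe)
```

Because nearby vectors land in the same shard, a query only needs to load the few shards closest to it rather than every file, which is the property that matters when shards live in S3.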

### Limitations

TBD

## 🛠️ Roadmap

## 👋 Contributing

Contributions to VectorLake are welcome! If you'd like to contribute, please follow these steps:

- Fork the repository on GitHub
- Create a new branch for your changes
- Commit your changes to the new branch
- Push your changes to the forked repository
- Open a pull request to the main VectorLake repository

Before contributing, please read the contributing guidelines.

## License

VectorLake is released under the MIT License.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/msoedov/vector_lake",
    "name": "vector-lake",
    "maintainer": "Alexander Miasoiedov",
    "docs_url": null,
    "requires_python": ">=3.9,<4.0",
    "maintainer_email": "msoedov@gmail.com",
    "keywords": "vector,database,bigdata",
    "author": "Alexander Miasoiedov",
    "author_email": "msoedov@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/8c/74/447e304ae7beed1bfff7ca3854fa438abeb8d2fdcb48771863c7113c5673/vector_lake-0.0.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "S3 vector database for bigdata",
    "version": "0.0.4",
    "project_urls": {
        "Homepage": "https://github.com/msoedov/vector_lake",
        "Repository": "https://github.com/msoedov/vector_lake"
    },
    "split_keywords": [
        "vector",
        "database",
        "bigdata"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8c74447e304ae7beed1bfff7ca3854fa438abeb8d2fdcb48771863c7113c5673",
                "md5": "08cb22a5907aa8f4214ab5090f6468f7",
                "sha256": "1eaccf2025e65200633b6f22527e048caf9c77f8cda1f35a061297256151dc5e"
            },
            "downloads": -1,
            "filename": "vector_lake-0.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "08cb22a5907aa8f4214ab5090f6468f7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9,<4.0",
            "size": 14389,
            "upload_time": "2023-08-15T10:27:57",
            "upload_time_iso_8601": "2023-08-15T10:27:57.971224Z",
            "url": "https://files.pythonhosted.org/packages/8c/74/447e304ae7beed1bfff7ca3854fa438abeb8d2fdcb48771863c7113c5673/vector_lake-0.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-15 10:27:57",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "msoedov",
    "github_project": "vector_lake",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "vector-lake"
}
        