mdb-toolkit


- Name: mdb-toolkit
- Version: 0.9.0
- Home page: https://github.com/ranfysvalle02/mdb_toolkit
- Summary: Custom MongoDB client with vector search capabilities, embeddings management, and more.
- Upload time: 2025-01-26 17:50:53
- Author: Fabian Valle
- Requires Python: >=3.7
- License: MIT
- Keywords: mongodb, vector search, embeddings, pymongo, custom client
- Requirements: none recorded

# mdb_toolkit

![](https://github.com/ranfysvalle02/mdb_toolkit/raw/main/demo.png)

# Less Code, More Power  

MongoDB's flexibility and PyMongo's robust driver make them a popular choice for database management in Python applications. While PyMongo's `MongoClient` class provides rich functionality, there are scenarios where adding custom methods can simplify repetitive tasks or improve the developer experience.

---  
      
### **Why Customize MongoClient?**
- **Streamlined Operations**: Simplify frequent tasks like listing databases and collections.
- **Encapsulation**: Abstract additional functionality into a single, reusable class.
- **Extensibility**: Add new methods to tailor MongoDB operations to your project’s needs.

---

### **Setting Up the Environment**
Before diving into code, we’ll need a MongoDB instance to work with. The following command starts a local MongoDB container:

```bash
docker run -d -p 27017:27017 --restart unless-stopped mongodb/mongodb-atlas-local
```

**OR**, if you already have a MongoDB Atlas cluster, keep its MongoDB URI handy, as you will need it to connect.
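
Either way, the connection string is the only thing that changes between the two setups. Here is a minimal sketch of reading it from the environment with a local fallback. The variable name `MONGODB_URI` and the helper `resolve_mongodb_uri` are assumptions made for illustration; they are not part of mdb_toolkit.

```python
import os
from typing import Mapping, Optional

# Local default matching the Docker command above (port 27017 published on localhost).
LOCAL_URI = "mongodb://localhost:27017/?directConnection=true"


def resolve_mongodb_uri(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the URI from MONGODB_URI (hypothetical variable name), else the local default."""
    env = os.environ if env is None else env
    uri = env.get("MONGODB_URI", "").strip()
    # An unset or blank variable falls back to the local container.
    return uri or LOCAL_URI


if __name__ == "__main__":
    print(resolve_mongodb_uri())
```

The resolved string can then be passed as the first argument when constructing the client.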

---

Integrating advanced search capabilities into your applications can often be complex and time-consuming. This MongoDB integration aims to help by **streamlining the process, reducing the amount of code you need to write, and handling embedding generation for you**.

#### **1. Effortless Embedding Integration**
Adding embeddings to your MongoDB documents takes very little code. The custom client, `CustomMongoClient`, takes an embedding function that you supply (`get_embedding`) and handles the generation and storage of embeddings for the fields you list in `fields_to_embed`. This means you can focus on building features rather than managing the intricacies of embedding processes.
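
For local experiments it can help to exercise this flow without calling an embedding API at all. The sketch below is a deterministic stand-in with the same `(text, model, dimensions)` shape as the `get_embedding` function in the example usage. `fake_embedding` is a hypothetical helper, and its vectors carry no semantic meaning, so it is only suitable for smoke tests.

```python
import hashlib
import math
from typing import List


def fake_embedding(text: str, model: str = "fake-model", dimensions: int = 256) -> List[float]:
    """Deterministic pseudo-embedding: the same text always yields the same unit-length vector.

    `model` is unused; it is kept only to mirror the signature of a real embedding function.
    """
    values: List[float] = []
    counter = 0
    while len(values) < dimensions:
        # Hash the text together with a counter to produce as many bytes as needed.
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        values.extend(b / 255.0 - 0.5 for b in digest)  # map each byte into [-0.5, 0.5]
        counter += 1
    values = values[:dimensions]
    # Normalize to unit length so cosine-based comparisons behave sensibly.
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]
```

Because the output depends only on the input text, repeated runs produce identical vectors, which keeps test fixtures stable.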

#### **2. Clean and Maintainable Codebase**
Say goodbye to cluttered and hard-to-maintain code! Our implementation consolidates essential operations—like creating search indexes, inserting documents with embeddings, and performing various types of searches—into a single, well-organized class. This not only reduces the number of lines you need to write but also enhances the readability and maintainability of your code.

#### **3. Versatile Search Capabilities**
Whether you need vector-based searches, keyword searches, or a combination of both, our integration has you covered. The `vector_search`, `keyword_search`, and `hybrid_search` methods provide flexible options to retrieve the most relevant documents efficiently. This versatility ensures that you can meet a wide range of search requirements with ease.
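
To make the vector-based option concrete: with a `cosine` distance metric, documents are ranked by the cosine of the angle between the query embedding and each stored embedding. The in-memory sketch below shows that ranking on toy vectors. It illustrates the scoring idea only and is not how MongoDB executes a vector search, which relies on a search index rather than a full scan. `cosine_similarity` and `rank_by_similarity` are names made up for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], docs: Dict[str, Sequence[float]], limit: int = 3
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the best `limit` matches."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy three-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_docs = {
    "ai": [0.9, 0.1, 0.0],
    "database": [0.0, 1.0, 0.0],
    "python": [0.1, 0.2, 0.9],
}
top = rank_by_similarity([1.0, 0.0, 0.0], toy_docs, limit=2)
print([name for name, _ in top])
```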

#### **4. Robust and Reliable Performance**
Built on MongoDB’s infrastructure, the client covers the full path from index creation to search execution. Comprehensive logging and error handling make it easier to identify and address issues when something goes wrong.

#### **5. Quick and Easy Deployment**
Configuration is a breeze with support for environment variables and seamless integration with OpenAI’s embedding API. Whether you’re deploying locally or scaling up in the cloud, our setup is designed to fit effortlessly into your existing workflow, allowing you to get started quickly without unnecessary hassle.

---

# mdb_toolkit

**mdb_toolkit** is a custom MongoDB client that integrates seamlessly with OpenAI's embedding models to provide advanced vector-based search capabilities. It enables semantic searches, keyword searches, and hybrid searches within your MongoDB collections.
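
For orientation, MongoDB vector search is driven by two plain documents: a vector index definition and a `$vectorSearch` aggregation stage. The sketch below builds both as Python dictionaries. The stage and field names follow MongoDB's documented vector search syntax, but the embedding field name (`content_embedding`), the dimension count, and the `numCandidates` heuristic are assumptions for illustration; mdb_toolkit's internal choices may differ.

```python
from typing import Any, Dict, List, Sequence

ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def vector_index_definition(path: str, num_dimensions: int, similarity: str = "cosine") -> Dict[str, Any]:
    """Build the definition document for a vector search index on one field."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def vector_search_pipeline(
    index_name: str, path: str, query_vector: Sequence[float], limit: int = 3
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that returns the top `limit` matches with their scores."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                # Assumed heuristic: consider more candidates than the final limit.
                "numCandidates": max(limit * 10, 100),
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "name": 1,
                "content": 1,
                "meta_data": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


definition = vector_index_definition("content_embedding", 1536)  # hypothetical field name and size
pipeline = vector_search_pipeline("vs_1", "content_embedding", [0.0] * 1536)
print(definition["fields"][0]["similarity"], pipeline[0]["$vectorSearch"]["limit"])
```

A pipeline like this would be passed to a collection's `aggregate` method once the index exists and is queryable.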

## Features

- **Vector-Based Search**: Perform semantic searches using OpenAI embeddings.
- **Keyword Search**: Execute traditional text-based searches with regular expressions.
- **Hybrid Search**: Combine semantic relevance with keyword filtering for precise results.
- **Easy Integration**: Simple setup with MongoDB and OpenAI APIs.
- **Comprehensive Logging**: Detailed logs for monitoring and debugging.
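
The keyword and hybrid features can be pictured with a small in-memory sketch: a case-insensitive regular-expression match over the `content` field, and a hybrid step that keeps only the keyword matches and then orders them by a precomputed similarity score. This illustrates the idea under those assumptions and is not the library's implementation. `keyword_filter` and `hybrid_rank` are hypothetical names, and the scores are made-up numbers.

```python
import re
from typing import Any, Dict, List

Doc = Dict[str, Any]


def keyword_filter(docs: List[Doc], keyword: str, field: str = "content") -> List[Doc]:
    """Keep documents whose `field` contains `keyword`, ignoring case."""
    # re.escape treats the keyword as literal text rather than as a pattern.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [doc for doc in docs if pattern.search(str(doc.get(field, "")))]


def hybrid_rank(docs: List[Doc], keyword: str, limit: int = 3) -> List[Doc]:
    """Filter by keyword first, then order the survivors by similarity score."""
    matches = keyword_filter(docs, keyword)
    return sorted(matches, key=lambda doc: doc.get("score", 0.0), reverse=True)[:limit]


# Sample documents with invented similarity scores attached.
sample_docs = [
    {"name": "Document 3", "content": "Python is a versatile programming language.", "score": 0.41},
    {"name": "Document 4", "content": "Artificial intelligence and machine learning are transforming industries.", "score": 0.88},
    {"name": "Document 5", "content": "OpenAI's ChatGPT is a language model for generating human-like text.", "score": 0.63},
]

print([doc["name"] for doc in keyword_filter(sample_docs, "python")])
print([doc["name"] for doc in hybrid_rank(sample_docs, "language")])
```

Note that the highest-scoring document overall ("Document 4") is excluded from the hybrid result because it does not contain the keyword; that is the trade-off keyword filtering makes in exchange for precision.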

## Installation

Install `mdb_toolkit` using `pip`:

```bash
pip install mdb-toolkit
```

*Requires Python 3.7 or higher.*

## Example Usage

Here's a sample script demonstrating how to use `mdb_toolkit` to create a search index, insert documents, and perform various search operations.

```python
import logging
import openai
from typing import List

# Load .env file
from dotenv import load_dotenv
load_dotenv()

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

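# Embedding function passed to CustomMongoClient.
# Uses the openai>=1.0 client API; the legacy `openai.Embedding.create` call was removed in 1.0.
# The client reads OPENAI_API_KEY from the environment (loaded from .env above).
openai_client = openai.OpenAI()

def get_embedding(text: str, model: str = "text-embedding-ada-002", dimensions: int = 256) -> List[float]:
    # NOTE: `dimensions` is part of the signature but is not sent to the API here.
    text = text.replace("\n", " ")
    try:
        response = openai_client.embeddings.create(
            input=[text],
            model=model
        )
        return response.data[0].embedding
    except Exception as e:
        logger.error(f"Error generating embedding: {str(e)}")
        raise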

# Example usage
from mdb_toolkit import CustomMongoClient
print("mdb_toolkit package imported successfully")

# Define database and collection names
database_name = "test_database"
collection_name = "test_collection"
index_name = "vs_1"  # Ensure this matches your intended index name
distance_metric = "cosine"

client = CustomMongoClient(
    "mongodb://localhost:27017/?directConnection=true&serverSelectionTimeoutMS=2000",
    get_embedding=get_embedding
)

# Create the search index
client._create_search_index(
    database_name=database_name,
    collection_name=collection_name,
    index_name=index_name,
    distance_metric=distance_metric,
)

# Wait for the search index to be READY
logger.info("Waiting for the search index to be READY...")
index_ready = client.wait_for_index_ready(
    database_name=database_name,
    collection_name=collection_name,
    index_name=index_name,
    max_attempts=10,
    wait_seconds=1
)

if index_ready:
    logger.info(f"Search index '{index_name}' is now READY and available!")
    print("Index is ready!")
else:
    logger.error("Index creation process exceeded wait limit or failed.")
    print("Index creation process exceeded wait limit.")
    raise SystemExit(1)  # stop with a non-zero status; no extra import needed

# Insert documents
documents = [
    {
        "name": "Document 1",
        "content": "OpenAI develops artificial intelligence technologies.",
        "meta_data": {"category": "AI", "tags": ["openai", "ai", "technology"]},
    },
    {
        "name": "Document 2",
        "content": "MongoDB is a popular NoSQL database.",
        "meta_data": {"category": "Database", "tags": ["mongodb", "nosql", "database"]},
    },
    {
        "name": "Document 3",
        "content": "Python is a versatile programming language.",
        "meta_data": {"category": "Programming", "tags": ["python", "programming", "language"]},
    },
    {
        "name": "Document 4",
        "content": "Artificial intelligence and machine learning are transforming industries.",
        "meta_data": {"category": "AI", "tags": ["ai", "machine learning", "transformation"]},
    },
    {
        "name": "Document 5",
        "content": "OpenAI's ChatGPT is a language model for generating human-like text.",
        "meta_data": {"category": "AI", "tags": ["openai", "chatgpt", "language model"]},
    },
]

fields_to_embed = ["content"]  # Specify which fields to generate embeddings for

client.insert_documents(
    database_name=database_name,
    collection_name=collection_name,
    documents=documents,
    fields_to_embed=fields_to_embed,
)

# Perform searches
# 1. Vector-Based Search
vector_query = "Tell me about artificial intelligence advancements."
logger.info(f"Performing vector-based search with query: '{vector_query}'")
vector_results = client.vector_search(
    query=vector_query,
    limit=3,
    database_name=database_name,
    collection_name=collection_name,
    index_name=index_name
)
print("\n--- Vector-Based Search Results ---")
for doc in vector_results:
    print(f"Name: {doc.get('name')}\nContent: {doc.get('content')}\nMeta Data: {doc.get('meta_data')}\nScore: {doc.get('score')}\n")

# 2. Keyword Search
keyword_query = "Python"
logger.info(f"Performing keyword search with query: '{keyword_query}'")
keyword_results = client.keyword_search(
    query=keyword_query,
    limit=3,
    database_name=database_name,
    collection_name=collection_name
)
print("\n--- Keyword Search Results ---")
for doc in keyword_results:
    print(f"Name: {doc.get('name')}\nContent: {doc.get('content')}\nMeta Data: {doc.get('meta_data')}\n")

# 3. Hybrid Search
hybrid_vector_query = "Advancements in machine learning."
hybrid_keyword = "transforming"
logger.info(f"Performing hybrid search with vector query: '{hybrid_vector_query}' and keyword: '{hybrid_keyword}'")
hybrid_results = client.hybrid_search(
    query=hybrid_vector_query,
    keyword=hybrid_keyword,
    limit=3,
    database_name=database_name,
    collection_name=collection_name,
    index_name=index_name
)
print("\n--- Hybrid Search Results ---")
for doc in hybrid_results:
    print(f"Name: {doc.get('name')}\nContent: {doc.get('content')}\nMeta Data: {doc.get('meta_data')}\nScore: {doc.get('score')}\n")
```
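
The `wait_for_index_ready` call in the script takes `max_attempts` and `wait_seconds`, which suggests a bounded polling loop. A generic version of that pattern is sketched below. `wait_until` and its `check` callback are hypothetical and not taken from mdb_toolkit's source, and the `sleep` parameter exists only so the loop can be tested without real delays.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    max_attempts: int = 10,
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `max_attempts` times, pausing between attempts.

    Returns True as soon as `check` succeeds, or False once the attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return True
        if attempt < max_attempts:
            # No need to sleep after the final failed attempt.
            sleep(wait_seconds)
    return False
```

Bounding the number of attempts keeps a script from hanging indefinitely when an index never becomes ready, which is why the example exits if the wait fails.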

## License

This project is licensed under the [MIT License](LICENSE).

## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any enhancements or bug fixes.

## Support

If you encounter any issues or have questions, please open an issue on the [GitHub repository](https://github.com/ranfysvalle02/mdb_toolkit).

---

*Happy Coding!*

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ranfysvalle02/mdb_toolkit",
    "name": "mdb-toolkit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "mongodb, vector search, embeddings, pymongo, custom client",
    "author": "Fabian Valle",
    "author_email": "Fabian Valle <oblivio.company@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/a7/25/8a2d4bd1be857cf197ca2a8ff4733da54475e5dc0cebabef0d6f726f649b/mdb_toolkit-0.9.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Custom MongoDB client with vector search capabilities, embeddings management, and more.",
    "version": "0.9.0",
    "project_urls": {
        "Homepage": "https://github.com/ranfysvalle02/mdb_toolkit",
        "issue_tracker": "https://github.com/ranfysvalle02/mdb_toolkit/issues"
    },
    "split_keywords": [
        "mongodb",
        " vector search",
        " embeddings",
        " pymongo",
        " custom client"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b314265838463278e320521c5cc93cfed867cac8dcd5ff78444b7a7654f55f35",
                "md5": "974b864a82bc6d29d6b9ed219b4e78a4",
                "sha256": "f425541f77fb88a6d72574e96b6edc86375be4fa0c31ca812a950b30fd94809c"
            },
            "downloads": -1,
            "filename": "mdb_toolkit-0.9.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "974b864a82bc6d29d6b9ed219b4e78a4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 25854,
            "upload_time": "2025-01-26T17:50:51",
            "upload_time_iso_8601": "2025-01-26T17:50:51.532268Z",
            "url": "https://files.pythonhosted.org/packages/b3/14/265838463278e320521c5cc93cfed867cac8dcd5ff78444b7a7654f55f35/mdb_toolkit-0.9.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a7258a2d4bd1be857cf197ca2a8ff4733da54475e5dc0cebabef0d6f726f649b",
                "md5": "560c665bc9242adb615e16efa13e6bed",
                "sha256": "0f9ec4a4a3859c62b15ab8fc9866392878be8f39cd2eab1e809d91b3c19db64f"
            },
            "downloads": -1,
            "filename": "mdb_toolkit-0.9.0.tar.gz",
            "has_sig": false,
            "md5_digest": "560c665bc9242adb615e16efa13e6bed",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 15232,
            "upload_time": "2025-01-26T17:50:53",
            "upload_time_iso_8601": "2025-01-26T17:50:53.937996Z",
            "url": "https://files.pythonhosted.org/packages/a7/25/8a2d4bd1be857cf197ca2a8ff4733da54475e5dc0cebabef0d6f726f649b/mdb_toolkit-0.9.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-26 17:50:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ranfysvalle02",
    "github_project": "mdb_toolkit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "mdb-toolkit"
}
        