DenSpa


- **Name**: DenSpa
- **Version**: 0.0.3
- **Summary**: DenSpa is an open-source package designed for hybrid search, enabling seamless integration into RAG frameworks.
- **Upload time**: 2025-03-26 16:41:47
- **Author**: Ivan Kankeu
- **Requires Python**: >=3.10
- **Keywords**: hybrid search, semantic search, keyword search, vector search, faiss bm25, langchain
- **Requirements**: langchain, faiss-cpu, nltk, scikit-learn, numpy, scipy, spacy

# DenSpa

**DenSpa** is an open-source package designed for hybrid search, enabling seamless integration into Retrieval-Augmented Generation (RAG) frameworks. The package combines **dense** and **sparse vector embeddings** to perform efficient searches on document corpora.  

- **Dense-vector-based search** leverages the FAISS vector database to manage and query collections.  
- **Sparse-vector-based search** utilizes a custom implementation of BM25, enhanced with pre-processing techniques like stemming to optimize the index.  
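
To illustrate why stemming helps a keyword index, here is a minimal sketch that uses only the standard library. The `toy_stem` and `preprocess` helpers are hypothetical and are not part of DenSpa; they only show how mapping inflected forms to a shared stem shrinks the vocabulary the index has to store. A real pipeline would use a proper stemmer, for example one of NLTK's Snowball stemmers.

```python
import re

# Toy suffix stripper for illustration only; real stemmers apply far more
# careful, language-specific rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess(text: str) -> list[str]:
    """Tokenize, then stem every token."""
    return [toy_stem(token) for token in tokenize(text)]


docs = ["Indexing documents", "The document was indexed", "Index the documents"]
raw_vocab = {token for doc in docs for token in tokenize(doc)}
stemmed_vocab = {token for doc in docs for token in preprocess(doc)}

print(len(raw_vocab), len(stemmed_vocab))  # 7 4
print(sorted(stemmed_vocab))  # ['document', 'index', 'the', 'was']
```

After stemming, a query for "indexing" matches all three documents, because every variant of the word is stored under the same term.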

## Installation

Install the package from PyPI with pip, or clone the [repository](https://github.com/Kankeu/denspa) and install its dependencies:  
```bash
pip install DenSpa
```

## Quick Start

### Initializing the Vector Search Engine
Create a `VectorSearch` instance by passing the index folder, an index name, an embedding function, and the BM25 options:  

```python
import os

from denspa import VectorSearch
from langchain.embeddings import HuggingFaceEmbeddings

# Embedding model used for the dense (FAISS) index.
embedding_function = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Create the index folder if it does not exist yet.
INDEX_PATH = "database/index"
os.makedirs(INDEX_PATH, exist_ok=True)

vecsea = VectorSearch(
    folder_path=INDEX_PATH,
    index_name="denspa",
    embedding_function=embedding_function,
    # In standard BM25, k1 controls term-frequency saturation and b controls
    # document-length normalization (b = 0 turns it off).
    bm25_options={"k1": 1.25, "b": 0}
)
```
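
The `k1` and `b` keys in `bm25_options` presumably correspond to the two parameters of the textbook Okapi BM25 formula. The sketch below is a plain-Python version of that formula, included to show what the two values control. It is not DenSpa's custom implementation, which may differ in details such as the IDF variant, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.25, b=0.0):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # With b = 0 this factor is exactly 1, so document length is ignored.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps how much repeated occurrences of a term can add.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["quantum", "mechanics"],
    ["quantum", "mechanics", "lecture", "notes", "for", "physics", "students"],
    ["classical", "music"],
]
query = ["quantum"]

no_norm = bm25_scores(query, docs, k1=1.25, b=0.0)
with_norm = bm25_scores(query, docs, k1=1.25, b=0.75)
print(no_norm)
print(with_norm)
```

With `b=0` the short and the long document that both mention "quantum" once receive the same score; with `b=0.75` the shorter document ranks higher.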

### Indexing Documents
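DenSpa supports indexing documents in both **English** and **German**. Pass the document language as `lang` and choose the target index with `engine`; the example below adds the same documents to both the FAISS and the BM25 index and then saves them: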

```python
from langchain.docstore.document import Document

documents = [Document(page_content="There are many variations...", metadata={"source": "lecture.pdf"})]

# Indexing with FAISS
vecsea.add_documents(documents, lang="en", engine="faiss")

# Indexing with BM25
vecsea.add_documents(documents, lang="en", engine="bm25")

# Save the index locally
vecsea.save_local()
```

### Deleting Indexed Documents
To remove documents from the index by their metadata, use the `removeByMetadata` method, then call `save_local` to persist the change:

```python
vecsea.removeByMetadata({"source": "lecture.pdf"})
vecsea.save_local()
```

### Deleting Indexes
To remove the saved indexes, use the `delete_local` method:

```python
vecsea.delete_local()
```

## Search Methods

DenSpa currently supports **three search methods**:  

1. **FAISS**: Semantic search that uses dense vectors for similarity.  
2. **BM25**: Keyword-based search leveraging sparse vectors.  
3. **Hybrid Search**: A cascade method combining FAISS and BM25. Hybrid search first retrieves the top `3*k` candidates with FAISS (high recall), then applies BM25 to those candidates to select the final top-k results (higher precision).  

Example usage, where the `method` argument selects the search method (BM25 here):  
```python
results = vecsea.similarity_search_with_score(
    query="Quantum mechanics",
    k=3,
    method="bm25",
    lang="en"
)
```
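
The hybrid cascade described above can be sketched in a few lines of plain Python. This is an illustration of the two-stage idea only, not DenSpa's code: `cascade_search`, its parameters, and the toy scores are all hypothetical, and the sketch assumes that a higher dense score means a closer match (a distance-based index would sort the other way).

```python
from typing import Callable, Hashable


def cascade_search(
    k: int,
    dense_scores: dict[Hashable, float],
    sparse_score: Callable[[Hashable], float],
    factor: int = 3,
) -> list[tuple[Hashable, float]]:
    """Return the top-k documents after a dense-then-sparse cascade."""
    # Stage 1 (recall): keep the factor * k best candidates by dense score.
    candidates = sorted(dense_scores, key=dense_scores.get, reverse=True)[: factor * k]
    # Stage 2 (precision): re-score only those candidates with the sparse scorer.
    rescored = [(doc_id, sparse_score(doc_id)) for doc_id in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k]


# Toy scores standing in for FAISS similarities and BM25 scores.
dense = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
sparse = {"a": 0.2, "b": 1.5, "c": 0.0, "d": 9.0}

top = cascade_search(k=1, dense_scores=dense, sparse_score=sparse.get)
print(top)  # [('b', 1.5)]
```

Document `d` has the highest sparse score but never reaches the second stage, because it is not among the `3*k` dense candidates. This is the trade-off of a cascade: the sparse stage can only reorder what the dense stage retrieved.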

## Features
- **Dense and Sparse Search**: Utilize semantic embeddings and keyword-based indexing for versatile search capabilities.  
- **Hybrid Search Strategy**: Combine the strengths of both FAISS and BM25 for balanced recall and precision.  
- **Customizable**: Easily configure embeddings, BM25 parameters, and storage paths.  
- **Language Support**: Works with English and German document corpora.  

## Contributions
Contributions are welcome! Please feel free to open an issue or submit a pull request if you have suggestions or improvements.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "DenSpa",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "hybrid search, semantic search, keyword search, vector search, FAISS BM25, LangChain",
    "author": "Ivan Kankeu",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/52/b1/6c67864c8bc5b56a15718c320727e90ffcf7bc4d05c9d3cb28c99979666e/denspa-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "DenSpa is an open-source package designed for hybrid search, enabling seamless integration into RAG frameworks.",
    "version": "0.0.3",
    "project_urls": {
        "Homepage": "https://github.com/Kankeu/denspa",
        "Issues": "https://github.com/Kankeu/denspa/issues"
    },
    "split_keywords": [
        "hybrid search",
        " semantic search",
        " keyword search",
        " vector search",
        " faiss bm25",
        " langchain"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d9b0dad7ab67df467855bce74180d7efb23c590c1bc31d35e968e076902b78d8",
                "md5": "27ab3a59bd4be03544d85ca0322e96d3",
                "sha256": "fd34f4922bfbbf5f7579bb81b0187bdde72f4c9640e79cf4a0e15e1371a7f35a"
            },
            "downloads": -1,
            "filename": "denspa-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "27ab3a59bd4be03544d85ca0322e96d3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 10903,
            "upload_time": "2025-03-26T16:41:46",
            "upload_time_iso_8601": "2025-03-26T16:41:46.219264Z",
            "url": "https://files.pythonhosted.org/packages/d9/b0/dad7ab67df467855bce74180d7efb23c590c1bc31d35e968e076902b78d8/denspa-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "52b16c67864c8bc5b56a15718c320727e90ffcf7bc4d05c9d3cb28c99979666e",
                "md5": "714a3f4676c9768c0d4ac77712284df9",
                "sha256": "69945a34437e46164c87d3a96fc5b6eb022131ab8013a4a0b2ea4c0d54b73f24"
            },
            "downloads": -1,
            "filename": "denspa-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "714a3f4676c9768c0d4ac77712284df9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 8438,
            "upload_time": "2025-03-26T16:41:47",
            "upload_time_iso_8601": "2025-03-26T16:41:47.649581Z",
            "url": "https://files.pythonhosted.org/packages/52/b1/6c67864c8bc5b56a15718c320727e90ffcf7bc4d05c9d3cb28c99979666e/denspa-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-26 16:41:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Kankeu",
    "github_project": "denspa",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "langchain",
            "specs": [
                [
                    ">=",
                    "0.3.16"
                ]
            ]
        },
        {
            "name": "faiss-cpu",
            "specs": [
                [
                    ">=",
                    "1.10.0"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.9.1"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.6.1"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "<=",
                    "1.26.4"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.15.1"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    ">=",
                    "3.5.2"
                ]
            ]
        }
    ],
    "lcname": "denspa"
}
        