            [![PyPI](https://img.shields.io/pypi/v/voyage-embedders-haystack)](https://pypi.org/project/voyage-embedders-haystack/) 
![PyPI - Downloads](https://img.shields.io/pypi/dm/voyage-embedders-haystack?color=blue&logo=pypi&logoColor=gold) 
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/voyage-embedders-haystack?logo=python&logoColor=gold) 
[![GitHub](https://img.shields.io/github/license/awinml/voyage-embedders-haystack?color=green)](LICENSE) 
[![Actions status](https://github.com/awinml/voyage-embedders-haystack/workflows/Test/badge.svg)](https://github.com/awinml/voyage-embedders-haystack/actions)
[![Coverage Status](https://coveralls.io/repos/github/awinml/voyage-embedders-haystack/badge.svg?branch=main)](https://coveralls.io/github/awinml/voyage-embedders-haystack?branch=main)

[![Types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy) 
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Code Style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) 



<h1 align="center"> <a href="https://github.com/awinml/voyage-embedders-haystack"> Voyage Embedders - Haystack </a> </h1>

Custom components for [Haystack](https://github.com/deepset-ai/haystack) (2.x) that create embeddings using the [VoyageAI Embedding Models](https://voyageai.com/).

Voyage’s embedding models, `voyage-2` and `voyage-code-2`, are state-of-the-art in retrieval accuracy. These models outperform other top-performing embedding models like `intfloat/e5-mistral-7b-instruct` and `OpenAI/text-embedding-3-large` on the [MTEB Benchmark](https://github.com/embeddings-benchmark/mteb). `voyage-2` is currently ranked second on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


#### What's New

- **[v1.2.0 - 02/02/24]:**
    - **Breaking Change:** `VoyageDocumentEmbedder` and `VoyageTextEmbedder` now accept the `model` parameter instead of `model_name`.
    - The embedders now use the `voyageai.Client.embed()` method instead of the deprecated global `get_embedding` and `get_embeddings` functions.
    - Support for the new `truncate` parameter has been added.
    - The default embedding model has been changed from the deprecated `voyage-01` to `voyage-2`.
    - The embedders now report the total number of tokens used under the `"total_tokens"` key in the metadata (see the usage sketch after this list).

- **[v1.1.0 - 13/12/23]:** Added support for the `input_type` parameter in `VoyageTextEmbedder` and `VoyageDocumentEmbedder`.

- **[v1.0.0 - 21/11/23]:** Added `VoyageTextEmbedder` and `VoyageDocumentEmbedder` to embed strings and documents.
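
As a reference for the v1.2.0 changes above, here is a minimal usage sketch. The boolean value for `truncate` and the output keys (`embedding`, `meta`) are assumptions based on the usual Haystack embedder conventions, so check the component signatures for the exact details:

```python
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder

# `model` replaces the old `model_name` argument; `truncate` is assumed to take
# a boolean that controls whether over-length inputs are truncated by the API.
text_embedder = VoyageTextEmbedder(
    model="voyage-2",
    input_type="query",
    truncate=True,
)

result = text_embedder.run(text="What is the capital of France?")
print(result["embedding"][:5])         # first few dimensions of the query embedding
print(result["meta"]["total_tokens"])  # token usage reported in the metadata
```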


## Installation

```bash
pip install voyage-embedders-haystack
```

## Usage

You can use Voyage Embedding models with two components: [VoyageTextEmbedder](https://github.com/awinml/voyage-embedders-haystack/blob/main/src/voyage_embedders/voyage_text_embedder.py) and [VoyageDocumentEmbedder](https://github.com/awinml/voyage-embedders-haystack/blob/main/src/voyage_embedders/voyage_document_embedder.py).

To create semantic embeddings for documents, use `VoyageDocumentEmbedder` in your indexing pipeline. For generating embeddings for queries, use `VoyageTextEmbedder`. Once you've selected the component that suits your use case, initialize it with the model name and your VoyageAI API key. You can also set the `VOYAGE_API_KEY` environment variable instead of passing the API key as an argument.
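
As a rough sketch of both options (the explicit-key argument is assumed to be named `api_key`; check the component signature):

```python
import os

from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder

# Option 1: let the embedders pick up the key from the environment.
os.environ["VOYAGE_API_KEY"] = "your-voyage-api-key"
doc_embedder = VoyageDocumentEmbedder(model="voyage-2", input_type="document")

# Option 2: pass the key explicitly (argument name assumed to be `api_key`).
text_embedder = VoyageTextEmbedder(
    model="voyage-2",
    input_type="query",
    api_key="your-voyage-api-key",
)
```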

Information about the supported models can be found in the [Embeddings Documentation](https://docs.voyageai.com/embeddings/).

To get an API key, please see the [Voyage AI website](https://www.voyageai.com/).


## Example

Below is an example semantic search pipeline that uses the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from HuggingFace. You can find more examples in the [`examples`](https://github.com/awinml/voyage-embedders-haystack/tree/main/examples) folder.

Load the dataset:

```python
# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import Voyage Embedders
from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `VoyageDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = VoyageDocumentEmbedder(
    model="voyage-2",
    input_type="document",
)

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

print(f"Number of documents in Document Store: {len(doc_store.filter_documents())}")
print(f"First Document: {doc_store.filter_documents()[0]}")
print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}")
```

Query the Semantic Search Pipeline using the `InMemoryEmbeddingRetriever` and `VoyageTextEmbedder`:
```python
text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")

# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component("TextEmbedder", text_embedder)
query_pipeline.add_component("Retriever", InMemoryEmbeddingRetriever(document_store=doc_store))
query_pipeline.connect("TextEmbedder", "Retriever")


# Search
results = query_pipeline.run({"TextEmbedder": {"text": "Which year did the Joker movie release?"}})

# Print text from top result
top_result = results["Retriever"]["documents"][0].content
print("The top search result is:")
print(top_result)
```
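
To inspect more than the top hit, you can loop over the retrieved documents. This sketch assumes the retriever populates each `Document`'s `score` field and that the indexing step stored a `"title"` entry in `meta`, as in the example above:

```python
# Print the title and similarity score of every retrieved document.
for doc in results["Retriever"]["documents"]:
    print(f"{doc.meta.get('title', 'Untitled')}: {doc.score:.4f}")
```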

## Contributing

Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.

## Author

[Ashwin Mathur](https://github.com/awinml)

## License

`voyage-embedders-haystack` is distributed under the terms of the [Apache-2.0 license](https://github.com/awinml/voyage-embedders-haystack/blob/main/LICENSE).

            
