| Field | Value |
| --- | --- |
| Name | voyage-embedders-haystack |
| Version | 1.2.0 |
| Summary | Haystack 2.x component to embed strings and Documents using VoyageAI Embedding models. |
| Upload time | 2024-02-02 11:14:18 |
| Author | Ashwin Mathur |
| Maintainer | Ashwin Mathur |
| Home page | None |
| Docs URL | None |
| Requires Python | >=3.8 |
| License | None |
| Keywords | haystack, voyageai |
| Requirements | No requirements were recorded. |
[![PyPI](https://img.shields.io/pypi/v/voyage-embedders-haystack)](https://pypi.org/project/voyage-embedders-haystack/)
![PyPI - Downloads](https://img.shields.io/pypi/dm/voyage-embedders-haystack?color=blue&logo=pypi&logoColor=gold)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/voyage-embedders-haystack?logo=python&logoColor=gold)
[![GitHub](https://img.shields.io/github/license/awinml/voyage-embedders-haystack?color=green)](LICENSE)
[![Actions status](https://github.com/awinml/voyage-embedders-haystack/workflows/Test/badge.svg)](https://github.com/awinml/voyage-embedders-haystack/actions)
[![Coverage Status](https://coveralls.io/repos/github/awinml/voyage-embedders-haystack/badge.svg?branch=main)](https://coveralls.io/github/awinml/voyage-embedders-haystack?branch=main)
[![Types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Code Style - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
<h1 align="center"> <a href="https://github.com/awinml/voyage-embedders-haystack"> Voyage Embedders - Haystack </a> </h1>
Custom component for [Haystack](https://github.com/deepset-ai/haystack) (2.x) for creating embeddings using the [VoyageAI Embedding Models](https://voyageai.com/).
Voyage’s embedding models, `voyage-2` and `voyage-code-2`, are state-of-the-art in retrieval accuracy. These models outperform other top-performing embedding models such as `intfloat/e5-mistral-7b-instruct` and `OpenAI/text-embedding-3-large` on the [MTEB Benchmark](https://github.com/embeddings-benchmark/mteb). `voyage-2` is currently ranked second on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
#### What's New
- **[v1.2.0 - 02/02/24]:**
- **Breaking Change:** `VoyageDocumentEmbedder` and `VoyageTextEmbedder` now accept the `model` parameter instead of `model_name`.
  - The embedders now use the new `voyageai.Client.embed()` method instead of the deprecated `get_embedding` and `get_embeddings` functions from the global namespace.
- Support for the new `truncate` parameter has been added.
- Default embedding model has been changed to "voyage-2" from the deprecated "voyage-01".
  - The embedders now report the total number of tokens used as `"total_tokens"` in the output metadata (see the sketch after this list).
- **[v1.1.0 - 13/12/23]:** Added support for the `input_type` parameter in `VoyageTextEmbedder` and `VoyageDocumentEmbedder`.
- **[v1.0.0 - 21/11/23]:** Added `VoyageTextEmbedder` and `VoyageDocumentEmbedder` to embed strings and documents.
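
As a quick illustration of the v1.2.0 interface, here is a minimal sketch that uses the renamed `model` parameter, the new `truncate` option, and reads `"total_tokens"` from the returned metadata. The value passed to `truncate` and the exact shape of the output dictionary are assumptions based on the changelog above and Haystack 2.x embedder conventions, so check the component source for the authoritative API.

```python
# Sketch of the v1.2.0 interface described above. Requires the VOYAGE_API_KEY
# environment variable (or an API key argument) to be set.
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder

text_embedder = VoyageTextEmbedder(
    model="voyage-2",    # renamed from `model_name` in v1.2.0
    input_type="query",
    truncate=True,       # new in v1.2.0: truncate over-long inputs (assumed boolean)
)

result = text_embedder.run(text="What does the voyage-2 model do?")
print(len(result["embedding"]))         # embedding dimensionality
print(result["meta"]["total_tokens"])   # token usage reported in the metadata
```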
## Installation
```bash
pip install voyage-embedders-haystack
```
## Usage
You can use Voyage Embedding models with two components: [VoyageTextEmbedder](https://github.com/awinml/voyage-embedders-haystack/blob/main/src/voyage_embedders/voyage_text_embedder.py) and [VoyageDocumentEmbedder](https://github.com/awinml/voyage-embedders-haystack/blob/main/src/voyage_embedders/voyage_document_embedder.py).
To create semantic embeddings for documents, use `VoyageDocumentEmbedder` in your indexing pipeline. For generating embeddings for queries, use `VoyageTextEmbedder`. Once you've selected the component that suits your use case, initialize it with the model name and your VoyageAI API key. You can also set the environment variable `VOYAGE_API_KEY` instead of passing the API key as an argument.

Information about the supported models can be found in the [Embeddings Documentation](https://docs.voyageai.com/embeddings/).

To get an API key, please see the [Voyage AI website](https://www.voyageai.com/).
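
For instance, a minimal setup might look like the sketch below. The environment-variable route is the one described above; the explicit `api_key` keyword shown in the comment is an assumed parameter name and should be checked against the component's constructor.

```python
import os

from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder

# Option 1: provide the key through the environment, as described above.
os.environ["VOYAGE_API_KEY"] = "your-voyage-api-key"
text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")

# Option 2 (assumed parameter name): pass the key directly to the constructor.
# text_embedder = VoyageTextEmbedder(model="voyage-2", api_key="your-voyage-api-key")
```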
## Example
Below is an example semantic search pipeline that uses the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face. You can find more examples in the [`examples`](https://github.com/awinml/voyage-embedders-haystack/tree/main/examples) folder.
Load the dataset:
```python
# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Import Voyage Embedders
from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder
# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```
Index the documents to the `InMemoryDocumentStore` using the `VoyageDocumentEmbedder` and `DocumentWriter`:
```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = VoyageDocumentEmbedder(
    model="voyage-2",
    input_type="document",
)
# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect(connect_from="DocEmbedder", connect_to="DocWriter")
indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
print(f"Number of documents in Document Store: {len(doc_store.filter_documents())}")
print(f"First Document: {doc_store.filter_documents()[0]}")
print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}")
```
Query the Semantic Search Pipeline using the `InMemoryEmbeddingRetriever` and `VoyageTextEmbedder`:
```python
text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")
# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component("TextEmbedder", text_embedder)
query_pipeline.add_component("Retriever", InMemoryEmbeddingRetriever(document_store=doc_store))
query_pipeline.connect("TextEmbedder", "Retriever")
# Search
results = query_pipeline.run({"TextEmbedder": {"text": "Which year did the Joker movie release?"}})
# Print text from top result
top_result = results["Retriever"]["documents"][0].content
print("The top search result is:")
print(top_result)
```
## Contributing
Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.
## Author
[Ashwin Mathur](https://github.com/awinml)
## License
`voyage-embedders-haystack` is distributed under the terms of the [Apache-2.0 license](https://github.com/awinml/voyage-embedders-haystack/blob/main/LICENSE).