| Field | Value |
| --- | --- |
| Name | llama-cpp-haystack |
| Version | 0.4.3 |
| Summary | An integration between the llama.cpp LLM framework and Haystack |
| Author | Ashwin Mathur |
| Requires Python | >=3.8 |
| Upload time | 2024-12-19 09:09:20 |
| Requirements | hatch |
# llama-cpp-haystack
[![PyPI - Version](https://img.shields.io/pypi/v/llama-cpp-haystack.svg)](https://pypi.org/project/llama-cpp-haystack)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/llama-cpp-haystack.svg)](https://pypi.org/project/llama-cpp-haystack)
-----
Custom component for [Haystack](https://github.com/deepset-ai/haystack) (2.x) for running LLMs using the [Llama.cpp](https://github.com/ggerganov/llama.cpp) LLM framework. This implementation leverages the [Python Bindings for llama.cpp](https://github.com/abetlen/llama-cpp-python).
**Table of Contents**
- [Installation](#installation)
- [Usage](#usage)
- [Example](#example)
- [License](#license)
## Installation
```bash
pip install llama-cpp-haystack
```
The default install behaviour is to build `llama.cpp` for CPU only on Linux and Windows, and to use Metal on macOS.
To install with a different backend, first install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) following the instructions in its [installation documentation](https://github.com/abetlen/llama-cpp-python#installation), and then install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp).
For example, to use `llama-cpp-haystack` with the cuBLAS backend:
```bash
export LLAMA_CUBLAS=1
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```
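If you want to confirm which `llama-cpp-python` build ended up installed, a quick sanity check (not part of the official instructions) is to import the bindings directly:

```python
# Verify that the llama-cpp-python bindings built with your chosen backend import cleanly.
import llama_cpp

print(llama_cpp.__version__)
```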
## Usage
You can use the `LlamaCppGenerator` to load models quantized with llama.cpp (GGUF format) for text generation.
Information about the supported models and model parameters can be found in the llama.cpp [documentation](https://llama-cpp-python.readthedocs.io/en/latest).
GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
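For instance, a GGUF file can be fetched programmatically with the `huggingface_hub` library (a sketch; the repository and filename below are only examples, not requirements of this integration):

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Example repo/filename; any GGUF model on the Hugging Face Hub works.
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
)
print(model_path)  # local path that can be passed to LlamaCppGenerator
```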
### Passing additional model parameters
The `model_path`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator as keyword arguments during initialization.
The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override the `model_path`, `n_ctx`, and `n_batch` init parameters.
See Llama.cpp's [model documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.
For example, to offload the model to GPU during initialization:
```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
generator = LlamaCppGenerator(
model_path="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
model_kwargs={"n_gpu_layers": -1}
)
generator.warm_up()
input = "Who is the best American actor?"
prompt = f"GPT4 Correct User: {input} <|end_of_turn|> GPT4 Correct Assistant:"
result = generator.run(prompt, generation_kwargs={"max_tokens": 128})
generated_text = result["replies"][0]
print(generated_text)
```
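As a minimal sketch of the precedence rule described above (the values are illustrative), a key that appears both as an init argument and inside `model_kwargs` takes its value from `model_kwargs`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# n_ctx is passed twice; the value from model_kwargs (2048) is the one used.
generator = LlamaCppGenerator(
    model_path="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    model_kwargs={"n_ctx": 2048, "n_gpu_layers": -1},
)
```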
### Passing generation parameters
The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.
See Llama.cpp's [`create_completion` documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) for more information on the available generation arguments.
For example, to set the `max_tokens` and `temperature`:
```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
generator = LlamaCppGenerator(
model_path="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
input = "Who is the best American actor?"
prompt = f"GPT4 Correct User: {input} <|end_of_turn|> GPT4 Correct Assistant:"
result = generator.run(prompt)
generated_text = result["replies"][0]
print(generated_text)
```
The `generation_kwargs` can also be passed to the `run` method of the generator directly:
```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
generator = LlamaCppGenerator(
model_path="/content/openchat-3.5-1210.Q3_K_S.gguf",
n_ctx=512,
n_batch=128,
)
generator.warm_up()
input = "Who is the best American actor?"
prompt = f"GPT4 Correct User: {input} <|end_of_turn|> GPT4 Correct Assistant:"
result = generator.run(
prompt,
generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generated_text = result["replies"][0]
print(generated_text)
```
## Example
Below is an example Retrieval-Augmented Generation (RAG) pipeline that uses the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face. You can find more examples in the [`examples`](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp/examples) folder.
Load the dataset:
```python
# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Import LlamaCppGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator
# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
docs = [
Document(
content=doc["text"],
meta={
"title": doc["title"],
"url": doc["url"],
},
)
for doc in dataset
]
```
Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:
```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")
indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```
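As a quick sanity check (not part of the original example), you can confirm that all 100 documents were written to the store:

```python
# Each dataset row became one Document, so this should print 100.
print(doc_store.count_documents())
```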
Create the Retrieval Augmented Generation (RAG) pipeline and add the `LlamaCppGenerator` to it:
```python
# Prompt Template for the https://huggingface.co/openchat/openchat-3.5-1210 LLM
prompt_template = """GPT4 Correct User: Answer the question using the provided context.
Question: {{question}}
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
<|end_of_turn|>
GPT4 Correct Assistant:
"""
rag_pipeline = Pipeline()
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# Load the LLM using LlamaCppGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppGenerator(model_path=model_path, n_ctx=4096, n_batch=128)
rag_pipeline.add_component(
instance=text_embedder,
name="text_embedder",
)
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
```
Run the pipeline:
```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
{
"text_embedder": {"text": question},
"prompt_builder": {"question": question},
"llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
"answer_builder": {"query": question},
}
)
generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.
```
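The `AnswerBuilder` also attaches the retrieved documents to the answer, so you can inspect which passages the reply was grounded on (a small follow-up to the example above):

```python
# List the title and URL of each document that supported the answer.
for doc in generated_answer.documents:
    print(doc.meta["title"], "-", doc.meta["url"])
```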
## License
`llama-cpp-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.