| Field | Value |
| --- | --- |
| Name | llama-index-llms-llama-cpp |
| Version | 0.5.1 |
| Summary | llama-index llms llama cpp integration |
| upload_time | 2025-09-08 20:49:44 |
| home_page | None |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | <4.0,>=3.9 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# LlamaIndex Llms Integration: Llama Cpp
## Installation
To get the best performance out of `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal).
Full macOS installation instructions are also available [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).
In general:
- Use `CuBLAS` if you have CUDA and an NVIDIA GPU
- Use `METAL` if you are running on an M1/M2 MacBook
- Use `CLBLAST` if you are running on an AMD/Intel GPU
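For example, with a recent `llama-cpp-python` release the backend is chosen with CMake flags at install time. The exact flag names have changed between versions, so treat the commands below as a sketch and follow the guides linked above for your platform:

```bash
# NVIDIA GPU via CUDA (older releases used -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Apple silicon via Metal (older releases used -DLLAMA_METAL=on)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```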
Then, install the required llama-index packages:
```bash
pip install llama-index-embeddings-huggingface
pip install llama-index-llms-llama-cpp
```
## Basic Usage
### Initialize LlamaCPP
Set up the model URL and initialize the LlamaCPP LLM:
```python
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer
model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
def messages_to_prompt(messages):
messages = [{"role": m.role.value, "content": m.content} for m in messages]
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
return prompt
def completion_to_prompt(completion):
messages = [{"role": "user", "content": completion}]
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
return prompt
llm = LlamaCPP(
# You can pass in the URL to a GGUF model to download it automatically
model_url=model_url,
# optionally, you can set the path to a pre-downloaded model instead of model_url
model_path=None,
temperature=0.1,
max_new_tokens=256,
# Qwen2.5-7B-Instruct supports a 32k context window; we set it lower to leave some headroom
context_window=16384,
# kwargs to pass to __call__()
generate_kwargs={},
# kwargs to pass to __init__()
# set n_gpu_layers to at least 1 to use the GPU; -1 offloads all layers
model_kwargs={"n_gpu_layers": -1},
# transform inputs into the Qwen chat format via the tokenizer's chat template
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
verbose=True,
)
```
### Generate Completions
Use the `complete` method to generate a response:
```python
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```
### Stream Completions
You can also stream completions for a prompt:
```python
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
print(response.delta, end="", flush=True)
```
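Since `messages_to_prompt` is wired into the LLM, the chat-style API works as well. A minimal sketch (the message contents are just placeholders):

```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a concise assistant."),
    ChatMessage(role="user", content="Write a haiku about llamas."),
]

# chat() formats the messages with messages_to_prompt before calling the model
chat_response = llm.chat(messages)
print(chat_response.message.content)

# stream_chat() yields incremental deltas, like stream_complete()
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)
```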
### Set Up Query Engine with LlamaCPP
Change the global tokenizer to match the LLM:
```python
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer
set_global_tokenizer(
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
```
### Use Hugging Face Embeddings
Set up the embedding model and load documents:
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader(
"../../../examples/paul_graham_essay/data"
).load_data()
```
### Create Vector Store Index
Create a vector store index from the loaded documents:
```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
```
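If you want to avoid re-embedding the documents on every run, the index can optionally be persisted and reloaded. A sketch assuming a local `./storage` directory:

```python
from llama_index.core import StorageContext, load_index_from_storage

# write the docstore, vector store, and index metadata to disk
index.storage_context.persist(persist_dir="./storage")

# later: rebuild the index from disk, reusing the same embedding model
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, embed_model=embed_model)
```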
### Set Up Query Engine
Set up the query engine with the LlamaCPP LLM:
```python
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)
```
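The query engine can also stream the answer token by token, which is convenient with a local llama.cpp model. A minimal sketch:

```python
# streaming=True makes query() return a streaming response object
streaming_engine = index.as_query_engine(llm=llm, streaming=True)
streaming_response = streaming_engine.query("What did the author do growing up?")
streaming_response.print_response_stream()
```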
### LLM Implementation Example
https://docs.llamaindex.ai/en/stable/examples/llm/llama_cpp/
Raw data
```json
{
"_id": null,
"home_page": null,
"name": "llama-index-llms-llama-cpp",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "Your Name <you@example.com>",
"download_url": "https://files.pythonhosted.org/packages/b4/66/0f60c34b6004852bb65dcb300f1bb0805f7a88e51099b424d878223c02cc/llama_index_llms_llama_cpp-0.5.1.tar.gz",
"platform": null,
"description": "# LlamaIndex Llms Integration: Llama Cpp\n\n## Installation\n\nTo get the best performance out of `LlamaCPP`, it is recommended to install the package so that it is compiled with GPU support. A full guide for installing this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal).\n\nFull MACOS instructions are also [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).\n\nIn general:\n\n- Use `CuBLAS` if you have CUDA and an NVidia GPU\n- Use `METAL` if you are running on an M1/M2 MacBook\n- Use `CLBLAST` if you are running on an AMD/Intel GPU\n\nThem, install the required llama-index packages:\n\n```bash\npip install llama-index-embeddings-huggingface\npip install llama-index-llms-llama-cpp\n```\n\n## Basic Usage\n\n### Initialize LlamaCPP\n\nSet up the model URL and initialize the LlamaCPP LLM:\n\n```python\nfrom llama_index.llms.llama_cpp import LlamaCPP\nfrom transformers import AutoTokenizer\n\nmodel_url = \"https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf\"\n\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\")\n\n\ndef messages_to_prompt(messages):\n messages = [{\"role\": m.role.value, \"content\": m.content} for m in messages]\n prompt = tokenizer.apply_chat_template(\n messages, tokenize=False, add_generation_prompt=True\n )\n return prompt\n\n\ndef completion_to_prompt(completion):\n messages = [{\"role\": \"user\", \"content\": completion}]\n prompt = tokenizer.apply_chat_template(\n messages, tokenize=False, add_generation_prompt=True\n )\n return prompt\n\n\nllm = LlamaCPP(\n # You can pass in the URL to a GGML model to download it automatically\n model_url=model_url,\n # optionally, you can set the path to a pre-downloaded model instead of model_url\n model_path=None,\n temperature=0.1,\n max_new_tokens=256,\n # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room\n context_window=16384,\n # kwargs to pass to __call__()\n generate_kwargs={},\n # kwargs to pass to __init__()\n # set to at least 1 to use GPU\n model_kwargs={\"n_gpu_layers\": -1},\n # transform inputs into Llama2 format\n messages_to_prompt=messages_to_prompt,\n completion_to_prompt=completion_to_prompt,\n verbose=True,\n)\n```\n\n### Generate Completions\n\nUse the `complete` method to generate a response:\n\n```python\nresponse = llm.complete(\"Hello! 
Can you tell me a poem about cats and dogs?\")\nprint(response.text)\n```\n\n### Stream Completions\n\nYou can also stream completions for a prompt:\n\n```python\nresponse_iter = llm.stream_complete(\"Can you write me a poem about fast cars?\")\nfor response in response_iter:\n print(response.delta, end=\"\", flush=True)\n```\n\n### Set Up Query Engine with LlamaCPP\n\nChange the global tokenizer to match the LLM:\n\n```python\nfrom llama_index.core import set_global_tokenizer\nfrom transformers import AutoTokenizer\n\nset_global_tokenizer(\n AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\").encode\n)\n```\n\n### Use Hugging Face Embeddings\n\nSet up the embedding model and load documents:\n\n```python\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\n\nembed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")\ndocuments = SimpleDirectoryReader(\n \"../../../examples/paul_graham_essay/data\"\n).load_data()\n```\n\n### Create Vector Store Index\n\nCreate a vector store index from the loaded documents:\n\n```python\nindex = VectorStoreIndex.from_documents(documents, embed_model=embed_model)\n```\n\n### Set Up Query Engine\n\nSet up the query engine with the LlamaCPP LLM:\n\n```python\nquery_engine = index.as_query_engine(llm=llm)\nresponse = query_engine.query(\"What did the author do growing up?\")\nprint(response)\n```\n\n### LLM Implementation example\n\nhttps://docs.llamaindex.ai/en/stable/examples/llm/llama_cpp/\n",
"bugtrack_url": null,
"license": null,
"summary": "llama-index llms llama cpp integration",
"version": "0.5.1",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "c98986e350ccde0f383a543c02bbf75b708a2acd616468a1002ee2e1b9d51911",
"md5": "18b2d65aa28e176669706dd8e74cc1f2",
"sha256": "2721423a41eee6fa706bb3f85acb6f5feadc948265dcc724d105a00198bbf5e4"
},
"downloads": -1,
"filename": "llama_index_llms_llama_cpp-0.5.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "18b2d65aa28e176669706dd8e74cc1f2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 8360,
"upload_time": "2025-09-08T20:49:43",
"upload_time_iso_8601": "2025-09-08T20:49:43.964566Z",
"url": "https://files.pythonhosted.org/packages/c9/89/86e350ccde0f383a543c02bbf75b708a2acd616468a1002ee2e1b9d51911/llama_index_llms_llama_cpp-0.5.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b4660f60c34b6004852bb65dcb300f1bb0805f7a88e51099b424d878223c02cc",
"md5": "997f5fbdc7e41e1a428d1747d950e4db",
"sha256": "f8d952622da20f1817775f0b40db0575397e5ba83e1103a769a211f105c41ce4"
},
"downloads": -1,
"filename": "llama_index_llms_llama_cpp-0.5.1.tar.gz",
"has_sig": false,
"md5_digest": "997f5fbdc7e41e1a428d1747d950e4db",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 8179,
"upload_time": "2025-09-08T20:49:44",
"upload_time_iso_8601": "2025-09-08T20:49:44.682462Z",
"url": "https://files.pythonhosted.org/packages/b4/66/0f60c34b6004852bb65dcb300f1bb0805f7a88e51099b424d878223c02cc/llama_index_llms_llama_cpp-0.5.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-08 20:49:44",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "llama-index-llms-llama-cpp"
}
```