llama-index-llms-llama-cpp


Name: llama-index-llms-llama-cpp
Version: 0.5.1
Summary: llama-index llms llama cpp integration
Upload time: 2025-09-08 20:49:44
Requires Python: <4.0,>=3.9
Requirements: none recorded
# LlamaIndex LLMs Integration: LlamaCPP

## Installation

To get the best performance out of `LlamaCPP`, it is recommended to install the underlying `llama-cpp-python` package so that it is compiled with GPU support. A full guide for installing it this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal).

Full macOS instructions are also available [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).

In general (example install commands are sketched after this list):

- Use `cuBLAS` if you have CUDA and an NVIDIA GPU
- Use `Metal` if you are running on an M1/M2 MacBook
- Use `CLBlast` if you are running on an AMD/Intel GPU
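
For example, `llama-cpp-python` is built against a particular backend by setting `CMAKE_ARGS` at install time. The exact flag names vary between releases (older versions use `LLAMA_CUBLAS`/`LLAMA_METAL`/`LLAMA_CLBLAST`, newer ones use `GGML_*` names), so treat the commands below as a sketch and check the linked install guide for your version:

```bash
# NVIDIA GPU with CUDA (older releases: -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# Apple Silicon with Metal (older releases: -DLLAMA_METAL=on)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```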

Then, install the required llama-index packages:

```bash
pip install llama-index-embeddings-huggingface
pip install llama-index-llms-llama-cpp
```

## Basic Usage

### Initialize LlamaCPP

Set up the model URL and initialize the LlamaCPP LLM:

```python
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer

model_url = "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q3_k_m.gguf"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")


def messages_to_prompt(messages):
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # Qwen2.5-7B-Instruct supports up to a 32K-token context window;
    # 16384 leaves room for long prompts while keeping memory usage down
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU; -1 offloads all layers
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into the Qwen2.5 chat format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
```

### Generate Completions

Use the `complete` method to generate a response:

```python
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)
```
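
The same `llm` object also exposes the chat interface; a minimal sketch using LlamaIndex's `ChatMessage`:

```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Write a haiku about llamas."),
]

# chat() routes the messages through messages_to_prompt defined above
response = llm.chat(messages)
print(response.message.content)
```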

### Stream Completions

You can also stream completions for a prompt:

```python
response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)
```
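
Chat messages can be streamed in the same way; a short sketch reusing the `messages` list from the chat example above:

```python
response_iter = llm.stream_chat(messages)
for chunk in response_iter:
    print(chunk.delta, end="", flush=True)
```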

### Set Up Query Engine with LlamaCPP

Change the global tokenizer to match the LLM (LlamaIndex uses it for token counting when sizing prompts and chunks):

```python
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").encode
)
```

### Use Hugging Face Embeddings

Set up the embedding model and load documents:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()
```
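
Optionally, instead of passing `llm` and `embed_model` into every call, you can register them as global defaults via `Settings`; a minimal sketch:

```python
from llama_index.core import Settings

# components registered here are picked up by indexes and query engines
# whenever no explicit llm/embed_model argument is given
Settings.llm = llm
Settings.embed_model = embed_model
```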

### Create Vector Store Index

Create a vector store index from the loaded documents:

```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
```
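
To avoid re-embedding the documents on every run, the index can be persisted to disk and reloaded later; a sketch assuming a local `./storage` directory:

```python
from llama_index.core import StorageContext, load_index_from_storage

# write the index to ./storage (hypothetical path)
index.storage_context.persist(persist_dir="./storage")

# later: reload it with the same embedding model
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, embed_model=embed_model)
```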

### Set Up Query Engine

Set up the query engine with the LlamaCPP LLM:

```python
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What did the author do growing up?")
print(response)
```
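
Query responses can also be streamed token by token by building the query engine with `streaming=True`; a minimal sketch:

```python
streaming_query_engine = index.as_query_engine(llm=llm, streaming=True)
streaming_response = streaming_query_engine.query(
    "What did the author do growing up?"
)
streaming_response.print_response_stream()
```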

### LLM Implementation example

https://docs.llamaindex.ai/en/stable/examples/llm/llama_cpp/

            
