llama-index-llms-openvino


Name: llama-index-llms-openvino
Version: 0.3.2
Summary: llama-index llms openvino integration
Upload time: 2024-10-08 22:35:07
Home page: None
Maintainer: None
Docs URL: None
Author: Your Name
Requires Python: <4.0,>=3.9
License: MIT
# LlamaIndex LLMs Integration: OpenVINO

## Installation

To install the required packages, run:

```bash
pip install llama-index-llms-openvino transformers huggingface_hub
pip install llama-index
```

## Setup

### Define Functions for Prompt Handling

You will need functions to convert chat messages and plain completions into the model's prompt format; the template below follows the Zephyr chat format used by `HuggingFaceH4/zephyr-7b-beta`:

```python
from llama_index.llms.openvino import OpenVINOLLM


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"
```
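
As a quick sanity check, you can print what these helpers produce. This is a small illustrative snippet, with `ChatMessage` imported from `llama_index.core.llms` as in the streaming example further down:

```python
from llama_index.core.llms import ChatMessage

# A plain completion is wrapped in an empty system block plus a user turn.
print(completion_to_prompt("Tell me a joke."))

# Chat messages are rendered with their <|role|> headers; a blank
# <|system|> block is prepended when no system message is supplied.
print(messages_to_prompt([ChatMessage(role="user", content="Tell me a joke.")]))
```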

### Model Loading

Models are loaded by instantiating `OpenVINOLLM` with the desired parameters. If you have an Intel GPU, you can specify `device_map="gpu"` to run inference on it:

```python
ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

ov_llm = OpenVINOLLM(
    model_id_or_path="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

response = ov_llm.complete("What is the meaning of life?")
print(str(response))
```
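
If you are unsure which device names OpenVINO detects on your machine, you can query the runtime before choosing `device_map`. This is a minimal sketch and assumes the `openvino` runtime package is importable in your environment:

```python
import openvino as ov

core = ov.Core()
# Typically includes "CPU", plus "GPU" when an Intel GPU and its drivers are present.
print(core.available_devices)
```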

### Inference with Local OpenVINO Model

Export your model to the OpenVINO IR format with the `optimum-cli` tool and load it from a local folder. Applying 8- or 4-bit weight quantization is recommended to reduce inference latency and model footprint. Note that each command below writes to the same `ov_model_dir`, so run only the variant you want (or export to separate directories):

```bash
optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir
optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir
```

You can then load the model from the specified directory:

```python
ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)
```
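
The locally exported model is used exactly like the Hub-hosted one. For example, a non-streaming chat call (a short sketch reusing `ChatMessage` from `llama_index.core.llms`):

```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a concise assistant."),
    ChatMessage(role="user", content="What is OpenVINO in one sentence?"),
]

# Blocking chat call; see the streaming variants below.
print(ov_llm.chat(messages))
```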

### Additional Optimization

You can get additional inference speed improvements with dynamic quantization of activations and KV-cache quantization. Enable these options with `ov_config` as follows:

```python
ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
```
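
Pass the tuned dictionary through `model_kwargs` when constructing the LLM, just as in the earlier examples. A brief sketch (other constructor arguments such as `context_window` and `generate_kwargs` are omitted for brevity):

```python
ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    # Enables KV-cache quantization and dynamic quantization of activations.
    model_kwargs={"ov_config": ov_config},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)
```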

## Streaming Responses

For streaming output, use the `stream_complete` and `stream_chat` methods:

### Using `stream_complete`

```python
response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")
```

### Using `stream_chat`

```python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]

resp = ov_llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
```
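
Beyond direct calls, the model can also serve as the default LLM for other LlamaIndex components. A minimal sketch using the global `Settings` object from `llama_index.core`:

```python
from llama_index.core import Settings

# Query engines, chat engines, and other components that do not specify
# their own LLM will now use the OpenVINO-backed model.
Settings.llm = ov_llm
```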

### LLM Implementation Example

https://docs.llamaindex.ai/en/stable/examples/llm/openvino/

            
