| Field | Value |
| --- | --- |
| Name | llama-index-llms-openvino |
| Version | 0.3.2 |
| home_page | None |
| Summary | llama-index llms openvino integration |
| upload_time | 2024-10-08 22:35:07 |
| maintainer | None |
| docs_url | None |
| author | Your Name |
| requires_python | <4.0,>=3.9 |
| license | MIT |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# LlamaIndex LLMs Integration: OpenVINO
## Installation
To install the required packages, run:
```bash
%pip install llama-index-llms-openvino transformers huggingface_hub
!pip install llama-index
```
## Setup
### Define Functions for Prompt Handling
You will need functions that convert chat messages and plain completions into the prompt format expected by the model (the Zephyr chat template in this example):
```python
from llama_index.llms.openvino import OpenVINOLLM
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt


def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"
```
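For reference, here is a small, hypothetical sanity check of what these helpers produce. It assumes the `ChatMessage` class from `llama_index.core.llms` (used later in this README) and example message contents chosen purely for illustration:

```python
from llama_index.core.llms import ChatMessage

# Hypothetical check: build a two-message conversation and inspect the
# Zephyr-style prompt string produced by messages_to_prompt.
sample_messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is OpenVINO?"),
]

print(messages_to_prompt(sample_messages))
# Expected shape (roughly):
# <|system|>
# You are a helpful assistant.</s>
# <|user|>
# What is OpenVINO?</s>
# <|assistant|>

# completion_to_prompt wraps a bare string in the same template.
print(completion_to_prompt("What is OpenVINO?"))
```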
### Model Loading
Models can be loaded by passing the relevant parameters to the `OpenVINOLLM` constructor. If you have an Intel GPU, specify `device_map="gpu"` to run inference on it:
```python
ov_config = {
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

ov_llm = OpenVINOLLM(
    model_id_or_path="HuggingFaceH4/zephyr-7b-beta",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)

response = ov_llm.complete("What is the meaning of life?")
print(str(response))
```
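Since `OpenVINOLLM` is a standard LlamaIndex LLM, the chat interface is also available. A minimal sketch, reusing the `ov_llm` instance created above (the message contents are illustrative):

```python
from llama_index.core.llms import ChatMessage

# The messages_to_prompt helper defined earlier is used to turn this
# conversation into a single prompt string before generation.
messages = [
    ChatMessage(role="system", content="You are a concise assistant."),
    ChatMessage(role="user", content="Summarize OpenVINO in one sentence."),
]

chat_response = ov_llm.chat(messages)
print(chat_response.message.content)
```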
### Inference with Local OpenVINO Model
Export your model to the OpenVINO IR format with the `optimum-cli` exporter and load it from a local folder. Applying 8-bit or 4-bit weight quantization is recommended to reduce inference latency and model footprint. Each command below is an alternative export of the same model (full-precision, INT8, or INT4 weights):
```bash
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir
```
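If you prefer to stay in Python, the same export can be done through `optimum-intel`. This is a sketch, assuming the OpenVINO extras of Optimum are installed; the directory name `ov_model_dir` mirrors the CLI commands above:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# Export the Hugging Face model to OpenVINO IR on the fly and save it locally.
model = OVModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", export=True
)
model.save_pretrained("ov_model_dir")

# Saving the tokenizer alongside the IR files keeps the folder self-contained.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
tokenizer.save_pretrained("ov_model_dir")
```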
You can then load the model from the specified directory:
```python
ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="gpu",
)
```
### Additional Optimization
You can get additional inference speed improvements with dynamic quantization of activations and KV-cache quantization. Enable these options with `ov_config` as follows:
```python
ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}
```
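These settings take effect when the model is loaded, so pass the updated dictionary through `model_kwargs` when constructing the LLM. A minimal sketch, reusing the local `ov_model_dir` export and the same constructor arguments shown earlier:

```python
# Reload the model with dynamic activation quantization and u8 KV-cache enabled.
ov_llm = OpenVINOLLM(
    model_id_or_path="ov_model_dir",
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"ov_config": ov_config},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="cpu",
)
```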
## Streaming Responses
For streaming output, use the `stream_complete` and `stream_chat` methods:
### Using `stream_complete`
```python
response = ov_llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")
```
### Using `stream_chat`
```python
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]

resp = ov_llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
```
### LLM Implementation Example
A complete example is available at https://docs.llamaindex.ai/en/stable/examples/llm/openvino/
Raw data
{
    "_id": null,
    "home_page": null,
    "name": "llama-index-llms-openvino",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "Your Name",
    "author_email": "you@example.com",
    "download_url": "https://files.pythonhosted.org/packages/c3/c4/f3d10705942955900297410341228ee0d4d312000c8b96e4f2ccb581062a/llama_index_llms_openvino-0.3.2.tar.gz",
    "platform": null,
"description": "# LlamaIndex Llms Integration: Openvino\n\n## Installation\n\nTo install the required packages, run:\n\n```bash\n%pip install llama-index-llms-openvino transformers huggingface_hub\n!pip install llama-index\n```\n\n## Setup\n\n### Define Functions for Prompt Handling\n\nYou will need functions to convert messages and completions into prompts:\n\n```python\nfrom llama_index.llms.openvino import OpenVINOLLM\n\n\ndef messages_to_prompt(messages):\n prompt = \"\"\n for message in messages:\n if message.role == \"system\":\n prompt += f\"<|system|>\\n{message.content}</s>\\n\"\n elif message.role == \"user\":\n prompt += f\"<|user|>\\n{message.content}</s>\\n\"\n elif message.role == \"assistant\":\n prompt += f\"<|assistant|>\\n{message.content}</s>\\n\"\n\n # Ensure we start with a system prompt, insert blank if needed\n if not prompt.startswith(\"<|system|>\\n\"):\n prompt = \"<|system|>\\n</s>\\n\" + prompt\n\n # Add final assistant prompt\n prompt = prompt + \"<|assistant|>\\n\"\n\n return prompt\n\n\ndef completion_to_prompt(completion):\n return f\"<|system|>\\n</s>\\n<|user|>\\n{completion}</s>\\n<|assistant|>\\n\"\n```\n\n### Model Loading\n\nModels can be loaded by specifying parameters using the `OpenVINOLLM` method. If you have an Intel GPU, specify `device_map=\"gpu\"` to run inference on it:\n\n```python\nov_config = {\n \"PERFORMANCE_HINT\": \"LATENCY\",\n \"NUM_STREAMS\": \"1\",\n \"CACHE_DIR\": \"\",\n}\n\nov_llm = OpenVINOLLM(\n model_id_or_path=\"HuggingFaceH4/zephyr-7b-beta\",\n context_window=3900,\n max_new_tokens=256,\n model_kwargs={\"ov_config\": ov_config},\n generate_kwargs={\"temperature\": 0.7, \"top_k\": 50, \"top_p\": 0.95},\n messages_to_prompt=messages_to_prompt,\n completion_to_prompt=completion_to_prompt,\n device_map=\"cpu\",\n)\n\nresponse = ov_llm.complete(\"What is the meaning of life?\")\nprint(str(response))\n```\n\n### Inference with Local OpenVINO Model\n\nExport your model to the OpenVINO IR format using the CLI and load it from a local folder. It\u2019s recommended to apply 8 or 4-bit weight quantization to reduce inference latency and model footprint:\n\n```bash\n!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta ov_model_dir\n!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int8 ov_model_dir\n!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 ov_model_dir\n```\n\nYou can then load the model from the specified directory:\n\n```python\nov_llm = OpenVINOLLM(\n model_id_or_path=\"ov_model_dir\",\n context_window=3900,\n max_new_tokens=256,\n model_kwargs={\"ov_config\": ov_config},\n generate_kwargs={\"temperature\": 0.7, \"top_k\": 50, \"top_p\": 0.95},\n messages_to_prompt=messages_to_prompt,\n completion_to_prompt=completion_to_prompt,\n device_map=\"gpu\",\n)\n```\n\n### Additional Optimization\n\nYou can get additional inference speed improvements with dynamic quantization of activations and KV-cache quantization. 
Enable these options with `ov_config` as follows:\n\n```python\nov_config = {\n \"KV_CACHE_PRECISION\": \"u8\",\n \"DYNAMIC_QUANTIZATION_GROUP_SIZE\": \"32\",\n \"PERFORMANCE_HINT\": \"LATENCY\",\n \"NUM_STREAMS\": \"1\",\n \"CACHE_DIR\": \"\",\n}\n```\n\n## Streaming Responses\n\nTo use the streaming capabilities, you can use the `stream_complete` and `stream_chat` methods:\n\n### Using `stream_complete`\n\n```python\nresponse = ov_llm.stream_complete(\"Who is Paul Graham?\")\nfor r in response:\n print(r.delta, end=\"\")\n```\n\n### Using `stream_chat`\n\n```python\nfrom llama_index.core.llms import ChatMessage\n\nmessages = [\n ChatMessage(\n role=\"system\", content=\"You are a pirate with a colorful personality\"\n ),\n ChatMessage(role=\"user\", content=\"What is your name\"),\n]\n\nresp = ov_llm.stream_chat(messages)\nfor r in resp:\n print(r.delta, end=\"\")\n```\n\n### LLM Implementation example\n\nhttps://docs.llamaindex.ai/en/stable/examples/llm/openvino/\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "llama-index llms openvino integration",
"version": "0.3.2",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "03dfae55a2698384aebc73e1ab4f36a7bf45c07f7b3f437acaad1283d53c68e1",
"md5": "020bb9dcbb7b850e27ce190043e26b05",
"sha256": "1fc575a5b63d6e8a5471b0d38462f59ae421ee3b5384cbfb3e8a2dbd17cf9893"
},
"downloads": -1,
"filename": "llama_index_llms_openvino-0.3.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "020bb9dcbb7b850e27ce190043e26b05",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 5234,
"upload_time": "2024-10-08T22:35:05",
"upload_time_iso_8601": "2024-10-08T22:35:05.629052Z",
"url": "https://files.pythonhosted.org/packages/03/df/ae55a2698384aebc73e1ab4f36a7bf45c07f7b3f437acaad1283d53c68e1/llama_index_llms_openvino-0.3.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c3c4f3d10705942955900297410341228ee0d4d312000c8b96e4f2ccb581062a",
"md5": "4f8f6ecfe62bf8ef907d9586f80af8e3",
"sha256": "2ecb03a2d3c56e6d5b317208dbad757309a512437785d3fcfa826b8061d78fed"
},
"downloads": -1,
"filename": "llama_index_llms_openvino-0.3.2.tar.gz",
"has_sig": false,
"md5_digest": "4f8f6ecfe62bf8ef907d9586f80af8e3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 4598,
"upload_time": "2024-10-08T22:35:07",
"upload_time_iso_8601": "2024-10-08T22:35:07.194836Z",
"url": "https://files.pythonhosted.org/packages/c3/c4/f3d10705942955900297410341228ee0d4d312000c8b96e4f2ccb581062a/llama_index_llms_openvino-0.3.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-08 22:35:07",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "llama-index-llms-openvino"
}