<div align="center">
<img width="400" height="350" src="/img/laser.webp">
</div>
<h1 align="center">
<em>SuperLaser</em>
</h1>
<h5 align="center">
⚠️<em>Not yet ready for primetime</em> ⚠️
</h5>
**SuperLaser** provides tools and scripts for deploying LLMs onto [RunPod's](https://github.com/runpod) pod and serverless infrastructure. Deployments run a containerized [vLLM](https://github.com/vllm-project/vllm) engine at runtime for memory-efficient, high-performance inference.
# Features <img align="center" width="30" height="29" src="https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExOTBqaWNrcGxnaTdzMGRzNTN0bGI2d3A4YWkxajhsb2F5MW84Z2dxaCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/26tOZ42Mg6pbTUPHW/giphy.gif">
- **Scalable Deployment**: Easily scale your LLM inference tasks with vLLM and RunPod serverless capabilities.
- **Cost-Effective**: Optimize resource and hardware usage with tensor parallelism and other GPU assets.
- **Uses OpenAI's API**: Use the SuperLaser client with chat, non-chat, and streaming options.
# Install <img align="center" width="30" height="29" src="https://media.giphy.com/media/sULKEgDMX8LcI/giphy.gif">
```bash
pip install superlaser
```
Before you begin, ensure you have:
- A RunPod account.
# RunPod Config <img align="right" width="75" height="75" src="./img/runpod-logo.png">
The first step is to obtain an API key from RunPod. In your account's console, open the `Settings` section and click `API Keys`.
After obtaining a key, set it as an environment variable:
```bash
export RUNPOD_API_KEY=<YOUR-API-KEY>
```
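If the variable is missing, the problem may only surface later as a rejected request. A minimal sketch of an early check follows; the helper name `require_api_key` is hypothetical and not part of SuperLaser.

```python
import os


def require_api_key(var_name: str = "RUNPOD_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper for illustration; not part of the SuperLaser package.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before continuing.")
    return key
```

Calling `require_api_key()` at the top of a deployment script stops the run before any request is made with an empty key.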
### Configure Template
Before spinning up a serverless endpoint, configure a template to pass into the endpoint during staging. The template lets you select a serverless or pod asset, your Docker image name, and the disk space for the container and the volume.
Configure your template with the following attributes:
```py
import os
from superlaser import RunpodHandler as runpod

api_key = os.environ.get("RUNPOD_API_KEY")

template_data = runpod.set_template(
    serverless="true",
    template_name="superlaser-inf",  # Give a name to your template
    container_image="runpod/worker-vllm:0.3.1-cuda12.1.0",  # Docker image stub
    model_name="mistralai/Mistral-7B-v0.1",  # Hugging Face model stub
    max_model_length=340,  # Maximum number of tokens for the engine to handle per request.
    container_disk=15,
    volume_disk=15,
)
```
### Create Template on RunPod
```py
template = runpod(api_key, data=template_data)
print(template().text)
```
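The call above prints the raw response body. Its exact shape is not documented here, so the following is only a sketch: it assumes the body is JSON and that the ID sits under a key named `id` somewhere in the nested structure. The helper name `find_first_id` and the sample body are hypothetical.

```python
import json
from typing import Any, Optional


def find_first_id(payload: Any, key: str = "id") -> Optional[str]:
    """Depth-first search for the first string value stored under `key`."""
    if isinstance(payload, dict):
        if isinstance(payload.get(key), str):
            return payload[key]
        for value in payload.values():
            found = find_first_id(value, key)
            if found is not None:
                return found
    elif isinstance(payload, list):
        for item in payload:
            found = find_first_id(item, key)
            if found is not None:
                return found
    return None


# Made-up response body, for illustration only; the real shape may differ.
sample_body = '{"data": {"result": {"id": "abc123", "name": "superlaser-inf"}}}'
print(find_first_id(json.loads(sample_body)))  # abc123
```

With a live response, keep the result of the call once (for example `response = template()`) and pass `json.loads(response.text)` to the helper. The same approach could be tried on the endpoint response, under the same assumptions.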
### Configure Endpoint
After your template is created, the call returns a data dictionary that includes your template ID. Pass this template ID when configuring the serverless endpoint:
```py
endpoint_data = runpod.set_endpoint(
    gpu_ids="AMPERE_24",  # options for gpuIds are "AMPERE_16,AMPERE_24,AMPERE_48,AMPERE_80,ADA_24"
    idle_timeout=5,
    name="vllm_endpoint",
    scaler_type="QUEUE_DELAY",
    scaler_value=1,
    template_id="template-id",
    workers_max=1,
    workers_min=0,
)
```
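The `gpu_ids` comment above lists the accepted pool names. A small sketch that checks a value against that list before it is sent; the helper name `check_gpu_ids` is hypothetical, the set is copied from the comment and may be out of date, and treating the value as a comma-separated list of pools is an assumption the comment suggests but does not state.

```python
from typing import List

# Copied from the comment in the endpoint configuration; may not be exhaustive.
KNOWN_GPU_IDS = {"AMPERE_16", "AMPERE_24", "AMPERE_48", "AMPERE_80", "ADA_24"}


def check_gpu_ids(gpu_ids: str) -> List[str]:
    """Split a comma-separated gpu_ids string and reject unknown pool names."""
    requested = [part.strip() for part in gpu_ids.split(",") if part.strip()]
    unknown = [name for name in requested if name not in KNOWN_GPU_IDS]
    if not requested or unknown:
        raise ValueError(
            f"Unrecognized gpu_ids {unknown or gpu_ids!r}; "
            f"expected one of {sorted(KNOWN_GPU_IDS)}"
        )
    return requested
```

For example, `check_gpu_ids("AMPERE_24")` returns `["AMPERE_24"]`, while a misspelled name raises `ValueError` before any request is made.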
### Start Endpoint on RunPod
```py
endpoint = runpod(api_key, data=endpoint_data)
print(endpoint().text)
```
# Call Endpoint <img align="right" width="225" height="75" src="./img/vllm-logo.png">
Once your endpoint is staged, the response includes a dictionary with your endpoint ID. Pass this endpoint ID to the `OpenAI` client and start making API requests!
```py
from openai import OpenAI

endpoint_id = "your-endpoint-id"

client = OpenAI(
    api_key=api_key,  # RunPod API key loaded earlier
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)
```
#### Chat w/ Streaming
```py
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # Use the model name your endpoint serves
    messages=[{"role": "user", "content": "To be or not to be"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)

# Print each streamed token as it arrives
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
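The loop above prints tokens as they arrive. If you also want the full reply as a single string, the same chunks can be accumulated. This is a sketch with a hypothetical helper, `collect_stream`; it relies only on the `chunk.choices[0].delta.content` attribute path used above and is exercised here with stand-in objects rather than a live endpoint.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Join the text deltas of a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, for illustration only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


print(collect_stream([fake_chunk("To be, "), fake_chunk(None), fake_chunk("or not.")]))
# To be, or not.
```

With a real endpoint, pass the `stream` object returned by `client.chat.completions.create(..., stream=True)` in place of the stand-in list.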
#### Completion w/ Streaming
```py
stream = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # Use the model name your endpoint serves
    prompt="To be or not to be",
    temperature=0,
    max_tokens=100,
    stream=True,
)

# Print each streamed text fragment as it arrives
for response in stream:
    print(response.choices[0].text or "", end="", flush=True)
```
Raw data
{
"_id": null,
"home_page": "",
"name": "superlaser",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "inference,server,LLM,NLP,MLOps,deployment,vllm,runpod,cicd",
"author": "InquestGeronimo",
"author_email": "rcostanl@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/f2/b7/1d55b8ec9d802199d8b25397002b60155e0b453cc64b8d3f5b8056ddcae5/superlaser-0.0.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0 license",
"summary": "An MLOps library for LLM deployment w/ the vLLM engine on RunPod's infra.",
"version": "0.0.6",
"project_urls": null,
"split_keywords": [
"inference",
"server",
"llm",
"nlp",
"mlops",
"deployment",
"vllm",
"runpod",
"cicd"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fef22fed79e68cf12918f12ea761227bd1a4161a2c541a5d0795e01af0431b6b",
"md5": "12f30a32c161a40e4f7719d45b8297c7",
"sha256": "4defd824787c03c8c146b13736999a9ebadecca3b79c4e33411d2b7c4b014e0f"
},
"downloads": -1,
"filename": "superlaser-0.0.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "12f30a32c161a40e4f7719d45b8297c7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 12011,
"upload_time": "2024-03-02T20:54:33",
"upload_time_iso_8601": "2024-03-02T20:54:33.492188Z",
"url": "https://files.pythonhosted.org/packages/fe/f2/2fed79e68cf12918f12ea761227bd1a4161a2c541a5d0795e01af0431b6b/superlaser-0.0.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f2b71d55b8ec9d802199d8b25397002b60155e0b453cc64b8d3f5b8056ddcae5",
"md5": "6d0ce9414cacf5c9cdfa5bc6c76dcaca",
"sha256": "4812ab3cc423903021853ef1bd8373a54cc68078b4bb085d8bb29daed12ebd26"
},
"downloads": -1,
"filename": "superlaser-0.0.6.tar.gz",
"has_sig": false,
"md5_digest": "6d0ce9414cacf5c9cdfa5bc6c76dcaca",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 12431,
"upload_time": "2024-03-02T20:54:35",
"upload_time_iso_8601": "2024-03-02T20:54:35.315139Z",
"url": "https://files.pythonhosted.org/packages/f2/b7/1d55b8ec9d802199d8b25397002b60155e0b453cc64b8d3f5b8056ddcae5/superlaser-0.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-02 20:54:35",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "superlaser"
}