ovllm

Name: ovllm
Version: 0.3.0
Summary: One-line vLLM wrapper with gorgeous DSPy integration
Author: Maxime Rivest <mrive052@gmail.com>
License: MIT
Requires Python: >=3.10, <3.13
Keywords: llm, vllm, dspy, ai, machine-learning, deep-learning, language-models
Homepage: https://github.com/maximerivest/ovllm
Documentation: https://ovllm.readthedocs.io
Issues: https://github.com/maximerivest/ovllm/issues
Upload time: 2025-08-04 22:42:36

# OvLLM 🚀

> This is the very first 'working' version. Expect things to change and improve a lot. Still, what is documented here works on my machine.

**One-line vLLM for everyone**

OvLLM is a Python library that makes running local LLMs as easy as a function call, while leveraging the incredible performance of [vLLM](https://github.com/vllm-project/vllm). It's designed for simplicity without sacrificing power, featuring native [DSPy](https://github.com/stanfordnlp/dspy) integration with proper output formatting and automatic request batching for maximum GPU efficiency.

At its core, `ovllm` ensures you can get started instantly and build complex pipelines without worrying about the underlying engine's state management, memory, or batching logic.

-----

## ✨ Features

  - **Zero-Config Startup**: Works out of the box with a sensible default model that runs on most systems.
  - **One-Line Model Swapping**: Hot-swap models on the GPU with a single command: `llmtogpu("any-huggingface-model")`.
  - **Automatic Request Batching**: Transparently groups concurrent requests for optimal GPU throughput.
  - **Native DSPy Compatibility**: The `llm` object is a first-class `dspy.LM`, ready for any DSPy module or optimizer.
  - **Smart Memory Management**: Automatically unloads old models and clears GPU memory when you switch.
  - **Helpful Errors & Helpers**: Get clear, actionable error messages and use helpers like `suggest_models()` to find the right model for your hardware.
  - **Rich Documentation**: Comprehensive help is built-in via Python's `help()` function (see the quick example below).
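
A quick way to browse that built-in help from a Python session, using the names that appear in the examples below:

```python
import ovllm

# Print the docstrings bundled with the library
help(ovllm.llm)
help(ovllm.llmtogpu)
help(ovllm.suggest_models)
```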

-----

## 📦 Installation

**Prerequisite**: OVLLM uses vLLM, which requires an NVIDIA GPU with CUDA 12.1 or newer. Please ensure your environment is set up correctly.

### From PyPI (Recommended)

```bash
pip install ovllm
```

### From Source

```bash
git clone https://github.com/maximerivest/ovllm
cd ovllm
pip install -e .
```
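
After installing, you can sanity-check that a CUDA-capable GPU is visible from Python. This snippet uses PyTorch, which vLLM pulls in as a dependency; it is only a quick check, not part of OVLLM itself:

```python
import torch

# vLLM needs an NVIDIA GPU and a recent CUDA runtime
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)
```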

-----

## 🎯 Quick Start

### Basic Usage

Just import the `llm` object and call it. The first time you do, a small, capable default model will be downloaded and loaded into your GPU.

```python
import ovllm
# The first call loads the default model (Qwen/Qwen3-0.6B). Please wait a moment.
response = ovllm.llm("What is the capital of Canada?")
print(response[0])  # llm() returns a list; index 0 is the completion text
```

> **Note:** For full DSPy compatibility, `llm()` returns a list containing a completion object. That's why we access `response[0]` to get the first result.

### Switching Models

Easily switch to any model on the Hugging Face Hub. `ovllm` handles the cleanup automatically.

```python
from ovllm import llmtogpu, suggest_models, llm

# See what models your GPU can handle
suggest_models() # good suggestions to come! wip :)

# Load a different model
llmtogpu("google/gemma-3n-E4B-it", vllm_args={"tensor_parallel_size": 1, "gpu_memory_utilization": 0.80}) 

# Now all calls use the new model
response = llm("Explain quantum computing in simple terms")
print(response[0])
```

-----

## 🤖 DSPy Integration

OVLLM is designed to be a perfect companion for DSPy. Just configure it once.

### Simple Prediction

```python
import dspy
import ovllm

# Configure DSPy to use your local OVLLM instance
dspy.configure(lm=ovllm.llm)

# Create a simple predictor
predict = dspy.Predict("question -> answer")

# Run the predictor
result = predict(question="What is the powerhouse of the cell?")
print(result.answer)
```

### Chain of Thought (CoT) Reasoning

```python
import dspy
import ovllm

dspy.configure(lm=ovllm.llm)

# Use ChainOfThought to encourage step-by-step reasoning
cot_predictor = dspy.ChainOfThought("question -> answer")

result = cot_predictor(question="If I have 5 apples and I eat 2, then buy 3 more, how many apples do I have left?")
print(f"Answer: {result.answer}")
# The model's reasoning is also available!
print(f"\nReasoning:\n{result.reasoning=}")
```

### Automatic Batching

When you use DSPy features that make multiple calls to the LM (like `predict.batch` or optimizers), OVLLM's `AutoBatchLM` layer automatically catches these concurrent requests and sends them to the GPU in a single, efficient batch. You don't have to do anything extra to get this performance boost.

```python
import dspy
import ovllm

dspy.configure(lm=ovllm.llm)

questions = [
    "What color is the sky on a clear day?",
    "What is 2+2?",
    "What is the main component of air?",
]

examples = [dspy.Example(question=q).with_inputs("question") for q in questions]

predict = dspy.Predict("question -> answer")

# This automatically runs as a single efficient batch on the GPU!
results = predict.batch(examples)

for ex, res in zip(examples, results):
    print(f"Q: {ex.question}")
    print(f"A: {res.answer}\n")
```
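
The same batching applies whenever calls arrive concurrently, not just through DSPy. Assuming the auto-batching queue groups calls made from plain Python threads the same way it groups DSPy's calls (the "concurrent requests" feature description suggests it does), a sketch would look like:

```python
import ovllm
from concurrent.futures import ThreadPoolExecutor

prompts = [
    "Name one planet in the solar system.",
    "What is the boiling point of water at sea level?",
    "Give a synonym for 'happy'.",
]

# Concurrent calls land in the batching queue and run as one GPU batch
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    responses = list(pool.map(ovllm.llm, prompts))

for prompt, response in zip(prompts, responses):
    print(prompt, "->", response[0])
```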

-----

## 🛠️ Advanced Usage

### Custom Parameters

Pass any vLLM-supported parameters directly to `llmtogpu` to customize model loading and generation.

```python
from ovllm import llmtogpu

llmtogpu(
    "microsoft/phi-2",
    temperature=0.0,                  # For deterministic outputs
    max_tokens=2048,                  # Allow longer responses
    gpu_memory_utilization=0.9,       # Use 90% of GPU VRAM
    dtype="float16"                   # Use specific precision
)
```

### Error Handling

OVLLM provides clear, actionable error messages if something goes wrong.

**Model too large for VRAM:**

```
❌ ERROR: Not enough GPU memory to load 'meta-llama/Llama-2-70b-hf'.
   Try lowering the `gpu_memory_utilization` (e.g., `llmtogpu(..., gpu_memory_utilization=0.8)`) or use a smaller model.
```

**Gated Hugging Face Model:**

```
❌ ERROR: The HuggingFace repository for 'meta-llama/Llama-3-8B-Instruct' is gated.
   1. Visit https://huggingface.co/settings/tokens to create a token.
   2. Run `huggingface-cli login` in your terminal and paste the token.
```
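
The exception types behind these messages are not documented here; assuming failures surface as ordinary Python exceptions, a defensive loading pattern might look like this (the fallback model and memory budgets are illustrative choices, using the documented default model as the fallback):

```python
from ovllm import llmtogpu

try:
    # Ambitious first choice that may not fit in VRAM
    llmtogpu("meta-llama/Llama-2-70b-hf", gpu_memory_utilization=0.9)
except Exception as err:  # assumption: load failures raise a normal exception
    print(f"Could not load the large model: {err}")
    # Fall back to the small default model with a more conservative budget
    llmtogpu("Qwen/Qwen3-0.6B", gpu_memory_utilization=0.8)
```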

-----

## ⚙️ How It Works

OVLLM's simplicity is enabled by a robust underlying architecture designed to solve common challenges with local LLMs.

1.  **The Proxy Object (`llm`)**: When you use `ovllm.llm`, you're interacting with a lightweight proxy object. This object doesn't contain the massive model itself, so it can be safely copied by DSPy without duplicating the engine.
2.  **The Singleton Manager**: The proxy communicates with a single, global instance manager. On the first call, this manager loads the vLLM engine into the GPU.
3.  **The Auto-Batching Queue (`AutoBatchLM`)**: All requests from the proxy are sent to an intelligent queue. This queue collects concurrent requests and groups them into an optimal batch before sending them to the vLLM engine, maximizing GPU throughput.
4.  **Automatic Cleanup**: When you call `llmtogpu()`, the manager gracefully shuts down the old engine and its batching queue, clears the GPU memory, and then loads the new model.

This architecture gives you the best of both worlds: a dead-simple, stateless-feeling API and a high-performance, statefully-managed backend.
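
To make that shape concrete, here is a heavily simplified, self-contained sketch of the proxy + singleton manager + batching-queue pattern described above. It is **not** OVLLM's actual source; the class names and the fake completion string are purely illustrative:

```python
import queue
import threading


class _EngineManager:
    """Singleton that owns the (expensive) engine and a request queue."""

    def __init__(self) -> None:
        self.model: str | None = None
        self._requests: queue.Queue = queue.Queue()

    def load(self, model: str) -> None:
        # Real code would shut down the old engine and free GPU memory here.
        self.model = model
        threading.Thread(target=self._batch_loop, daemon=True).start()

    def submit(self, prompt: str) -> str:
        reply: queue.Queue = queue.Queue()
        self._requests.put((prompt, reply))
        return reply.get()

    def _batch_loop(self) -> None:
        while True:
            batch = [self._requests.get()]
            # Drain whatever else arrived concurrently -> one "GPU batch"
            while not self._requests.empty():
                batch.append(self._requests.get())
            for prompt, reply in batch:
                reply.put(f"[{self.model}] fake completion for: {prompt}")


_MANAGER = _EngineManager()


class LLMProxy:
    """Lightweight handle; safe to copy because it holds no engine state."""

    def __call__(self, prompt: str) -> list[str]:
        return [_MANAGER.submit(prompt)]


llm = LLMProxy()
_MANAGER.load("demo-model")
print(llm("Hello!")[0])
```

The real library layers vLLM itself, GPU memory cleanup, and DSPy-compatible output formatting on top of this skeleton.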

-----

## 🤝 Contributing

Contributions are welcome! If you find a bug or have a feature request, please feel free to open an issue or submit a pull request.

## 📄 License

This project is licensed under the MIT License - see the `LICENSE` file for details.

            
