# MLX-VLM
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
  - [Command Line Interface (CLI)](#command-line-interface-cli)
  - [Chat UI with Gradio](#chat-ui-with-gradio)
  - [Python Script](#python-script)
- [Multi-Image Chat Support](#multi-image-chat-support)
  - [Supported Models](#supported-models)
  - [Usage Examples](#usage-examples)
- [Fine-tuning](#fine-tuning)
## Installation
The easiest way to get started is to install the `mlx-vlm` package using pip:
```sh
pip install -U mlx-vlm
```
## Usage
### Command Line Interface (CLI)
Generate output from a model using the CLI:
```sh
# Text generation from an image
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg
# Text generation from audio (New)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav
# Multi-modal generation (Image + Audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav
```
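
If you want to drive the CLI from Python, the commands above can be assembled as an argument list. This is a minimal sketch that uses only the flags shown in this README (`--model`, `--max-tokens`, `--temperature`, `--prompt`, `--image`, `--audio`). The helper names `build_generate_command` and `run` are hypothetical, and passing more than one path to `--audio` is an assumption.

```python
import shlex
import subprocess


def build_generate_command(model, prompt=None, images=(), audio=(),
                           max_tokens=100, temperature=None):
    """Assemble an argument list for the mlx_vlm.generate CLI (hypothetical helper)."""
    cmd = ["mlx_vlm.generate", "--model", model, "--max-tokens", str(max_tokens)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if prompt:
        cmd += ["--prompt", prompt]
    if images:
        # Several image paths follow a single --image flag, as in the multi-image example.
        cmd += ["--image", *images]
    if audio:
        # Assumption: --audio accepts paths the same way --image does.
        cmd += ["--audio", *audio]
    return cmd


def run(cmd):
    """Run the command; requires mlx-vlm to be installed and on PATH."""
    return subprocess.run(cmd, check=True)


cmd = build_generate_command(
    "mlx-community/gemma-3n-E2B-it-4bit",
    prompt="Describe what you see and hear",
    images=["/path/to/image.jpg"],
    audio=["/path/to/audio.wav"],
)
# Print a shell-safe version of the command without executing it.
print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell quoting problems when prompts contain spaces or quotes.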
### Chat UI with Gradio
Launch a chat interface using Gradio:
```sh
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
```
### Python Script
Here's an example of how to use MLX-VLM in a Python script:
```python
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)
# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] can also be used with PIL.Image.Image objects
prompt = "Describe this image."
# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)
# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```
#### Audio Example
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config
# Prepare audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."
# Apply chat template with audio
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)
# Generate output with audio
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)
```
#### Multi-Modal Example (Image + Audio)
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
# Load multi-modal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config
# Prepare inputs
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""
# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt,
    num_images=len(image),
    num_audios=len(audio)
)
# Generate output
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)
```
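
The three scripts above differ only in which modalities they pass, so they can be folded into one wrapper. The following is a sketch, not part of the package: `template_counts` and `describe` are hypothetical names, and the calls to `load`, `apply_chat_template`, and `generate` simply mirror the examples above. The `mlx_vlm` imports are deferred so the helper module can be imported on machines without MLX.

```python
def template_counts(images=None, audio=None):
    """Build keyword arguments for apply_chat_template, omitting empty modalities."""
    counts = {}
    if images:
        counts["num_images"] = len(images)
    if audio:
        counts["num_audios"] = len(audio)
    return counts


def describe(model_path, prompt, images=None, audio=None):
    """Load a model and generate text for any mix of image and audio inputs."""
    # Imported lazily: these require mlx-vlm to be installed.
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template

    model, processor = load(model_path)
    formatted_prompt = apply_chat_template(
        processor, model.config, prompt, **template_counts(images, audio)
    )

    kwargs = {"verbose": False}
    if audio:
        kwargs["audio"] = audio
    if images:
        # Images are passed positionally, as in the examples above.
        return generate(model, processor, formatted_prompt, images, **kwargs)
    return generate(model, processor, formatted_prompt, **kwargs)


print(template_counts(images=["/path/to/image.jpg"], audio=["/path/to/audio.wav"]))
```

Calling `describe("mlx-community/gemma-3n-E2B-it-4bit", "Describe what you hear", audio=["/path/to/audio.wav"])` would then correspond to the audio example.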
### Server (FastAPI)
Start the server:
```sh
mlx_vlm.server
```
The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).
#### Available Endpoints
- `/generate` - Main generation endpoint with support for images, audio, and text
- `/chat` - Chat-style interaction endpoint
- `/responses` - OpenAI-compatible endpoint
- `/health` - Check server status
- `/unload` - Unload current model from memory
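
Before sending generation requests, a client may want to confirm that the server is reachable. The sketch below assumes that `/health` answers a plain `GET` with HTTP 200 and that the server listens on `localhost:8000`, as in the curl examples in this README. Neither the HTTP method nor the response body is documented here, so treat both as assumptions. `health_url` and `server_is_up` are hypothetical helpers.

```python
import urllib.error
import urllib.request


def health_url(base_url):
    """Join the base URL and the /health path without doubling slashes."""
    return base_url.rstrip("/") + "/health"


def server_is_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if /health responds with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


print(health_url("http://localhost:8000/"))
```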
#### Usage Examples
##### Basic Image Generation
```sh
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
"image": ["/path/to/repo/examples/images/renewables_california.png"],
"prompt": "This is today'\''s chart for energy demand in California. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?",
"system": "You are a helpful assistant.",
"stream": true,
"max_tokens": 1000
}'
```
##### Audio Support (New)
```sh
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"audio": ["/path/to/audio1.wav", "https://example.com/audio2.mp3"],
"prompt": "Describe what you hear in these audio files",
"stream": true,
"max_tokens": 500
}'
```
##### Multi-Modal (Image + Audio)
```sh
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"image": ["/path/to/image.jpg"],
"audio": ["/path/to/audio.wav"],
"prompt": "",
"max_tokens": 1000
}'
```
##### Chat Endpoint
```sh
curl -X POST "http://localhost:8000/chat" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
"messages": [
{
"role": "user",
"content": "What is in this image?",
"images": ["/path/to/image.jpg"]
}
],
"max_tokens": 100
}'
```
##### OpenAI-Compatible Endpoint
```sh
curl -X POST "http://localhost:8000/responses" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
"messages": [
{
"role": "user",
"content": [
{"type": "input_text", "text": "What is in this image?"},
{"type": "input_image", "image": "/path/to/image.jpg"}
]
}
],
"max_tokens": 100
}'
```
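
The message structure in the request above can also be built programmatically. This sketch reproduces the `input_text` and `input_image` content items from the curl example. The helper name `responses_message` is hypothetical, and attaching more than one `input_image` item to a single message is an assumption.

```python
import json


def responses_message(text, image_paths=(), role="user"):
    """Build one message in the shape used by the /responses example."""
    content = [{"type": "input_text", "text": text}]
    # Assumption: each image becomes its own input_image item.
    content += [{"type": "input_image", "image": path} for path in image_paths]
    return {"role": role, "content": content}


payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [responses_message("What is in this image?", ["/path/to/image.jpg"])],
    "max_tokens": 100,
}
print(json.dumps(payload, indent=2))
```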
#### Request Parameters
- `model`: Model identifier (required)
- `prompt`: Text prompt for generation
- `image`: List of image URLs or local paths (optional)
- `audio`: List of audio URLs or local paths (optional, new)
- `system`: System prompt (optional)
- `messages`: Chat messages for chat/OpenAI endpoints
- `max_tokens`: Maximum tokens to generate
- `temperature`: Sampling temperature
- `top_p`: Top-p sampling parameter
- `stream`: Enable streaming responses
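
The parameters above map directly onto a JSON body, so a small client needs nothing beyond the standard library. This is a sketch under stated assumptions: `build_generate_payload` and `post_json` are hypothetical helpers, the server is assumed to listen on `localhost:8000` as in the curl examples, and the response is returned as raw text because its schema is not described here.

```python
import json
import urllib.request

# Optional request parameters listed in this README for /generate.
OPTIONAL_PARAMS = {"image", "audio", "system", "max_tokens", "temperature", "top_p", "stream"}


def build_generate_payload(model, prompt, **options):
    """Build a /generate request body, dropping unset options and rejecting unknown ones."""
    unknown = set(options) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    payload = {"model": model, "prompt": prompt}
    payload.update({key: value for key, value in options.items() if value is not None})
    return payload


def post_json(url, payload, timeout=300):
    """POST a JSON body and return the response text (requires a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


payload = build_generate_payload(
    "mlx-community/gemma-3n-E2B-it-4bit",
    "Describe what you hear in these audio files",
    audio=["/path/to/audio1.wav"],
    max_tokens=500,
    temperature=None,  # unset options are left out of the body
)
print(json.dumps(payload, indent=2))
# With the server running:
# print(post_json("http://localhost:8000/generate", payload))
```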
## Multi-Image Chat Support
MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.
### Usage Examples
#### Python Script
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config
images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)
output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```
#### Command Line
```sh
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
```
## Video Understanding
MLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.
### Supported Models
The following models support video chat:
1. Qwen2-VL
2. Qwen2.5-VL
3. Idefics3
4. LLaVA
With more coming soon.
### Usage Examples
#### Command Line
```sh
mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0
```
These examples demonstrate how to use multiple images and video with MLX-VLM for more complex visual reasoning tasks.
## Fine-tuning
MLX-VLM supports fine-tuning models with LoRA and QLoRA.
### LoRA & QLoRA
To learn more about LoRA, please refer to the [LoRA.md](./mlx_vlm/LORA.MD) file.
Raw data
{
"_id": null,
"home_page": "https://github.com/Blaizzy/mlx-vlm",
"name": "mlx-vlm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Prince Canuma",
"author_email": "prince.gdt@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/de/c2/f8a664ba84159bdf4ee89511ffe603e3a98fd2e50fabb3b4d01829246793/mlx_vlm-0.3.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Vision Language Models (VLMs) and Omni Models (Vision, Audio and Video support) on Apple silicon with MLX and the Hugging Face Hub",
"version": "0.3.1",
"project_urls": {
"Homepage": "https://github.com/Blaizzy/mlx-vlm"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ceff7a24cb5a70482113f752ffa89b942d1c7e3710c99aa7eeebd6d4d0d34f7a",
"md5": "56d08ade8746ca0897d26df25875cc69",
"sha256": "9de5063149192f4801c2349d63a2949090d989d79aa72821f42afaae9d775f07"
},
"downloads": -1,
"filename": "mlx_vlm-0.3.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "56d08ade8746ca0897d26df25875cc69",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 282925,
"upload_time": "2025-07-12T16:21:30",
"upload_time_iso_8601": "2025-07-12T16:21:30.101358Z",
"url": "https://files.pythonhosted.org/packages/ce/ff/7a24cb5a70482113f752ffa89b942d1c7e3710c99aa7eeebd6d4d0d34f7a/mlx_vlm-0.3.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "dec2f8a664ba84159bdf4ee89511ffe603e3a98fd2e50fabb3b4d01829246793",
"md5": "d6d72047e12fcedfeec57e71f401726a",
"sha256": "10044d5d3ab9bbb0bf0f4cd836a39fd8751bac44452a2a6735dc98925fd228fb"
},
"downloads": -1,
"filename": "mlx_vlm-0.3.1.tar.gz",
"has_sig": false,
"md5_digest": "d6d72047e12fcedfeec57e71f401726a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 227297,
"upload_time": "2025-07-12T16:21:31",
"upload_time_iso_8601": "2025-07-12T16:21:31.720222Z",
"url": "https://files.pythonhosted.org/packages/de/c2/f8a664ba84159bdf4ee89511ffe603e3a98fd2e50fabb3b4d01829246793/mlx_vlm-0.3.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-12 16:21:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Blaizzy",
"github_project": "mlx-vlm",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "mlx",
"specs": [
[
">=",
"0.26.0"
]
]
},
{
"name": "datasets",
"specs": [
[
">=",
"2.19.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.66.2"
]
]
},
{
"name": "transformers",
"specs": [
[
">=",
"4.53.0"
]
]
},
{
"name": "gradio",
"specs": [
[
">=",
"5.19.0"
]
]
},
{
"name": "Pillow",
"specs": [
[
">=",
"10.3.0"
]
]
},
{
"name": "requests",
"specs": [
[
">=",
"2.31.0"
]
]
},
{
"name": "opencv-python",
"specs": [
[
"==",
"4.10.0.84"
]
]
},
{
"name": "mlx-lm",
"specs": [
[
">=",
"0.23.0"
]
]
},
{
"name": "fastapi",
"specs": [
[
">=",
"0.95.1"
]
]
},
{
"name": "soundfile",
"specs": [
[
">=",
"0.13.1"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.15.3"
]
]
},
{
"name": "numpy",
"specs": []
},
{
"name": "mlx-audio",
"specs": []
}
],
"lcname": "mlx-vlm"
}