| Name | llmq |
| Version | 0.0.4 |
| Summary | High-Performance vLLM Job Queue Package |
| upload_time | 2025-08-11 08:21:32 |
| home_page | None |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.9 |
| license | MIT |
| keywords | llm, queue, vllm, gpu, inference, rabbitmq, async |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

            # llmq
[PyPI](https://pypi.org/project/llmq/)
[CI](https://github.com/ipieter/llmq/actions/workflows/ci.yml)
[Python 3.9+](https://www.python.org/downloads/release/python-390/)
[MIT License](https://opensource.org/licenses/MIT)
**A Scheduler for Batched LLM Inference** - Like OpenAI's Batch API, but for self-hosted open-source models. Submit millions of inference jobs, let workers process them with vLLM-backed inference, and stream results back to a single file. Ideal for synthetic data generation, translation pipelines, and batch inference workloads.
> **Note**: API may change until v1.0 as I'm actively developing new features.
<details>
<summary><strong>📋 Table of Contents</strong></summary>
- [Features](#features)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Start RabbitMQ](#start-rabbitmq)
  - [Run Your First Job](#run-your-first-job)
- [How It Works](#how-it-works)
- [Use Cases](#use-cases)
  - [Translation Pipeline](#translation-pipeline)
  - [Data Cleaning at Scale](#data-cleaning-at-scale)
  - [RL Training Rollouts](#rl-training-rollouts)
- [Real-World Usage](#real-world-usage)
- [Worker Types](#worker-types)
- [Core Commands](#core-commands)
  - [Job Management](#job-management)
  - [Worker Management](#worker-management)
  - [Monitoring](#monitoring)
- [Configuration](#configuration)
- [Job Formats](#job-formats)
- [Architecture](#architecture)
- [Performance Tips](#performance-tips)
- [Testing](#testing)
- [Links](#links)
- [Advanced Setup](#advanced-setup)
  - [Docker Compose Setup](#docker-compose-setup)
  - [Singularity Setup](#singularity-setup)
  - [Performance Tuning](#performance-tuning)
  - [Multi-GPU Setup](#multi-gpu-setup)
  - [Troubleshooting](#troubleshooting)
- [Acknowledgments](#acknowledgments)
</details>
## Features
- **High-Performance**: GPU-accelerated inference with vLLM batching
- **Scalable**: RabbitMQ-based distributed queuing, so your GPUs never sit idle
- **Simple**: Unix-friendly CLI with piped input/output
- **Flexible**: Supports standard LLM operations for synthetic data generation; you can combine different models and process Hugging Face datasets directly
**Not for real-time use**: llmq is designed for (laaarge) batch processing, not chat applications or real-time inference. It doesn't support token streaming or optimized time-to-first-token (TTFT).
## Quick Start
### Installation
```bash
pip install llmq
```
### Start RabbitMQ
```bash
docker run -d --name rabbitmq \
  -p 5672:5672 -p 15672:15672 \
  -e RABBITMQ_DEFAULT_USER=llmq \
  -e RABBITMQ_DEFAULT_PASS=llmq123 \
  rabbitmq:3-management
```
### Run Your First Job
```bash
# Start a worker
llmq worker run Unbabel/Tower-Plus-9B translation-queue
# Submit jobs (in another terminal)
echo '{"id": "hello", "prompt": "Translate to Spanish: Hello world"}' | llmq submit translation-queue -
# Results stream back immediately
{"id": "hello", "result": "Hola mundo", "worker_id": "worker-gpu0", "duration_ms": 45.2}
```
## How It Works
Similar to OpenAI's Batch API, llmq separates job submission from processing:
1. **Submit jobs** - Upload thousands of inference requests to a queue
2. **Workers process** - GPU-accelerated workers pull jobs and generate responses  
3. **Stream results** - Get real-time results as jobs complete, with automatic timeout handling (see the sketch below)
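Putting the three steps together, a minimal end-to-end run could look like the following sketch. The queue name and job contents are illustrative; the commands are the ones documented in this README, and the dummy worker lets you try the flow without a GPU.
```bash
# Generate a tiny jobs file (contents are illustrative)
for i in 1 2 3; do
  echo "{\"id\": \"job-$i\", \"prompt\": \"Translate to Spanish: sentence $i\"}"
done > jobs.jsonl
# Step 2: a worker pulls jobs from the queue (dummy worker, no GPU needed)
llmq worker dummy demo-queue &
# Steps 1 and 3: submit the jobs and stream the results back into a file
llmq submit demo-queue jobs.jsonl > results.jsonl
```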
## Use Cases
### Translation Pipeline
Process translation jobs with specialized multilingual models:
```bash
# Start translation worker
llmq worker run Unbabel/Tower-Plus-9B translation-queue
# Example jobs file (jobs.jsonl)
{"id": "job1", "messages": [{"role": "user", "content": "Translate to Spanish: {text}"}], "text": "Hello world"}
{"id": "job2", "messages": [{"role": "user", "content": "Translate to French: {text}"}], "text": "Good morning"}
# Process jobs
llmq submit translation-queue jobs.jsonl > results.jsonl
```
### Data Cleaning at Scale
Clean and process large datasets with custom prompts:
```bash
# Start worker for data cleaning
llmq worker run meta-llama/Llama-3.2-3B-Instruct cleaning-queue
# Submit HuggingFace dataset directly
llmq submit cleaning-queue HuggingFaceFW/fineweb \
  --map 'messages=[{"role": "user", "content": "Clean this text: {text}"}]' \
  --max-samples 10000 > cleaned_data.jsonl
```
### RL Training Rollouts
This currently requires manual orchestration: you switch between queues and manage workers yourself for each training phase. For example, you'd start policy workers, submit rollout jobs, tear down those workers, then start reward-model workers to score the rollouts.
Future versions will add automatic model switching and queue coordination to streamline complex RL workflows with policy models, reward models, and value functions.
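A rough sketch of that manual flow, using only the `llmq worker run` and `llmq submit` commands shown elsewhere in this README (the model names, queue names, and file names are placeholders):
```bash
# Phase 1: generate rollouts with the policy model (placeholder names throughout)
llmq worker run <policy-model> rollout-queue &
POLICY_PID=$!
llmq submit rollout-queue prompts.jsonl > rollouts.jsonl
kill $POLICY_PID   # tear down the policy worker to free the GPU
# Phase 2: score the rollouts with a reward model
# (rollouts.jsonl would first need reshaping into the job format, e.g. with jq)
llmq worker run <reward-model> reward-queue &
REWARD_PID=$!
llmq submit reward-queue reward_jobs.jsonl > scored.jsonl
kill $REWARD_PID
```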
## Real-World Usage
`llmq` has been used to process the following datasets:
- **[fineweb-edu-dutch-mt](https://huggingface.co/datasets/pdelobelle/fineweb-edu-dutch-mt)** - Machine translation of a subset of fineweb-edu to Dutch using a 72B MT model.
- **[fineweb-dutch-synthetic-mt](https://huggingface.co/datasets/pdelobelle/fineweb-dutch-synthetic-mt)** - Translation of the synthetic split of Germanweb to Dutch using a 9B MT model.
## Worker Types
**Production Workers:**
- `llmq worker run <model-name> <queue-name>` - GPU-accelerated inference with vLLM
**Development & Testing:**
- `llmq worker dummy <queue-name>` - Simple echo worker for testing (no GPU required)
All workers support the same configuration options and can be scaled horizontally by running multiple instances.
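For example, horizontal scaling is just running the same command more than once against the same queue; a sketch with the dummy worker (queue name is illustrative):
```bash
# Two workers consuming the same queue; RabbitMQ distributes jobs between them
llmq worker dummy my-queue &
llmq worker dummy my-queue &
llmq status my-queue   # Active Consumers should now report 2
```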
## Core Commands
### Job Management
```bash
# Submit jobs from file or stdin
llmq submit <queue-name> <jobs.jsonl>
llmq submit <queue-name> -  # from stdin
# Monitor progress
llmq status <queue-name>
```
### Worker Management
```bash
# Start GPU-accelerated worker
llmq worker run <model-name> <queue-name>
# Start test worker (no GPU required)
llmq worker dummy <queue-name>
# Start filter worker (job filtering)
llmq worker filter <queue-name> <field> <value>
# Multiple workers: run command multiple times
```
### Monitoring
```bash
# Check connection and queues
llmq status
✅ Connected to RabbitMQ
URL: amqp://llmq:llmq123@localhost:5672/
# View queue statistics
llmq status <queue-name>
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                         ┃ Value               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Queue Name                     │ translation-queue   │
│ Total Messages                 │ 0                   │
│ ├─ Ready (awaiting processing) │ 0                   │
│ └─ Unacknowledged (processing) │ 0                   │
│ Total Bytes                    │ 0 bytes (0.0 MB)    │
│ ├─ Ready Bytes                 │ 0 bytes             │
│ └─ Unacked Bytes               │ 0 bytes             │
│ Active Consumers               │ 0                   │
│ Timestamp                      │ 2025-08-08 11:36:31 │
└────────────────────────────────┴─────────────────────┘
```
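For long-running batches you can poll these statistics with a standard shell tool; for example (queue name is illustrative):
```bash
# Refresh the queue statistics every 30 seconds
watch -n 30 llmq status translation-queue
```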
## Configuration
Configure via environment variables or `.env` file:
```bash
# Connection
RABBITMQ_URL=amqp://llmq:llmq123@localhost:5672/
# Performance tuning
VLLM_QUEUE_PREFETCH=100              # Messages per worker
VLLM_GPU_MEMORY_UTILIZATION=0.9     # GPU memory usage
VLLM_MAX_NUM_SEQS=256               # Batch size
# Job processing
LLMQ_CHUNK_SIZE=10000               # Bulk submission size
```
## Job Formats
### Modern Chat Format (Recommended)
```json
{
  "id": "job-1",
  "messages": [
    {"role": "user", "content": "Translate to {language}: {text}"}
  ],
  "text": "Hello world",
  "language": "Spanish"
}
```
### Traditional Prompt Format
```json
{
  "id": "job-1", 
  "prompt": "Translate to {language}: {text}",
  "text": "Hello world",
  "language": "Spanish"
}
```
Both formats support template substitution with `{variable}` syntax.
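For instance, a job in the prompt format with two extra fields might render as sketched below (the queue name is illustrative, and the rendered string assumes straightforward placeholder substitution):
```bash
# {language} and {text} are filled in from the job's own fields before inference
echo '{"id": "t1", "prompt": "Translate to {language}: {text}", "language": "French", "text": "Good evening"}' \
  | llmq submit demo-queue -
# The worker would then see the rendered prompt: "Translate to French: Good evening"
```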
## Architecture
llmq creates two components per queue:
- **Job Queue**: `<queue-name>` - Where jobs are submitted
- **Results Exchange**: `<queue-name>.results` - Streams results back
Workers use vLLM for GPU acceleration and RabbitMQ for reliable job distribution. Results stream back in real-time as jobs complete.
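If you want to see these components directly, the RabbitMQ management API from the Quick Start setup can list them; a sketch assuming the default vhost and the llmq/llmq123 credentials used above:
```bash
# Job queues appear under their plain names
curl -s -u llmq:llmq123 http://localhost:15672/api/queues | jq '.[].name'
# Results exchanges appear as <queue-name>.results
curl -s -u llmq:llmq123 http://localhost:15672/api/exchanges | jq '.[].name'
```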
## Performance Tips
- **GPU Memory**: Adjust `VLLM_GPU_MEMORY_UTILIZATION` (default: 0.9)
- **Concurrency**: Tune `VLLM_QUEUE_PREFETCH` based on model size
- **Batch Size**: Set `VLLM_MAX_NUM_SEQS` for optimal throughput
- **Multiple GPUs**: vLLM automatically uses all visible GPUs. You can also start multiple workers yourself for data-parallel processing, which [is recommended for larger deployments](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html#external-load-balancing).
## Testing
```bash
# Install with test dependencies
pip install llmq[test]
# Run unit tests (no external dependencies)
pytest -m unit
# Run integration tests (requires RabbitMQ)
pytest -m integration
```
## Links
- **PyPI**: https://pypi.org/project/llmq/
- **Issues**: https://github.com/ipieter/llmq/issues
- **Docker Compose Setup**: [docker-compose.yml](#docker-compose-setup)
- **HPC/SLURM/Singularity Setup**: [Singularity Setup](#singularity-setup)
---
## Advanced Setup
### Docker Compose Setup
Create `docker-compose.yml`:
```yaml
version: '3.8'
services:
  rabbitmq:
    image: rabbitmq:3-management
    container_name: llmq-rabbitmq
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: llmq
      RABBITMQ_DEFAULT_PASS: llmq123
    volumes:
      - rabbitmq_data:/var/lib/rabbitmq
    restart: unless-stopped
volumes:
  rabbitmq_data:
```
Run with: `docker-compose up -d`
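After the container is up, a quick sanity check that the broker is reachable (using the same credentials as the compose file):
```bash
# Point llmq at the compose-managed broker and verify the connection
export RABBITMQ_URL=amqp://llmq:llmq123@localhost:5672/
llmq status
# Or hit the management API directly
curl -u llmq:llmq123 http://localhost:15672/api/overview
```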
### Singularity Setup
For HPC clusters:
```bash
# Use provided utility
./utils/start_singularity_broker.sh
# Set connection URL  
export RABBITMQ_URL=amqp://guest:guest@$(hostname):5672/
# Test connection
llmq status
```
### Performance Tuning
#### GPU Memory Management
```bash
# Reduce for large models
export VLLM_GPU_MEMORY_UTILIZATION=0.7
# Increase for small models
export VLLM_GPU_MEMORY_UTILIZATION=0.95
```
#### Concurrency Tuning
```bash
# Higher throughput, more memory usage
export VLLM_QUEUE_PREFETCH=200
# Lower memory usage, potentially lower throughput
export VLLM_QUEUE_PREFETCH=50
```
#### Batch Processing
```bash
# Larger batches for better GPU utilization
export VLLM_MAX_NUM_SEQS=512
# Smaller batches for lower latency
export VLLM_MAX_NUM_SEQS=64
```
### Multi-GPU Setup
```bash
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 llmq worker run model-name queue-name
# vLLM automatically distributes across all visible GPUs
```
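As noted under Performance Tips, larger deployments can instead run one worker per GPU for data-parallel processing; a sketch (model and queue names are placeholders):
```bash
# One worker per GPU, all consuming the same queue
for gpu in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$gpu llmq worker run <model-name> <queue-name> &
done
wait
```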
### Troubleshooting
#### Connection Issues
```bash
# Check RabbitMQ status
docker ps
docker logs rabbitmq
# Test management API
curl -u llmq:llmq123 http://localhost:15672/api/overview
```
#### Worker Issues
```bash
# Check GPU memory
nvidia-smi
# Reduce GPU utilization if needed
export VLLM_GPU_MEMORY_UTILIZATION=0.7
# View structured logs
llmq worker run model queue 2>&1 | jq .
```
#### Queue Issues
```bash
# Check queue health
llmq health queue-name
# View failed jobs
llmq errors queue-name --limit 10
# Access RabbitMQ management UI
open http://localhost:15672
```
## Acknowledgments
🇪🇺 Development and testing of this project were supported by computational resources provided by EuroHPC under grant EHPC-AIF-2025PG01-128.
            
         