# SOLLOL - Production-Ready Orchestration for Local LLM Clusters
<div align="center">
[Python 3.8+](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[Tests](https://github.com/BenevolentJoker-JohnL/SOLLOL/actions/workflows/tests.yml)
[Coverage](https://codecov.io/gh/BenevolentJoker-JohnL/SOLLOL)
**Open-source orchestration layer that combines intelligent task routing with distributed model inference for local LLM clusters.**
[Quick Start](#quick-start) • [Features](#why-sollol) • [Architecture](#architecture) • [Documentation](#documentation) • [Examples](#examples)
</div>
---
## 🎯 What is SOLLOL?
SOLLOL (Super Ollama Load balancer & Orchestration Layer) transforms your collection of Ollama nodes into an **intelligent AI cluster** with adaptive routing and automatic failover, all running on your own hardware.
### The Problem
You have multiple machines with GPUs running Ollama, but:
- ❌ Manual node selection for each request
- ❌ No way to run models larger than your biggest GPU
- ❌ Can't distribute multi-agent workloads efficiently
- ❌ No automatic failover or load balancing
- ❌ Zero visibility into cluster performance
### The SOLLOL Solution
SOLLOL provides:
- ✅ **Intelligent routing** that learns which nodes work best for each task
- ✅ **Model sharding** to run 70B+ models across multiple machines
- ✅ **Parallel agent execution** for multi-agent frameworks
- ✅ **Auto-discovery** of all nodes and capabilities
- ✅ **Built-in observability** with real-time metrics
- ✅ **Zero-config deployment** - just point and go
---
## 🚀 Why SOLLOL?
### 1. **Two Distribution Modes in One System**
SOLLOL combines both task distribution and model sharding:
#### 📊 Task Distribution (Horizontal Scaling)
Distribute **multiple requests** across your cluster in parallel:
```python
# Run 10 agents simultaneously across 5 nodes
import asyncio
from sollol import OllamaPool

pool = OllamaPool.auto_configure()
responses = await asyncio.gather(*[
pool.chat(model="llama3.2", messages=[...])
for _ in range(10)
])
# Parallel execution across available nodes
```
#### 🧩 Model Sharding (Vertical Scaling)
Run **single large models** that don't fit on one machine:
```python
# Run larger models across multiple nodes
# Note: Verified with 13B across 2-3 nodes; larger models not extensively tested
router = HybridRouter(
enable_distributed=True,
num_rpc_backends=4
)
response = await router.route_request(
model="llama3:70b", # Sharded automatically
messages=[...]
)
```
**Use them together!** Small models use task distribution, large models use sharding.
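For example, a single `HybridRouter` can serve both modes. The sketch below is illustrative only and uses just the parameters demonstrated in the Quick Start section later in this README; SOLLOL decides per request whether to use the pool or the sharded backends:

```python
from sollol.sync_wrapper import HybridRouter, OllamaPool

# Sketch: one router, both distribution modes (parameters as in the Quick Start below)
router = HybridRouter(
    ollama_pool=OllamaPool.auto_configure(),  # task distribution across Ollama nodes
    enable_distributed=True,                  # enable llama.cpp model sharding
    num_rpc_backends=3,
)

# Small model -> served by the best Ollama node in the pool
small = router.route_request(
    model="llama3.2",
    messages=[{"role": "user", "content": "Quick question"}],
)

# Large model -> sharded across RPC backends
large = router.route_request(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Long analysis task"}],
)
```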
---
### 2. **Intelligent, Not Just Balanced**
SOLLOL doesn't just distribute requests randomly; it **learns** and **optimizes**:
| Feature | Simple Load Balancer | SOLLOL |
|---------|---------------------|---------|
| **Routing** | Round-robin | Context-aware scoring |
| **Learning** | None | Adapts from performance history |
| **Resource Awareness** | None | GPU/CPU/memory-aware |
| **Task Optimization** | None | Routes by task type complexity |
| **Failover** | Manual | Automatic with health checks |
| **Priority** | FIFO | Priority queue with fairness |
**Example**: SOLLOL automatically routes:
- Heavy generation tasks → GPU nodes with 24GB VRAM
- Fast embeddings → CPU nodes or smaller GPUs
- Critical requests → Fastest, most reliable nodes
- Batch processing → Lower priority, distributed load
---
### 3. **Production-Ready from Day One**
```python
from sollol import SOLLOL, SOLLOLConfig
# Literally 3 lines to production
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start()  # ✅ Gateway running on :8000
```
**Out of the box**:
- Auto-discovery of Ollama nodes
- Health monitoring and failover
- Prometheus metrics
- Web dashboard
- Connection pooling
- Request hedging
- Priority queuing
---
## 🏗️ Architecture
### High-Level Overview
```
┌──────────────────────────────────────────────────────────┐
│                     Your Application                     │
│          (SynapticLlamas, custom agents, etc.)           │
└────────────────────────────┬─────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│                  SOLLOL Gateway (:8000)                  │
│  ┌────────────────────────────────────────────────────┐  │
│  │             Intelligent Routing Engine             │  │
│  │  • Analyzes: task type, complexity, resources      │  │
│  │  • Scores: all nodes based on context              │  │
│  │  • Learns: from performance history                │  │
│  │  • Routes: to optimal node                         │  │
│  └────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────┐  │
│  │              Priority Queue + Failover             │  │
│  └────────────────────────────────────────────────────┘  │
└─────────┬─────────────────────────┬──────────────────────┘
          │                         │
          ▼                         ▼
  ┌───────────────┐         ┌───────────────┐
  │  Task Mode    │         │  Shard Mode   │
  │  Ray Cluster  │         │  llama.cpp    │
  └───────┬───────┘         └───────┬───────┘
          │                         │
          ▼                         ▼
┌──────────────────────────────────────────────────────────┐
│                Your Heterogeneous Cluster                │
│  GPU (24GB) │ GPU (16GB) │ CPU (64c) │ GPU (8GB) │ ...   │
└──────────────────────────────────────────────────────────┘
```
### How Routing Works
```python
# 1. Request arrives
POST /api/chat {
"model": "llama3.2",
"messages": [{"role": "user", "content": "Complex analysis task..."}],
"priority": 8
}
# 2. SOLLOL analyzes
task_type = "generation" # Auto-detected
complexity = "high" # Token count analysis
requires_gpu = True # Based on task
estimated_duration = 3.2s # From history
# 3. SOLLOL scores all nodes
Node A (GPU 24GB, load: 0.2, latency: 120ms) → Score: 185.3 ✓ WINNER
Node B (GPU 8GB, load: 0.6, latency: 200ms)  → Score: 92.1
Node C (CPU only, load: 0.1, latency: 80ms)  → Score: 41.2
# 4. Routes to Node A, monitors execution, learns for next time
```
**Scoring Algorithm**:
```
Score = 100.0 (baseline)
      × success_rate (0.0-1.0)
      ÷ (1 + latency_penalty)
      × gpu_bonus (1.5x if GPU available & needed)
      ÷ (1 + load_penalty)
      × priority_alignment
      × task_specialization
```
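In Python terms, the heuristic looks roughly like the sketch below. This is illustrative only, not SOLLOL's internal code; the penalty scaling (e.g., dividing latency by 100) is a made-up placeholder.

```python
def score_node(success_rate, latency_ms, has_gpu, needs_gpu, load,
               priority_alignment=1.0, task_specialization=1.0):
    """Illustrative version of the scoring formula above (not SOLLOL internals)."""
    score = 100.0                        # baseline
    score *= success_rate                # 0.0-1.0
    score /= 1 + latency_ms / 100.0      # latency penalty (placeholder scale)
    if needs_gpu and has_gpu:
        score *= 1.5                     # GPU bonus when the task needs a GPU
    score /= 1 + load                    # load penalty
    score *= priority_alignment
    score *= task_specialization
    return score

# Example call (numbers will not match the walkthrough above, since the real
# scorer's penalty scales and extra terms are not public):
print(round(score_node(success_rate=0.99, latency_ms=120,
                       has_gpu=True, needs_gpu=True, load=0.2), 1))
```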
---
## 📦 Installation
### Quick Install (PyPI)
```bash
pip install sollol
```
### From Source
```bash
git clone https://github.com/BenevolentJoker-JohnL/SOLLOL.git
cd SOLLOL
pip install -e .
```
---
## ⚡ Quick Start
### 1. Synchronous API (No async/await needed!)
**New in v0.3.6:** SOLLOL now provides a synchronous API for easier integration with non-async applications.
```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority
# Auto-discover and connect to all Ollama nodes
pool = OllamaPool.auto_configure()
# Make requests - SOLLOL routes intelligently
# No async/await needed!
response = pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}],
priority=Priority.HIGH, # Semantic priority levels
timeout=60 # Request timeout in seconds
)
print(response['message']['content'])
print(f"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}")
```
**Key features of the synchronous API:**
- ✅ No async/await syntax required
- ✅ Works with synchronous agent frameworks (see the thread-pool sketch below)
- ✅ Same intelligent routing and features
- ✅ Runs async code in a background thread automatically
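Because `pool.chat()` is a plain blocking call, concurrency in a synchronous framework can come from an ordinary thread pool. A minimal sketch, assuming the pool object can be shared across threads (which the background-thread design suggests):

```python
from concurrent.futures import ThreadPoolExecutor

from sollol.sync_wrapper import OllamaPool

pool = OllamaPool.auto_configure()
prompts = [f"Agent task {i}" for i in range(10)]

def run_agent(prompt):
    # Each call blocks in its own worker thread; SOLLOL still picks the node.
    return pool.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
    )

with ThreadPoolExecutor(max_workers=5) as executor:
    responses = list(executor.map(run_agent, prompts))

print(f"Completed {len(responses)} agent calls")
```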
---
### 2. Async API (Original)
For async applications, use the original async API:
```python
from sollol import OllamaPool
# Auto-discover and connect to all Ollama nodes
pool = await OllamaPool.auto_configure()
# Make requests - SOLLOL routes intelligently
response = await pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")
print(f"Task type: {response['_sollol_routing']['task_type']}")
```
---
### 3. Priority-Based Multi-Agent Execution
**New in v0.3.6:** Use semantic priority levels and role-based mapping.
```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority, get_priority_for_role
pool = OllamaPool.auto_configure()
# Define agents with different priorities
agents = [
{"name": "Researcher", "role": "researcher"}, # Priority 8
{"name": "Editor", "role": "editor"}, # Priority 6
{"name": "Summarizer", "role": "summarizer"}, # Priority 5
]
for agent in agents:
priority = get_priority_for_role(agent["role"])
response = pool.chat(
model="llama3.2",
messages=[{"role": "user", "content": f"Task for {agent['name']}"}],
priority=priority
)
# User-facing agents get priority, background tasks wait
```
**Priority levels available:**
- `Priority.CRITICAL` (10) - Mission-critical
- `Priority.URGENT` (9) - Fast response needed
- `Priority.HIGH` (7) - Important tasks
- `Priority.NORMAL` (5) - Default
- `Priority.LOW` (3) - Background tasks
- `Priority.BATCH` (1) - Can wait
---
### 4. Model Sharding with llama.cpp (Large Models)
**Run models larger than your biggest GPU** by distributing layers across multiple machines.
#### When to Use Model Sharding
Use model sharding when:
- ✅ Model doesn't fit on your largest GPU (e.g., 70B models on 16GB GPUs)
- ✅ You have multiple machines with network connectivity
- ✅ You can tolerate slower inference for capability
Don't use sharding when:
- ❌ Model fits on a single GPU (use task distribution instead)
- ❌ You need maximum inference speed
- ❌ Network latency is high (>10ms between machines)
#### Quick Start: Auto-Setup (Easiest)
```python
from sollol.sync_wrapper import HybridRouter, OllamaPool
# SOLLOL handles all setup automatically
router = HybridRouter(
ollama_pool=OllamaPool.auto_configure(),
enable_distributed=True, # Enable model sharding
auto_setup_rpc=True, # Auto-configure RPC backends
num_rpc_backends=3 # Distribute across 3 machines
)
# Use large model that doesn't fit on one machine
response = router.route_request(
model="llama3.1:70b", # Automatically sharded across backends
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response['message']['content'])
```
**What happens automatically:**
1. SOLLOL discovers available RPC backends on your network
2. Extracts the GGUF model from Ollama storage
3. Starts llama-server coordinator with optimal settings
4. Distributes model layers across backends
5. Routes your request to the coordinator
#### RPC Server Auto-Installation
**SOLLOL can automatically clone, build, and start llama.cpp RPC servers for you!**
**One-line installation:**
```python
from sollol.rpc_auto_setup import auto_setup_rpc_backends
# Automatically: clone → build → start RPC servers
backends = auto_setup_rpc_backends(num_backends=2)
# Output: [{'host': '127.0.0.1', 'port': 50052}, {'host': '127.0.0.1', 'port': 50053}]
```
**What this does:**
1. ✅ Scans the network for existing RPC servers
2. ✅ If none found: clones llama.cpp to `~/llama.cpp`
3. ✅ Builds llama.cpp with RPC support (`cmake -DGGML_RPC=ON`)
4. ✅ Starts RPC servers on ports 50052-50053
5. ✅ Returns a ready-to-use backend list (see the wiring sketch below)
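If you want to hand those backends to the manual-setup APIs shown later in this section, the wiring might look like the following sketch (assuming the registry accepts the same `grpc://host:port` strings used in the Manual Setup example):

```python
from sollol.rpc_auto_setup import auto_setup_rpc_backends
from sollol.rpc_registry import RPCBackendRegistry

# Discover or build/start local RPC servers, then register them explicitly
backends = auto_setup_rpc_backends(num_backends=2)

registry = RPCBackendRegistry()
for i, backend in enumerate(backends, start=1):
    registry.add_backend(f"rpc_{i}", f"grpc://{backend['host']}:{backend['port']}")

print(registry.get_all_backends())
```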
**CLI installation:**
```bash
# Full automated setup (clone + build + install systemd service)
python3 -m sollol.setup_llama_cpp --all
# Or step by step
python3 -m sollol.setup_llama_cpp --clone # Clone llama.cpp
python3 -m sollol.setup_llama_cpp --build # Build with RPC support
python3 -m sollol.setup_llama_cpp --start # Start RPC server
```
**Docker IP Resolution:**
SOLLOL automatically resolves Docker container IPs to accessible host IPs:
```python
# If Docker container reports IP 172.17.0.5:11434
# SOLLOL automatically resolves to:
# → 127.0.0.1:11434 (published port mapping)
# → host IP (if accessible)
# → Docker host gateway
from sollol import is_docker_ip, resolve_docker_ip
# Check if IP is Docker internal
is_docker = is_docker_ip("172.17.0.5") # True
# Resolve Docker IP to accessible IP
accessible_ip = resolve_docker_ip("172.17.0.5", port=11434)
# Returns: "127.0.0.1" or host IP
```
**Network Discovery with Docker Support:**
```python
from sollol import OllamaPool
# Auto-discover nodes (automatically resolves Docker IPs)
pool = OllamaPool.auto_configure()
# Manual control
from sollol.discovery import discover_ollama_nodes
nodes = discover_ollama_nodes(auto_resolve_docker=True)
```
**Multi-Node Production Setup:**
For distributed clusters, use systemd services on each node:
```bash
# On each RPC node
sudo systemctl enable llama-rpc@50052.service
sudo systemctl start llama-rpc@50052.service
```
See [SOLLOL_RPC_SETUP.md](https://github.com/BenevolentJoker-JohnL/FlockParser/blob/main/SOLLOL_RPC_SETUP.md) for complete installation guide.
#### Architecture: How It Works
```
   ┌────────────────────────────────────────┐
   │    Llama 3.1 70B Model (40GB total)    │
   │          Distributed Sharding          │
   └────────────────────┬───────────────────┘
                        │
       ┌────────────────┼────────────────┐
       │                │                │
       ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Machine 1   │ │  Machine 2   │ │  Machine 3   │
│ Layers 0-26  │ │ Layers 27-53 │ │ Layers 54-79 │
│   (~13GB)    │ │   (~13GB)    │ │   (~13GB)    │
│ RPC Backend  │ │ RPC Backend  │ │ RPC Backend  │
└──────────────┘ └──────────────┘ └──────────────┘
       ▲                ▲                ▲
       └────────────────┼────────────────┘
                        │
            ┌───────────┴──────────┐
            │     llama-server     │
            │     Coordinator      │
            │     (Port 18080)     │
            └──────────────────────┘
```
#### Manual Setup (Advanced)
For explicit control over RPC backends:
```python
from sollol.llama_cpp_coordinator import LlamaCppCoordinator
from sollol.rpc_registry import RPCBackendRegistry
# 1. Register RPC backends explicitly
registry = RPCBackendRegistry()
registry.add_backend("rpc_1", "grpc://10.9.66.45:50052")
registry.add_backend("rpc_2", "grpc://10.9.66.46:50052")
registry.add_backend("rpc_3", "grpc://10.9.66.47:50052")
# 2. Create coordinator
coordinator = LlamaCppCoordinator(
coordinator_port=18080,
rpc_backends=registry.get_all_backends(),
context_size=4096,
gpu_layers=-1 # Use all available GPU layers
)
# 3. Start and use
await coordinator.start(model_name="llama3.1:70b")
response = await coordinator.generate(
prompt="Explain the theory of relativity",
max_tokens=500
)
```
#### Performance Expectations
| Model Size | Single GPU | Sharded (3 nodes) | Trade-off |
|------------|-----------|-------------------|-----------|
| **13B** | ✅ 20 tok/s | ✅ 5 tok/s | -75% speed, works on 3× smaller GPUs |
| **70B** | ❌ OOM | ⚠️ 3-5 tok/s (est.) | Enables model that won't run otherwise |
**Trade-offs:**
- 🐌 **Startup**: 2-5 minutes (model distribution + loading)
- 🐌 **Inference**: ~4x slower than local (network overhead)
- ✅ **Capability**: Run models that won't fit on a single GPU
**Learn More:**
- 📖 [Complete llama.cpp Guide](docs/llama_cpp_guide.md) - Setup, optimization, troubleshooting
- 💻 [Working Examples](examples/llama_cpp_distributed.py) - 5 complete examples including conversation, batch processing, error handling
---
### 5. Batch Processing API
**New in v0.7.0:** RESTful API for asynchronous batch job management.
Submit large-scale batch operations (thousands of embeddings, bulk inference) and track progress via job IDs:
```python
import requests
# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
"model": "nomic-embed-text",
"documents": ["Document 1", "Document 2", ...], # Can be thousands
"metadata": {"source": "knowledge_base"} # Optional metadata
})
job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")
# Poll for job status
import time
while True:
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}").json()
progress = status["progress"]["percent"]
print(f"Progress: {progress}%")
if status["status"] == "completed":
break
time.sleep(1)
# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}").json()
embeddings = results["results"] # List of embedding vectors
print(f"Processed {len(embeddings)} documents in {status['duration_seconds']}s")
```
**Available Batch Endpoints:**
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status
- `GET /api/batch/results/{job_id}` - Get job results
- `GET /api/batch/jobs?limit=100` - List recent jobs
- `DELETE /api/batch/jobs/{job_id}` - Cancel job
**Use cases:**
- Embedding large document collections (thousands of documents)
- Bulk inference for batch predictions
- Background processing without blocking
- Long-running operations with progress tracking
---
### 6. SOLLOL Detection
**New in v0.3.6:** Detect if SOLLOL is running vs native Ollama.
```python
import requests
def is_sollol(url="http://localhost:11434"):
"""Check if SOLLOL is running at the given URL."""
# Method 1: Check X-Powered-By header
response = requests.get(url)
if response.headers.get("X-Powered-By") == "SOLLOL":
return True
# Method 2: Check health endpoint
response = requests.get(f"{url}/api/health")
data = response.json()
if data.get("service") == "SOLLOL":
return True
return False
# Use it
if is_sollol("http://localhost:11434"):
print("β SOLLOL detected - using intelligent routing")
else:
print("Native Ollama detected")
```
**Why this matters:**
- Enables graceful fallback in client applications (see the sketch below)
- Makes SOLLOL a true drop-in replacement
- Clients can auto-detect and use SOLLOL features when available
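A minimal fallback sketch: send the SOLLOL-specific `priority` field (shown in "How Routing Works") only when the response header says SOLLOL is answering, and a plain Ollama payload otherwise. The `stream: false` flag simply asks for a single JSON response.

```python
import requests

url = "http://localhost:11434"
sollol_detected = requests.get(url).headers.get("X-Powered-By") == "SOLLOL"

payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,  # single JSON response instead of a stream
}
if sollol_detected:
    payload["priority"] = 8  # SOLLOL-specific field (see "How Routing Works" above)

response = requests.post(f"{url}/api/chat", json=payload)
print(response.json().get("message", {}).get("content"))
```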
---
### 7. Production Gateway
```python
from sollol import SOLLOL, SOLLOLConfig
# Full production setup
config = SOLLOLConfig(
ray_workers=4,
dask_workers=2,
hosts=["gpu-1:11434", "gpu-2:11434", "cpu-1:11434"],
gateway_port=8000,
metrics_port=9090
)
sollol = SOLLOL(config)
sollol.start() # Blocks and runs gateway
# Access via HTTP:
# curl http://localhost:8000/api/chat -d '{...}'
# curl http://localhost:8000/api/stats
# curl http://localhost:8000/api/dashboard
```
---
## 🎓 Use Cases
### 1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)
**Problem**: Running 10 agents sequentially takes 10x longer than necessary.
**Solution**: SOLLOL distributes agents across nodes in parallel.
```python
# Before: Sequential execution on one node
# After: Parallel execution with SOLLOL
import asyncio
from sollol import OllamaPool

pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
pool.chat(model="llama3.2", messages=agent_prompts[i])
for i in range(10)
])
# Speedup depends on number of available nodes and their capacity
```
### 2. Large Model Inference
**Problem**: Your model doesn't fit in available VRAM.
**Solution**: SOLLOL can shard models across multiple machines via llama.cpp.
```python
# Distribute model across multiple nodes
# Note: Verified with 13B models; larger models not extensively tested
router = HybridRouter(
enable_distributed=True,
num_rpc_backends=4
)
# Trade-off: Slower startup/inference but enables running larger models
```
### 3. Mixed Workloads
**Problem**: Different tasks need different resources.
**Solution**: SOLLOL routes each task to the optimal node.
```python
pool = OllamaPool.auto_configure()
# Heavy generation → GPU node
chat = pool.chat(model="llama3.2:70b", messages=[...])
# Fast embeddings → CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])
# SOLLOL automatically routes each to the best available node
```
### 4. High Availability Production
**Problem**: Node failures break your service.
**Solution**: SOLLOL auto-fails over and recovers.
```python
# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy
```
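From the application side this is transparent; the only case client code still has to handle is when no healthy node is left. A sketch (the broad `except` is a placeholder for whatever error your deployment surfaces):

```python
from sollol.sync_wrapper import OllamaPool

pool = OllamaPool.auto_configure()

try:
    # If the chosen node dies mid-request, SOLLOL retries on another node;
    # this call only fails once no healthy node remains.
    response = pool.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "Summarize today's report"}],
    )
    print(response["message"]["content"])
except Exception as exc:  # placeholder: narrow this to your deployment's error type
    print(f"No healthy nodes available, degrading gracefully: {exc}")
```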
---
## 📊 Performance & Benchmarks
### Validation Status
**What's Been Validated ✅**
- Single-node baseline performance measured
- Code exists and is reviewable (75+ modules)
- Tests pass in CI (57 tests, coverage tracked)
- Architecture implements intelligent routing
**What Needs Validation ⚠️**
- Comparative benchmarks (SOLLOL vs round-robin)
- Multi-node performance improvements
- Real-world latency/throughput gains
📖 **See [BENCHMARKING.md](BENCHMARKING.md) for complete validation roadmap and how to run comparative tests.**
---
### Measured Baseline Performance
**Single Ollama Node** (llama3.2-3B, 50 requests, concurrency=5):
- ✅ **Success Rate:** 100%
- ⚡ **Throughput:** 0.51 req/s
- 📈 **Average Latency:** 5,659 ms
- 📈 **P95 Latency:** 11,299 ms
- 📈 **P99 Latency:** 12,259 ms
**Hardware:** Single Ollama instance with 75+ models loaded
**Data:** See [`benchmarks/results/`](benchmarks/results/) for raw JSON
**Run Your Own:**
```bash
# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50
# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60
```
---
### Projected Performance (Unvalidated)
**Note:** These are architectural projections, not measured results. Requires multi-node cluster setup for validation.
**Theory:** With N nodes and parallelizable workload:
- Task distribution can approach NΓ parallelization (limited by request rate)
- Intelligent routing should reduce tail latencies vs random selection
- Resource-aware placement reduces contention and failures
**Reality:** Requires multi-node cluster validation. See [BENCHMARKING.md](BENCHMARKING.md) for test procedure and [CODE_WALKTHROUGH.md](CODE_WALKTHROUGH.md) for implementation details.
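For intuition only, here is a back-of-envelope projection that plugs in the measured single-node baseline above (0.51 req/s); these numbers are not benchmarks and ignore routing overhead:

```python
# Ideal-case projection: throughput scales with node count until the
# offered request rate becomes the bottleneck.
baseline_throughput = 0.51   # req/s, measured single-node baseline above
offered_load = 5.0           # req/s, hypothetical incoming request rate

for n_nodes in (1, 2, 4, 8):
    projected = min(offered_load, n_nodes * baseline_throughput)
    print(f"{n_nodes} node(s): ~{projected:.2f} req/s (ideal, unvalidated)")
```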
### Model Sharding Performance
| Model | Single 24GB GPU | SOLLOL (3×16GB) | Status |
|-------|----------------|-----------------|-----------|
| **13B** | ✅ ~20 tok/s | ✅ ~5 tok/s | ✅ Verified working |
| **70B** | ❌ OOM | ⚠️ Estimated ~3-5 tok/s | ⚠️ Not extensively tested |
**When to use sharding**: When model doesn't fit on your largest GPU. You trade speed for capability.
**Performance trade-offs**: Distributed inference is 2-5 minutes slower to start and ~4x slower for inference compared to local. Use only when necessary.
### Overhead
- **Routing decision**: ~5-10ms (tested with 5-10 nodes)
- **Network overhead**: Varies by network (typically 5-20ms)
- **Total added latency**: ~20-50ms
- **Benefit**: Better resource utilization + automatic failover
---
## 🛠️ Advanced Configuration
### Custom Routing Strategy
```python
from sollol import OllamaPool
pool = OllamaPool(
nodes=[
{"host": "gpu-1.local", "port": 11434, "priority": 10}, # Prefer this
{"host": "gpu-2.local", "port": 11434, "priority": 5},
{"host": "cpu-1.local", "port": 11434, "priority": 1}, # Last resort
],
enable_intelligent_routing=True,
enable_hedging=True, # Duplicate critical requests
max_queue_size=100
)
```
### Priority-Based Scheduling
```python
# Critical user-facing request
response = pool.chat(
model="llama3.2",
messages=[...],
priority=10 # Highest priority
)
# Background batch job
response = pool.chat(
model="llama3.2",
messages=[...],
priority=1 # Lowest priority
)
# SOLLOL ensures high-priority requests jump the queue
```
### Observability & Monitoring
```python
# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")
# Per-node breakdown
for host, metrics in stats['hosts'].items():
print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")
```
```bash
# Prometheus metrics endpoint
curl http://localhost:9090/metrics
# sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
# sollol_latency_seconds{host="gpu-1:11434"} 0.234
# sollol_success_rate{host="gpu-1:11434"} 0.98
```
---
## 🔌 Integration Examples
### SynapticLlamas Integration
```python
from sollol import SOLLOL, SOLLOLConfig
from synaptic_llamas import AgentOrchestrator
# Setup SOLLOL for multi-agent orchestration
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start(blocking=False)
# SynapticLlamas now uses SOLLOL for intelligent routing
orchestrator = AgentOrchestrator(
llm_endpoint="http://localhost:8000/api/chat"
)
# All agents automatically distributed and optimized
orchestrator.run_parallel_agents([...])
```
### LangChain Integration
```python
from langchain.llms import Ollama
from sollol import OllamaPool
# Use SOLLOL as LangChain backend
pool = OllamaPool.auto_configure()
llm = Ollama(
base_url="http://localhost:8000",
model="llama3.2"
)
# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")
```
---
## 📚 Documentation
- **[Architecture Guide](ARCHITECTURE.md)** - Deep dive into system design
- **[Batch Processing API](BATCH_API.md)** - Complete guide to batch job management (NEW in v0.7.0)
- API endpoints and examples
- Job lifecycle and progress tracking
- Best practices and error handling
- **[llama.cpp Distributed Inference Guide](docs/llama_cpp_guide.md)** - Complete guide to model sharding
- Setup and configuration
- Performance optimization
- Troubleshooting common issues
- Advanced topics (custom layer distribution, monitoring, etc.)
- **[Integration Examples](examples/integration/)** - Practical integration patterns
- [Synchronous Agent Integration](examples/integration/sync_agents.py)
- [Priority Configuration](examples/integration/priority_mapping.py)
- [Load Balancer Wrapper](examples/integration/load_balancer_wrapper.py)
- **[llama.cpp Distributed Examples](examples/llama_cpp_distributed.py)** - Model sharding examples
- Auto-setup and manual configuration
- Multi-turn conversations with monitoring
- Batch processing with multiple models
- Error handling and recovery patterns
- **[Deployment Guide](docs/deployment.md)** - Production deployment patterns
- **[API Reference](docs/api.md)** - Complete API documentation
- **[Performance Tuning](docs/performance.md)** - Optimization guide
- **[SynapticLlamas Learnings](SYNAPTICLLAMAS_LEARNINGS.md)** - Features from production use
---
## 🆕 What's New in v0.7.0
### 📦 Batch Processing API
Complete RESTful API for asynchronous batch job management. Submit large-scale batch operations (embeddings, bulk inference) and track progress via job IDs.
```python
import requests
# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
"model": "nomic-embed-text",
"documents": ["doc1", "doc2", ...], # Thousands of documents
})
job_id = response.json()["job_id"]
# Check status
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}")
print(status.json()["progress"]["percent"]) # 100.0
# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}")
embeddings = results.json()["results"]
```
**Batch API Endpoints:**
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status with progress tracking
- `GET /api/batch/results/{job_id}` - Retrieve job results and errors
- `DELETE /api/batch/jobs/{job_id}` - Cancel running jobs
- `GET /api/batch/jobs?limit=100` - List recent jobs
**Features:**
- UUID-based job tracking with 5 states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
- Automatic TTL-based cleanup (1 hour default)
- Progress tracking: completed_items, failed_items, percentage
- Duration calculation and metadata storage
- Async job execution via Dask distributed processing
### Previous Features (v0.3.6+)
**Synchronous API** - No async/await required:
```python
from sollol.sync_wrapper import OllamaPool
pool = OllamaPool.auto_configure()
response = pool.chat(...) # Synchronous call
```
**Priority Helpers** - Semantic priority levels:
```python
from sollol.priority_helpers import Priority
priority = Priority.HIGH # 7
```
**SOLLOL Detection:**
- `X-Powered-By: SOLLOL` header on all responses
- `/api/health` endpoint returns `{"service": "SOLLOL", "version": "0.7.0"}`
---
## Comparison
### SOLLOL vs. Simple Load Balancers
| Feature | nginx/HAProxy | SOLLOL |
|---------|--------------|---------|
| Routing | Round-robin/random | Context-aware, adapts from history |
| Resource awareness | None | GPU/CPU/memory-aware |
| Failover | Manual config | Automatic detection & recovery |
| Model sharding | ❌ | ✅ llama.cpp integration |
| Task prioritization | ❌ | ✅ Priority queue |
| Observability | Basic | Rich metrics + dashboard |
| Setup | Complex config | Auto-discover |
### SOLLOL vs. Kubernetes
| Feature | Kubernetes | SOLLOL |
|---------|-----------|---------|
| **Complexity** | High - requires cluster setup | Low - pip install |
| **AI-specific** | Generic container orchestration | Purpose-built for LLMs |
| **Intelligence** | None | Task-aware routing |
| **Model sharding** | Manual | Automatic |
| **Best for** | Large-scale production | AI-focused teams |
**Use both!** Deploy SOLLOL on Kubernetes for ultimate scalability.
---
## Contributing
We welcome contributions! Areas we'd love help with:
- ML-based routing predictions
- Additional monitoring integrations
- Cloud provider integrations
- Performance optimizations
- Documentation improvements
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
---
## License
MIT License - see [LICENSE](LICENSE) file for details.
---
## Credits
Created by [BenevolentJoker-JohnL](https://github.com/BenevolentJoker-JohnL)
Part of the [SynapticLlamas](https://github.com/BenevolentJoker-JohnL/SynapticLlamas) ecosystem.
Built with: Ray, Dask, FastAPI, llama.cpp, Ollama
---
## 🎯 What Makes SOLLOL Different?
1. **Combines task distribution AND model sharding** in one system
2. **Context-aware routing** that adapts based on performance metrics
3. **Auto-discovery** of nodes with minimal configuration
4. **Built-in failover** and priority queuing
5. **Purpose-built for Ollama clusters** (understands GPU requirements, task types)
**Limitations to know**:
- Model sharding verified with 13B models; larger models not extensively tested
- Performance benefits depend on network latency and workload patterns
- Not a drop-in replacement for single-node setups in all scenarios
---
<div align="center">
**Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.**
[Get Started](#quick-start) • [View on GitHub](https://github.com/BenevolentJoker-JohnL/SOLLOL) • [Report Issue](https://github.com/BenevolentJoker-JohnL/SOLLOL/issues)
</div>
Raw data
{
"_id": null,
"home_page": "https://github.com/BenevolentJoker-JohnL/SOLLOL",
"name": "sollol",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "ai, llm, distributed, ollama, load-balancing, inference, llama-cpp",
"author": "BenevolentJoker-JohnL",
"author_email": "BenevolentJoker-JohnL <benevolentjoker@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/22/84/507a9031a7ba403dc4424528d607acc401a75f0a07f5a48e8a02bfdd1c09/sollol-0.8.1.tar.gz",
"platform": null,
"description": "# SOLLOL - Production-Ready Orchestration for Local LLM Clusters\n\n<div align=\"center\">\n\n[](https://www.python.org/downloads/)\n[](https://opensource.org/licenses/MIT)\n[](https://github.com/BenevolentJoker-JohnL/SOLLOL/actions/workflows/tests.yml)\n[](https://codecov.io/gh/BenevolentJoker-JohnL/SOLLOL)\n\n**Open-source orchestration layer that combines intelligent task routing with distributed model inference for local LLM clusters.**\n\n[Quick Start](#quick-start) \u2022 [Features](#why-sollol) \u2022 [Architecture](#architecture) \u2022 [Documentation](#documentation) \u2022 [Examples](#examples)\n\n</div>\n\n---\n\n## \ud83c\udfaf What is SOLLOL?\n\nSOLLOL (Super Ollama Load balancer & Orchestration Layer) transforms your collection of Ollama nodes into an **intelligent AI cluster** with adaptive routing and automatic failover\u2014all running on your own hardware.\n\n### The Problem\n\nYou have multiple machines with GPUs running Ollama, but:\n- \u274c Manual node selection for each request\n- \u274c No way to run models larger than your biggest GPU\n- \u274c Can't distribute multi-agent workloads efficiently\n- \u274c No automatic failover or load balancing\n- \u274c Zero visibility into cluster performance\n\n### The SOLLOL Solution\n\nSOLLOL provides:\n- \u2705 **Intelligent routing** that learns which nodes work best for each task\n- \u2705 **Model sharding** to run 70B+ models across multiple machines\n- \u2705 **Parallel agent execution** for multi-agent frameworks\n- \u2705 **Auto-discovery** of all nodes and capabilities\n- \u2705 **Built-in observability** with real-time metrics\n- \u2705 **Zero-config deployment** - just point and go\n\n---\n\n## \ud83d\ude80 Why SOLLOL?\n\n### 1. **Two Distribution Modes in One System**\n\nSOLLOL combines both task distribution and model sharding:\n\n#### \ud83d\udcca Task Distribution (Horizontal Scaling)\nDistribute **multiple requests** across your cluster in parallel:\n```python\n# Run 10 agents simultaneously across 5 nodes\npool = OllamaPool.auto_configure()\nresponses = await asyncio.gather(*[\n pool.chat(model=\"llama3.2\", messages=[...])\n for _ in range(10)\n])\n# Parallel execution across available nodes\n```\n\n#### \ud83e\udde9 Model Sharding (Vertical Scaling)\nRun **single large models** that don't fit on one machine:\n```python\n# Run larger models across multiple nodes\n# Note: Verified with 13B across 2-3 nodes; larger models not extensively tested\nrouter = HybridRouter(\n enable_distributed=True,\n num_rpc_backends=4\n)\nresponse = await router.route_request(\n model=\"llama3:70b\", # Sharded automatically\n messages=[...]\n)\n```\n\n**Use them together!** Small models use task distribution, large models use sharding.\n\n---\n\n### 2. 
**Intelligent, Not Just Balanced**\n\nSOLLOL doesn't just distribute requests randomly\u2014it **learns** and **optimizes**:\n\n| Feature | Simple Load Balancer | SOLLOL |\n|---------|---------------------|---------|\n| **Routing** | Round-robin | Context-aware scoring |\n| **Learning** | None | Adapts from performance history |\n| **Resource Awareness** | None | GPU/CPU/memory-aware |\n| **Task Optimization** | None | Routes by task type complexity |\n| **Failover** | Manual | Automatic with health checks |\n| **Priority** | FIFO | Priority queue with fairness |\n\n**Example**: SOLLOL automatically routes:\n- Heavy generation tasks \u2192 GPU nodes with 24GB VRAM\n- Fast embeddings \u2192 CPU nodes or smaller GPUs\n- Critical requests \u2192 Fastest, most reliable nodes\n- Batch processing \u2192 Lower priority, distributed load\n\n---\n\n### 3. **Production-Ready from Day One**\n\n```python\nfrom sollol import SOLLOL, SOLLOLConfig\n\n# Literally 3 lines to production\nconfig = SOLLOLConfig.auto_discover()\nsollol = SOLLOL(config)\nsollol.start() # \u2705 Gateway running on :8000\n```\n\n**Out of the box**:\n- Auto-discovery of Ollama nodes\n- Health monitoring and failover\n- Prometheus metrics\n- Web dashboard\n- Connection pooling\n- Request hedging\n- Priority queuing\n\n---\n\n## \ud83c\udfd7\ufe0f Architecture\n\n### High-Level Overview\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Your Application \u2502\n\u2502 (SynapticLlamas, custom agents, etc.) 
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502\n \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 SOLLOL Gateway (:8000) \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Intelligent Routing Engine \u2502 \u2502\n\u2502 \u2502 \u2022 Analyzes: task type, complexity, resources \u2502 \u2502\n\u2502 \u2502 \u2022 Scores: all nodes based on context \u2502 \u2502\n\u2502 \u2502 \u2022 Learns: from performance history \u2502 \u2502\n\u2502 \u2502 \u2022 Routes: to optimal node \u2502 \u2502\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Priority Queue + Failover \u2502 \u2502\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n \u25bc \u25bc\n \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n \u2502 Task Mode \u2502 \u2502 Shard Mode \u2502\n \u2502 Ray Cluster \u2502 \u2502 llama.cpp \u2502\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502 \u2502\n \u25bc 
\u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Your Heterogeneous Cluster \u2502\n\u2502 GPU (24GB) \u2502 GPU (16GB) \u2502 CPU (64c) \u2502 GPU (8GB) \u2502... \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n### How Routing Works\n\n```python\n# 1. Request arrives\nPOST /api/chat {\n \"model\": \"llama3.2\",\n \"messages\": [{\"role\": \"user\", \"content\": \"Complex analysis task...\"}],\n \"priority\": 8\n}\n\n# 2. SOLLOL analyzes\ntask_type = \"generation\" # Auto-detected\ncomplexity = \"high\" # Token count analysis\nrequires_gpu = True # Based on task\nestimated_duration = 3.2s # From history\n\n# 3. SOLLOL scores all nodes\nNode A (GPU 24GB, load: 0.2, latency: 120ms) \u2192 Score: 185.3 \u2713 WINNER\nNode B (GPU 8GB, load: 0.6, latency: 200ms) \u2192 Score: 92.1\nNode C (CPU only, load: 0.1, latency: 80ms) \u2192 Score: 41.2\n\n# 4. Routes to Node A, monitors execution, learns for next time\n```\n\n**Scoring Algorithm**:\n```\nScore = 100.0 (baseline)\n \u00d7 success_rate (0.0-1.0)\n \u00f7 (1 + latency_penalty)\n \u00d7 gpu_bonus (1.5x if GPU available & needed)\n \u00f7 (1 + load_penalty)\n \u00d7 priority_alignment\n \u00d7 task_specialization\n```\n\n---\n\n## \ud83d\udce6 Installation\n\n### Quick Install (PyPI)\n```bash\npip install sollol\n```\n\n### From Source\n```bash\ngit clone https://github.com/BenevolentJoker-JohnL/SOLLOL.git\ncd SOLLOL\npip install -e .\n```\n\n---\n\n## \u26a1 Quick Start\n\n### 1. Synchronous API (No async/await needed!)\n\n**New in v0.3.6:** SOLLOL now provides a synchronous API for easier integration with non-async applications.\n\n```python\nfrom sollol.sync_wrapper import OllamaPool\nfrom sollol.priority_helpers import Priority\n\n# Auto-discover and connect to all Ollama nodes\npool = OllamaPool.auto_configure()\n\n# Make requests - SOLLOL routes intelligently\n# No async/await needed!\nresponse = pool.chat(\n model=\"llama3.2\",\n messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n priority=Priority.HIGH, # Semantic priority levels\n timeout=60 # Request timeout in seconds\n)\n\nprint(response['message']['content'])\nprint(f\"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}\")\n```\n\n**Key features of synchronous API:**\n- \u2705 No async/await syntax required\n- \u2705 Works with synchronous agent frameworks\n- \u2705 Same intelligent routing and features\n- \u2705 Runs async code in background thread automatically\n\n---\n\n### 2. 
Async API (Original)\n\nFor async applications, use the original async API:\n\n```python\nfrom sollol import OllamaPool\n\n# Auto-discover and connect to all Ollama nodes\npool = await OllamaPool.auto_configure()\n\n# Make requests - SOLLOL routes intelligently\nresponse = await pool.chat(\n model=\"llama3.2\",\n messages=[{\"role\": \"user\", \"content\": \"Hello!\"}]\n)\n\nprint(response['message']['content'])\nprint(f\"Routed to: {response['_sollol_routing']['host']}\")\nprint(f\"Task type: {response['_sollol_routing']['task_type']}\")\n```\n\n---\n\n### 3. Priority-Based Multi-Agent Execution\n\n**New in v0.3.6:** Use semantic priority levels and role-based mapping.\n\n```python\nfrom sollol.sync_wrapper import OllamaPool\nfrom sollol.priority_helpers import Priority, get_priority_for_role\n\npool = OllamaPool.auto_configure()\n\n# Define agents with different priorities\nagents = [\n {\"name\": \"Researcher\", \"role\": \"researcher\"}, # Priority 8\n {\"name\": \"Editor\", \"role\": \"editor\"}, # Priority 6\n {\"name\": \"Summarizer\", \"role\": \"summarizer\"}, # Priority 5\n]\n\nfor agent in agents:\n priority = get_priority_for_role(agent[\"role\"])\n\n response = pool.chat(\n model=\"llama3.2\",\n messages=[{\"role\": \"user\", \"content\": f\"Task for {agent['name']}\"}],\n priority=priority\n )\n # User-facing agents get priority, background tasks wait\n```\n\n**Priority levels available:**\n- `Priority.CRITICAL` (10) - Mission-critical\n- `Priority.URGENT` (9) - Fast response needed\n- `Priority.HIGH` (7) - Important tasks\n- `Priority.NORMAL` (5) - Default\n- `Priority.LOW` (3) - Background tasks\n- `Priority.BATCH` (1) - Can wait\n\n---\n\n### 4. Model Sharding with llama.cpp (Large Models)\n\n**Run models larger than your biggest GPU** by distributing layers across multiple machines.\n\n#### When to Use Model Sharding\n\nUse model sharding when:\n- \u2705 Model doesn't fit on your largest GPU (e.g., 70B models on 16GB GPUs)\n- \u2705 You have multiple machines with network connectivity\n- \u2705 You can tolerate slower inference for capability\n\nDon't use sharding when:\n- \u274c Model fits on a single GPU (use task distribution instead)\n- \u274c You need maximum inference speed\n- \u274c Network latency is high (>10ms between machines)\n\n#### Quick Start: Auto-Setup (Easiest)\n\n```python\nfrom sollol.sync_wrapper import HybridRouter, OllamaPool\n\n# SOLLOL handles all setup automatically\nrouter = HybridRouter(\n ollama_pool=OllamaPool.auto_configure(),\n enable_distributed=True, # Enable model sharding\n auto_setup_rpc=True, # Auto-configure RPC backends\n num_rpc_backends=3 # Distribute across 3 machines\n)\n\n# Use large model that doesn't fit on one machine\nresponse = router.route_request(\n model=\"llama3.1:70b\", # Automatically sharded across backends\n messages=[{\"role\": \"user\", \"content\": \"Explain quantum computing\"}]\n)\n\nprint(response['message']['content'])\n```\n\n**What happens automatically:**\n1. SOLLOL discovers available RPC backends on your network\n2. Extracts the GGUF model from Ollama storage\n3. Starts llama-server coordinator with optimal settings\n4. Distributes model layers across backends\n5. 
Routes your request to the coordinator\n\n#### RPC Server Auto-Installation\n\n**SOLLOL can automatically clone, build, and start llama.cpp RPC servers for you!**\n\n**One-line installation:**\n\n```python\nfrom sollol.rpc_auto_setup import auto_setup_rpc_backends\n\n# Automatically: clone \u2192 build \u2192 start RPC servers\nbackends = auto_setup_rpc_backends(num_backends=2)\n# Output: [{'host': '127.0.0.1', 'port': 50052}, {'host': '127.0.0.1', 'port': 50053}]\n```\n\n**What this does:**\n1. \u2705 Scans network for existing RPC servers\n2. \u2705 If none found: clones llama.cpp to `~/llama.cpp`\n3. \u2705 Builds llama.cpp with RPC support (`cmake -DGGML_RPC=ON`)\n4. \u2705 Starts RPC servers on ports 50052-50053\n5. \u2705 Returns ready-to-use backend list\n\n**CLI installation:**\n\n```bash\n# Full automated setup (clone + build + install systemd service)\npython3 -m sollol.setup_llama_cpp --all\n\n# Or step by step\npython3 -m sollol.setup_llama_cpp --clone # Clone llama.cpp\npython3 -m sollol.setup_llama_cpp --build # Build with RPC support\npython3 -m sollol.setup_llama_cpp --start # Start RPC server\n```\n\n**Docker IP Resolution:**\n\nSOLLOL automatically resolves Docker container IPs to accessible host IPs:\n\n```python\n# If Docker container reports IP 172.17.0.5:11434\n# SOLLOL automatically resolves to:\n# \u2192 127.0.0.1:11434 (published port mapping)\n# \u2192 host IP (if accessible)\n# \u2192 Docker host gateway\n\nfrom sollol import is_docker_ip, resolve_docker_ip\n\n# Check if IP is Docker internal\nis_docker = is_docker_ip(\"172.17.0.5\") # True\n\n# Resolve Docker IP to accessible IP\naccessible_ip = resolve_docker_ip(\"172.17.0.5\", port=11434)\n# Returns: \"127.0.0.1\" or host IP\n```\n\n**Network Discovery with Docker Support:**\n\n```python\nfrom sollol import OllamaPool\n\n# Auto-discover nodes (automatically resolves Docker IPs)\npool = OllamaPool.auto_configure()\n\n# Manual control\nfrom sollol.discovery import discover_ollama_nodes\nnodes = discover_ollama_nodes(auto_resolve_docker=True)\n```\n\n**Multi-Node Production Setup:**\n\nFor distributed clusters, use systemd services on each node:\n\n```bash\n# On each RPC node\nsudo systemctl enable llama-rpc@50052.service\nsudo systemctl start llama-rpc@50052.service\n```\n\nSee [SOLLOL_RPC_SETUP.md](https://github.com/BenevolentJoker-JohnL/FlockParser/blob/main/SOLLOL_RPC_SETUP.md) for complete installation guide.\n\n#### Architecture: How It Works\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Llama 3.1 70B Model (40GB total) \u2502\n\u2502 Distributed Sharding \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502\n \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n \u2502 \u2502 \u2502\n \u25bc \u25bc \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Machine 1 \u2502 \u2502 Machine 2 \u2502 \u2502 Machine 3 \u2502\n\u2502 Layers 0-26 \u2502 \u2502 Layers 27-53 \u2502 \u2502 Layers 54-79 \u2502\n\u2502 (~13GB) \u2502 \u2502 (~13GB) \u2502 \u2502 (~13GB) \u2502\n\u2502 RPC Backend \u2502 \u2502 RPC Backend \u2502 \u2502 RPC Backend \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u25b2 \u25b2 \u25b2\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502\n \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n \u2502 llama-server \u2502\n \u2502 Coordinator \u2502\n \u2502 (Port 18080) \u2502\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n#### Manual Setup (Advanced)\n\nFor explicit control over RPC backends:\n\n```python\nfrom sollol.llama_cpp_coordinator import LlamaCppCoordinator\nfrom sollol.rpc_registry import RPCBackendRegistry\n\n# 1. Register RPC backends explicitly\nregistry = RPCBackendRegistry()\nregistry.add_backend(\"rpc_1\", \"grpc://10.9.66.45:50052\")\nregistry.add_backend(\"rpc_2\", \"grpc://10.9.66.46:50052\")\nregistry.add_backend(\"rpc_3\", \"grpc://10.9.66.47:50052\")\n\n# 2. Create coordinator\ncoordinator = LlamaCppCoordinator(\n coordinator_port=18080,\n rpc_backends=registry.get_all_backends(),\n context_size=4096,\n gpu_layers=-1 # Use all available GPU layers\n)\n\n# 3. Start and use\nawait coordinator.start(model_name=\"llama3.1:70b\")\nresponse = await coordinator.generate(\n prompt=\"Explain the theory of relativity\",\n max_tokens=500\n)\n```\n\n#### Performance Expectations\n\n| Model Size | Single GPU | Sharded (3 nodes) | Trade-off |\n|------------|-----------|-------------------|-----------|\n| **13B** | \u2705 20 tok/s | \u2705 5 tok/s | -75% speed, works on 3\u00d7smaller GPUs |\n| **70B** | \u274c OOM | \u26a0\ufe0f 3-5 tok/s (est.) | Enables model that won't run otherwise |\n\n**Trade-offs:**\n- \ud83d\udc0c **Startup**: 2-5 minutes (model distribution + loading)\n- \ud83d\udc0c **Inference**: ~4x slower than local (network overhead)\n- \u2705 **Capability**: Run models that won't fit on single GPU\n\n**Learn More:**\n- \ud83d\udcd6 [Complete llama.cpp Guide](docs/llama_cpp_guide.md) - Setup, optimization, troubleshooting\n- \ud83d\udcbb [Working Examples](examples/llama_cpp_distributed.py) - 5 complete examples including conversation, batch processing, error handling\n\n---\n\n### 5. 
Batch Processing API\n\n**New in v0.7.0:** RESTful API for asynchronous batch job management.\n\nSubmit large-scale batch operations (thousands of embeddings, bulk inference) and track progress via job IDs:\n\n```python\nimport requests\n\n# Submit batch embedding job (up to 10,000 documents)\nresponse = requests.post(\"http://localhost:11434/api/batch/embed\", json={\n \"model\": \"nomic-embed-text\",\n \"documents\": [\"Document 1\", \"Document 2\", ...], # Can be thousands\n \"metadata\": {\"source\": \"knowledge_base\"} # Optional metadata\n})\n\njob_id = response.json()[\"job_id\"]\nprint(f\"Job submitted: {job_id}\")\n\n# Poll for job status\nimport time\nwhile True:\n status = requests.get(f\"http://localhost:11434/api/batch/jobs/{job_id}\").json()\n\n progress = status[\"progress\"][\"percent\"]\n print(f\"Progress: {progress}%\")\n\n if status[\"status\"] == \"completed\":\n break\n time.sleep(1)\n\n# Get results\nresults = requests.get(f\"http://localhost:11434/api/batch/results/{job_id}\").json()\nembeddings = results[\"results\"] # List of embedding vectors\nprint(f\"Processed {len(embeddings)} documents in {status['duration_seconds']}s\")\n```\n\n**Available Batch Endpoints:**\n- `POST /api/batch/embed` - Submit batch embedding job\n- `GET /api/batch/jobs/{job_id}` - Get job status\n- `GET /api/batch/results/{job_id}` - Get job results\n- `GET /api/batch/jobs?limit=100` - List recent jobs\n- `DELETE /api/batch/jobs/{job_id}` - Cancel job\n\n**Use cases:**\n- Embedding large document collections (thousands of documents)\n- Bulk inference for batch predictions\n- Background processing without blocking\n- Long-running operations with progress tracking\n\n---\n\n### 6. SOLLOL Detection\n\n**New in v0.3.6:** Detect if SOLLOL is running vs native Ollama.\n\n```python\nimport requests\n\ndef is_sollol(url=\"http://localhost:11434\"):\n \"\"\"Check if SOLLOL is running at the given URL.\"\"\"\n\n # Method 1: Check X-Powered-By header\n response = requests.get(url)\n if response.headers.get(\"X-Powered-By\") == \"SOLLOL\":\n return True\n\n # Method 2: Check health endpoint\n response = requests.get(f\"{url}/api/health\")\n data = response.json()\n if data.get(\"service\") == \"SOLLOL\":\n return True\n\n return False\n\n# Use it\nif is_sollol(\"http://localhost:11434\"):\n print(\"\u2713 SOLLOL detected - using intelligent routing\")\nelse:\n print(\"Native Ollama detected\")\n```\n\n**Why this matters:**\n- Enables graceful fallback in client applications\n- Makes SOLLOL a true drop-in replacement\n- Clients can auto-detect and use SOLLOL features when available\n\n---\n\n### 7. Production Gateway\n\n```python\nfrom sollol import SOLLOL, SOLLOLConfig\n\n# Full production setup\nconfig = SOLLOLConfig(\n ray_workers=4,\n dask_workers=2,\n hosts=[\"gpu-1:11434\", \"gpu-2:11434\", \"cpu-1:11434\"],\n gateway_port=8000,\n metrics_port=9090\n)\n\nsollol = SOLLOL(config)\nsollol.start() # Blocks and runs gateway\n\n# Access via HTTP:\n# curl http://localhost:8000/api/chat -d '{...}'\n# curl http://localhost:8000/api/stats\n# curl http://localhost:8000/api/dashboard\n```\n\n---\n\n## \ud83c\udf93 Use Cases\n\n### 1. 
Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)\n\n**Problem**: Running 10 agents sequentially takes 10x longer than necessary.\n\n**Solution**: SOLLOL distributes agents across nodes in parallel.\n\n```python\n# Before: Sequential execution on one node\n# After: Parallel execution with SOLLOL\npool = OllamaPool.auto_configure()\nagents = await asyncio.gather(*[\n pool.chat(model=\"llama3.2\", messages=agent_prompts[i])\n for i in range(10)\n])\n# Speedup depends on number of available nodes and their capacity\n```\n\n### 2. Large Model Inference\n\n**Problem**: Your model doesn't fit in available VRAM.\n\n**Solution**: SOLLOL can shard models across multiple machines via llama.cpp.\n\n```python\n# Distribute model across multiple nodes\n# Note: Verified with 13B models; larger models not extensively tested\nrouter = HybridRouter(\n enable_distributed=True,\n num_rpc_backends=4\n)\n# Trade-off: Slower startup/inference but enables running larger models\n```\n\n### 3. Mixed Workloads\n\n**Problem**: Different tasks need different resources.\n\n**Solution**: SOLLOL routes each task to the optimal node.\n\n```python\npool = OllamaPool.auto_configure()\n\n# Heavy generation \u2192 GPU node\nchat = pool.chat(model=\"llama3.2:70b\", messages=[...])\n\n# Fast embeddings \u2192 CPU node\nembeddings = pool.embed(model=\"nomic-embed-text\", input=[...])\n\n# SOLLOL automatically routes each to the best available node\n```\n\n### 4. High Availability Production\n\n**Problem**: Node failures break your service.\n\n**Solution**: SOLLOL auto-fails over and recovers.\n\n```python\n# Node A fails mid-request\n# \u2705 SOLLOL automatically:\n# 1. Detects failure\n# 2. Retries on Node B\n# 3. Marks Node A as degraded\n# 4. Periodically re-checks Node A\n# 5. Restores Node A when healthy\n```\n\n---\n\n## \ud83d\udcca Performance & Benchmarks\n\n### Validation Status\n\n**What's Been Validated \u2705**\n- Single-node baseline performance measured\n- Code exists and is reviewable (75+ modules)\n- Tests pass in CI (57 tests, coverage tracked)\n- Architecture implements intelligent routing\n\n**What Needs Validation \u26a0\ufe0f**\n- Comparative benchmarks (SOLLOL vs round-robin)\n- Multi-node performance improvements\n- Real-world latency/throughput gains\n\n\ud83d\udcd6 **See [BENCHMARKING.md](BENCHMARKING.md) for complete validation roadmap and how to run comparative tests.**\n\n---\n\n### Measured Baseline Performance\n\n**Single Ollama Node** (llama3.2-3B, 50 requests, concurrency=5):\n- \u2705 **Success Rate:** 100%\n- \u26a1 **Throughput:** 0.51 req/s\n- \ud83d\udcc8 **Average Latency:** 5,659 ms\n- \ud83d\udcc8 **P95 Latency:** 11,299 ms\n- \ud83d\udcc8 **P99 Latency:** 12,259 ms\n\n**Hardware:** Single Ollama instance with 75+ models loaded\n**Data:** See [`benchmarks/results/`](benchmarks/results/) for raw JSON\n\n**Run Your Own:**\n```bash\n# Baseline test (no cluster needed)\npython benchmarks/simple_ollama_benchmark.py llama3.2 50\n\n# Comparative test (requires docker-compose)\ndocker-compose up -d\npython benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60\n```\n\n---\n\n### Projected Performance (Unvalidated)\n\n**Note:** These are architectural projections, not measured results. 
---

### 7. Production Gateway

```python
from sollol import SOLLOL, SOLLOLConfig

# Full production setup
config = SOLLOLConfig(
    ray_workers=4,
    dask_workers=2,
    hosts=["gpu-1:11434", "gpu-2:11434", "cpu-1:11434"],
    gateway_port=8000,
    metrics_port=9090
)

sollol = SOLLOL(config)
sollol.start()  # Blocks and runs the gateway

# Access via HTTP:
# curl http://localhost:8000/api/chat -d '{...}'
# curl http://localhost:8000/api/stats
# curl http://localhost:8000/api/dashboard
```

---

## 🎓 Use Cases

### 1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)

**Problem**: Running 10 agents sequentially takes 10x longer than necessary.

**Solution**: SOLLOL distributes agents across nodes in parallel.

```python
import asyncio

from sollol import OllamaPool

# Before: sequential execution on one node
# After: parallel execution with SOLLOL
pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=agent_prompts[i])
    for i in range(10)
])
# Speedup depends on the number of available nodes and their capacity
```

### 2. Large Model Inference

**Problem**: Your model doesn't fit in available VRAM.

**Solution**: SOLLOL can shard models across multiple machines via llama.cpp.

```python
# Distribute a model across multiple nodes
# Note: Verified with 13B models; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
# Trade-off: slower startup/inference, but enables running larger models
```

### 3. Mixed Workloads

**Problem**: Different tasks need different resources.

**Solution**: SOLLOL routes each task to the optimal node.

```python
pool = OllamaPool.auto_configure()

# Heavy generation → GPU node
chat = pool.chat(model="llama3.2:70b", messages=[...])

# Fast embeddings → CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])

# SOLLOL automatically routes each to the best available node
```

### 4. High Availability Production

**Problem**: Node failures break your service.

**Solution**: SOLLOL fails over and recovers automatically.

```python
# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects the failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy
```

---

## 📊 Performance & Benchmarks

### Validation Status

**What's Been Validated ✅**
- Single-node baseline performance measured
- Code exists and is reviewable (75+ modules)
- Tests pass in CI (57 tests, coverage tracked)
- Architecture implements intelligent routing

**What Needs Validation ⚠️**
- Comparative benchmarks (SOLLOL vs round-robin)
- Multi-node performance improvements
- Real-world latency/throughput gains

📖 **See [BENCHMARKING.md](BENCHMARKING.md) for the complete validation roadmap and how to run comparative tests.**

---

### Measured Baseline Performance

**Single Ollama Node** (llama3.2-3B, 50 requests, concurrency=5):
- ✅ **Success Rate:** 100%
- ⚡ **Throughput:** 0.51 req/s
- 📈 **Average Latency:** 5,659 ms
- 📈 **P95 Latency:** 11,299 ms
- 📈 **P99 Latency:** 12,259 ms

**Hardware:** Single Ollama instance with 75+ models loaded
**Data:** See [`benchmarks/results/`](benchmarks/results/) for raw JSON

**Run Your Own:**
```bash
# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50

# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60
```

---

### Projected Performance (Unvalidated)

**Note:** These are architectural projections, not measured results; validating them requires a multi-node cluster setup.

**Theory:** With N nodes and a parallelizable workload:
- Task distribution can approach N× parallelization (limited by request rate)
- Intelligent routing should reduce tail latencies vs random selection
- Resource-aware placement reduces contention and failures

For example, with 5 comparable nodes and enough concurrent requests, the 0.51 req/s single-node baseline above would project to roughly 2.5 req/s - again, a projection, not a measurement.

**Reality:** Validation requires a multi-node cluster. See [BENCHMARKING.md](BENCHMARKING.md) for the test procedure and [CODE_WALKTHROUGH.md](CODE_WALKTHROUGH.md) for implementation details.

### Model Sharding Performance

| Model | Single 24GB GPU | SOLLOL (3×16GB) | Status |
|-------|-----------------|-----------------|--------|
| **13B** | ✅ ~20 tok/s | ✅ ~5 tok/s | ✅ Verified working |
| **70B** | ❌ OOM | ⚠️ Estimated ~3-5 tok/s | ⚠️ Not extensively tested |

**When to use sharding**: when a model doesn't fit on your largest GPU. You trade speed for capability; a rough decision helper is sketched below.

**Performance trade-offs**: Distributed inference is 2-5 minutes slower to start and ~4x slower at inference time than running locally. Use it only when necessary.
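That rule of thumb can be made concrete. A minimal sketch, assuming rough Q4-quantized weight sizes; the `needs_sharding()` helper and the size table are ours for illustration and are not part of the SOLLOL API:

```python
# Rough rule from the table above: shard only when the model cannot fit
# on the largest single GPU in the cluster (plus some headroom for KV cache).
APPROX_MODEL_GB = {"13b": 8, "34b": 20, "70b": 40}  # illustrative Q4 estimates

def needs_sharding(model_size: str, largest_gpu_vram_gb: float, headroom: float = 1.2) -> bool:
    """Return True if the model is unlikely to fit on the largest single GPU."""
    required_gb = APPROX_MODEL_GB[model_size.lower()] * headroom
    return required_gb > largest_gpu_vram_gb

print(needs_sharding("13b", largest_gpu_vram_gb=24))  # False -> prefer task distribution
print(needs_sharding("70b", largest_gpu_vram_gb=24))  # True  -> use llama.cpp sharding
```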
### Overhead

- **Routing decision**: ~5-10 ms (tested with 5-10 nodes)
- **Network overhead**: varies by network (typically 5-20 ms)
- **Total added latency**: ~20-50 ms
- **Benefit**: better resource utilization + automatic failover

---

## 🛠️ Advanced Configuration

### Custom Routing Strategy

```python
from sollol import OllamaPool

pool = OllamaPool(
    nodes=[
        {"host": "gpu-1.local", "port": 11434, "priority": 10},  # Prefer this
        {"host": "gpu-2.local", "port": 11434, "priority": 5},
        {"host": "cpu-1.local", "port": 11434, "priority": 1},   # Last resort
    ],
    enable_intelligent_routing=True,
    enable_hedging=True,  # Duplicate critical requests
    max_queue_size=100
)
```

### Priority-Based Scheduling

```python
# Critical user-facing request
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=10  # Highest priority
)

# Background batch job
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=1  # Lowest priority
)

# SOLLOL ensures high-priority requests jump the queue
```

### Observability & Monitoring

```python
# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")

# Per-node breakdown
for host, metrics in stats['hosts'].items():
    print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")
```

```bash
# Prometheus metrics endpoint
curl http://localhost:9090/metrics

# sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
# sollol_latency_seconds{host="gpu-1:11434"} 0.234
# sollol_success_rate{host="gpu-1:11434"} 0.98
```

---

## 🔌 Integration Examples

### SynapticLlamas Integration

```python
from sollol import SOLLOL, SOLLOLConfig
from synaptic_llamas import AgentOrchestrator

# Set up SOLLOL for multi-agent orchestration
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start(blocking=False)

# SynapticLlamas now uses SOLLOL for intelligent routing
orchestrator = AgentOrchestrator(
    llm_endpoint="http://localhost:8000/api/chat"
)

# All agents automatically distributed and optimized
orchestrator.run_parallel_agents([...])
```

### LangChain Integration

```python
from langchain.llms import Ollama

# Point LangChain at the SOLLOL gateway instead of a single Ollama node
llm = Ollama(
    base_url="http://localhost:8000",
    model="llama3.2"
)

# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")
```

---

## 📚 Documentation

- **[Architecture Guide](ARCHITECTURE.md)** - Deep dive into system design
- **[Batch Processing API](BATCH_API.md)** - Complete guide to batch job management (NEW in v0.7.0)
  - API endpoints and examples
  - Job lifecycle and progress tracking
  - Best practices and error handling
- **[llama.cpp Distributed Inference Guide](docs/llama_cpp_guide.md)** - Complete guide to model sharding
  - Setup and configuration
  - Performance optimization
  - Troubleshooting common issues
  - Advanced topics (custom layer distribution, monitoring, etc.)
- **[Integration Examples](examples/integration/)** - Practical integration patterns
  - [Synchronous Agent Integration](examples/integration/sync_agents.py)
  - [Priority Configuration](examples/integration/priority_mapping.py)
  - [Load Balancer Wrapper](examples/integration/load_balancer_wrapper.py)
- **[llama.cpp Distributed Examples](examples/llama_cpp_distributed.py)** - Model sharding examples
  - Auto-setup and manual configuration
  - Multi-turn conversations with monitoring
  - Batch processing with multiple models
  - Error handling and recovery patterns
- **[Deployment Guide](docs/deployment.md)** - Production deployment patterns
- **[API Reference](docs/api.md)** - Complete API documentation
- **[Performance Tuning](docs/performance.md)** - Optimization guide
- **[SynapticLlamas Learnings](SYNAPTICLLAMAS_LEARNINGS.md)** - Features from production use

---

## 🆕 What's New in v0.7.0

### 📦 Batch Processing API

Complete RESTful API for asynchronous batch job management. Submit large-scale batch operations (embeddings, bulk inference) and track progress via job IDs.

```python
import requests

# Submit a batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["doc1", "doc2", ...],  # Thousands of documents
})
job_id = response.json()["job_id"]

# Check status
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}")
print(status.json()["progress"]["percent"])  # e.g. 100.0

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}")
embeddings = results.json()["results"]
```

**Batch API Endpoints:**
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status with progress tracking
- `GET /api/batch/results/{job_id}` - Retrieve job results and errors
- `DELETE /api/batch/jobs/{job_id}` - Cancel running jobs
- `GET /api/batch/jobs?limit=100` - List recent jobs

**Features:**
- UUID-based job tracking with 5 states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
- Automatic TTL-based cleanup (1 hour default)
- Progress tracking: completed_items, failed_items, percentage
- Duration calculation and metadata storage
- Async job execution via Dask distributed processing

### Previous Features (v0.3.6+)

**Synchronous API** - No async/await required:
```python
from sollol.sync_wrapper import OllamaPool
pool = OllamaPool.auto_configure()
response = pool.chat(...)  # Synchronous call
```

**Priority Helpers** - Semantic priority levels:
```python
from sollol.priority_helpers import Priority
priority = Priority.HIGH  # 7
```
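The two compose naturally: since `Priority` constants are plain integers, they can be passed where the scheduling examples above used raw numbers. A short sketch (the combination is an assumption based on those examples, not a documented guarantee):

```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

pool = OllamaPool.auto_configure()

# Priority.HIGH is just the integer 7, so it slots into the same
# `priority` parameter shown in the scheduling examples above.
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize today's incidents."}],
    priority=Priority.HIGH,
)
```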
**SOLLOL Detection:**
- `X-Powered-By: SOLLOL` header on all responses
- `/api/health` endpoint returns `{"service": "SOLLOL", "version": "0.7.0"}`

---

## 🆚 Comparison

### SOLLOL vs. Simple Load Balancers

| Feature | nginx/HAProxy | SOLLOL |
|---------|---------------|--------|
| Routing | Round-robin/random | Context-aware, adapts from history |
| Resource awareness | None | GPU/CPU/memory-aware |
| Failover | Manual config | Automatic detection & recovery |
| Model sharding | ❌ | ✅ llama.cpp integration |
| Task prioritization | ❌ | ✅ Priority queue |
| Observability | Basic | Rich metrics + dashboard |
| Setup | Complex config | Auto-discover |

### SOLLOL vs. Kubernetes

| Feature | Kubernetes | SOLLOL |
|---------|------------|--------|
| **Complexity** | High - requires cluster setup | Low - pip install |
| **AI-specific** | Generic container orchestration | Purpose-built for LLMs |
| **Intelligence** | None | Task-aware routing |
| **Model sharding** | Manual | Automatic |
| **Best for** | Large-scale production | AI-focused teams |

**Use both!** Deploy SOLLOL on Kubernetes for ultimate scalability.

---

## 🤝 Contributing

We welcome contributions! Areas we'd love help with:

- ML-based routing predictions
- Additional monitoring integrations
- Cloud provider integrations
- Performance optimizations
- Documentation improvements

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## 📜 License

MIT License - see [LICENSE](LICENSE) file for details.

---

## 🙏 Credits

Created by [BenevolentJoker-JohnL](https://github.com/BenevolentJoker-JohnL)

Part of the [SynapticLlamas](https://github.com/BenevolentJoker-JohnL/SynapticLlamas) ecosystem.

Built with: Ray, Dask, FastAPI, llama.cpp, Ollama

---

## 🎯 What Makes SOLLOL Different?

1. **Combines task distribution AND model sharding** in one system
2. **Context-aware routing** that adapts based on performance metrics
3. **Auto-discovery** of nodes with minimal configuration
4. **Built-in failover** and priority queuing
5. **Purpose-built for Ollama clusters** (understands GPU requirements, task types)

**Limitations to know**:
- Model sharding verified with 13B models; larger models not extensively tested
- Performance benefits depend on network latency and workload patterns
- Not a drop-in replacement for single-node setups in all scenarios

---

<div align="center">

**Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.**

[Get Started](#quick-start) • [View on GitHub](https://github.com/BenevolentJoker-JohnL/SOLLOL) • [Report Issue](https://github.com/BenevolentJoker-JohnL/SOLLOL/issues)

</div>
"bugtrack_url": null,
"license": "MIT",
"summary": "Intelligent Load Balancer for Ollama Clusters - 3 Distribution Modes: Ray Parallel + Dask Batch + llama.cpp Sharding",
"version": "0.8.1",
"project_urls": {
"Bug Tracker": "https://github.com/BenevolentJoker-JohnL/SOLLOL/issues",
"Documentation": "https://github.com/BenevolentJoker-JohnL/SOLLOL/blob/main/README.md",
"Homepage": "https://github.com/BenevolentJoker-JohnL/SOLLOL",
"Repository": "https://github.com/BenevolentJoker-JohnL/SOLLOL"
},
"split_keywords": [
"ai",
" llm",
" distributed",
" ollama",
" load-balancing",
" inference",
" llama-cpp"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "bf82baf6d96a7a56d72c01c62887ed476f22f466cc9d11a4f0b34033c9413d1d",
"md5": "519235a3cc35632a35cb38bbd7ecb940",
"sha256": "224393c81b218550aab12617d75846c687ef73b4fe7986b441e885f3490c8448"
},
"downloads": -1,
"filename": "sollol-0.8.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "519235a3cc35632a35cb38bbd7ecb940",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 169086,
"upload_time": "2025-10-07T01:23:20",
"upload_time_iso_8601": "2025-10-07T01:23:20.612737Z",
"url": "https://files.pythonhosted.org/packages/bf/82/baf6d96a7a56d72c01c62887ed476f22f466cc9d11a4f0b34033c9413d1d/sollol-0.8.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "2284507a9031a7ba403dc4424528d607acc401a75f0a07f5a48e8a02bfdd1c09",
"md5": "6c14f38f68e80c98bcf75a9327224f81",
"sha256": "eaab7b1f314628511062ba44ca5224c52b44ad87849b40d5a534352ea19f8456"
},
"downloads": -1,
"filename": "sollol-0.8.1.tar.gz",
"has_sig": false,
"md5_digest": "6c14f38f68e80c98bcf75a9327224f81",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 352966,
"upload_time": "2025-10-07T01:23:22",
"upload_time_iso_8601": "2025-10-07T01:23:22.152657Z",
"url": "https://files.pythonhosted.org/packages/22/84/507a9031a7ba403dc4424528d607acc401a75f0a07f5a48e8a02bfdd1c09/sollol-0.8.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-07 01:23:22",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "BenevolentJoker-JohnL",
"github_project": "SOLLOL",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "sollol"
}