sollol

Name: sollol
Version: 0.8.1
Summary: Intelligent Load Balancer for Ollama Clusters - 3 Distribution Modes: Ray Parallel + Dask Batch + llama.cpp Sharding
Home page: https://github.com/BenevolentJoker-JohnL/SOLLOL
Author: BenevolentJoker-JohnL
License: MIT
Requires Python: >=3.8
Keywords: ai, llm, distributed, ollama, load-balancing, inference, llama-cpp
Upload time: 2025-10-07 01:23:22
Requirements: none recorded
# SOLLOL - Production-Ready Orchestration for Local LLM Clusters

<div align="center">

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://github.com/BenevolentJoker-JohnL/SOLLOL/actions/workflows/tests.yml/badge.svg)](https://github.com/BenevolentJoker-JohnL/SOLLOL/actions/workflows/tests.yml)
[![codecov](https://codecov.io/gh/BenevolentJoker-JohnL/SOLLOL/branch/main/graph/badge.svg)](https://codecov.io/gh/BenevolentJoker-JohnL/SOLLOL)

**Open-source orchestration layer that combines intelligent task routing with distributed model inference for local LLM clusters.**

[Quick Start](#quick-start) • [Features](#why-sollol) • [Architecture](#architecture) • [Documentation](#documentation) • [Examples](#examples)

</div>

---

## 🎯 What is SOLLOL?

SOLLOL (Super Ollama Load balancer & Orchestration Layer) transforms your collection of Ollama nodes into an **intelligent AI cluster** with adaptive routing and automatic failover, all running on your own hardware.

### The Problem

You have multiple machines with GPUs running Ollama, but:
- ❌ Manual node selection for each request
- ❌ No way to run models larger than your biggest GPU
- ❌ Can't distribute multi-agent workloads efficiently
- ❌ No automatic failover or load balancing
- ❌ Zero visibility into cluster performance

### The SOLLOL Solution

SOLLOL provides:
- ✅ **Intelligent routing** that learns which nodes work best for each task
- ✅ **Model sharding** to run 70B+ models across multiple machines
- ✅ **Parallel agent execution** for multi-agent frameworks
- ✅ **Auto-discovery** of all nodes and capabilities
- ✅ **Built-in observability** with real-time metrics
- ✅ **Zero-config deployment** - just point and go

---

## 🚀 Why SOLLOL?

### 1. **Two Distribution Modes in One System**

SOLLOL combines both task distribution and model sharding:

#### 📊 Task Distribution (Horizontal Scaling)
Distribute **multiple requests** across your cluster in parallel:
```python
# Run 10 agents simultaneously across 5 nodes
pool = OllamaPool.auto_configure()
responses = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=[...])
    for _ in range(10)
])
# Parallel execution across available nodes
```

#### 🧩 Model Sharding (Vertical Scaling)
Run **single large models** that don't fit on one machine:
```python
# Run larger models across multiple nodes
# Note: Verified with 13B across 2-3 nodes; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
response = await router.route_request(
    model="llama3:70b",  # Sharded automatically
    messages=[...]
)
```

**Use them together!** Small models use task distribution, large models use sharding.
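
A minimal sketch of the combined pattern, using the `HybridRouter`/`OllamaPool` sync API documented later in this README and assuming the router falls back to the Ollama pool for models that fit on a single node:

```python
from sollol.sync_wrapper import HybridRouter, OllamaPool

# One router serving both modes (sketch; parameters follow the Quick Start section below)
router = HybridRouter(
    ollama_pool=OllamaPool.auto_configure(),  # task distribution for small models
    enable_distributed=True,                  # model sharding for large models
    num_rpc_backends=3,
)

# Small model: assumed to be served by the Ollama pool (parallel-friendly)
small = router.route_request(
    model="llama3.2",
    messages=[{"role": "user", "content": "Quick question"}],
)

# Large model: sharded across llama.cpp RPC backends
large = router.route_request(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Long analysis task"}],
)
```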

---

### 2. **Intelligent, Not Just Balanced**

SOLLOL doesn't just distribute requests randomly; it **learns** and **optimizes**:

| Feature | Simple Load Balancer | SOLLOL |
|---------|---------------------|---------|
| **Routing** | Round-robin | Context-aware scoring |
| **Learning** | None | Adapts from performance history |
| **Resource Awareness** | None | GPU/CPU/memory-aware |
| **Task Optimization** | None | Routes by task type complexity |
| **Failover** | Manual | Automatic with health checks |
| **Priority** | FIFO | Priority queue with fairness |

**Example**: SOLLOL automatically routes (an inspection sketch follows this list):
- Heavy generation tasks → GPU nodes with 24GB VRAM
- Fast embeddings → CPU nodes or smaller GPUs
- Critical requests → Fastest, most reliable nodes
- Batch processing → Lower priority, distributed load
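
To see where a given request actually landed, the response carries routing metadata (the `_sollol_routing` keys shown in the Quick Start below); a short sketch:

```python
from sollol.sync_wrapper import OllamaPool

pool = OllamaPool.auto_configure()

response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize this report in one paragraph..."}],
)

# Routing metadata attached by SOLLOL (keys documented in the Quick Start examples)
routing = response.get("_sollol_routing", {})
print("host:", routing.get("host"))            # which node served the request
print("task type:", routing.get("task_type"))  # e.g. detected as "generation"
```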

---

### 3. **Production-Ready from Day One**

```python
from sollol import SOLLOL, SOLLOLConfig

# Literally 3 lines to production
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start()  # ✅ Gateway running on :8000
```

**Out of the box**:
- Auto-discovery of Ollama nodes
- Health monitoring and failover
- Prometheus metrics
- Web dashboard
- Connection pooling
- Request hedging
- Priority queuing

---

## 🏗️ Architecture

### High-Level Overview

```
┌─────────────────────────────────────────────────────────┐
│                  Your Application                       │
│         (SynapticLlamas, custom agents, etc.)           │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                 SOLLOL Gateway (:8000)                  │
│  ┌───────────────────────────────────────────────────┐  │
│  │         Intelligent Routing Engine               │  │
│  │  • Analyzes: task type, complexity, resources    │  │
│  │  • Scores: all nodes based on context            │  │
│  │  • Learns: from performance history              │  │
│  │  • Routes: to optimal node                       │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │          Priority Queue + Failover                │  │
│  └───────────────────────────────────────────────────┘  │
└────────┬────────────────────────┬───────────────────────┘
         │                        │
         ▼                        ▼
  ┌─────────────┐          ┌──────────────┐
  │ Task Mode   │          │  Shard Mode  │
  │ Ray Cluster │          │  llama.cpp   │
  └──────┬──────┘          └──────┬───────┘
         │                        │
         ▼                        ▼
┌─────────────────────────────────────────────────────────┐
│              Your Heterogeneous Cluster                 │
│  GPU (24GB) │ GPU (16GB) │ CPU (64c) │ GPU (8GB) │ ...  │
└─────────────────────────────────────────────────────────┘
```

### How Routing Works

```python
# 1. Request arrives
POST /api/chat {
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Complex analysis task..."}],
  "priority": 8
}

# 2. SOLLOL analyzes
task_type = "generation"       # Auto-detected
complexity = "high"             # Token count analysis
requires_gpu = True             # Based on task
estimated_duration = 3.2s       # From history

# 3. SOLLOL scores all nodes
Node A (GPU 24GB, load: 0.2, latency: 120ms) → Score: 185.3 ✓ WINNER
Node B (GPU 8GB,  load: 0.6, latency: 200ms) → Score: 92.1
Node C (CPU only, load: 0.1, latency: 80ms)  → Score: 41.2

# 4. Routes to Node A, monitors execution, learns for next time
```

**Scoring Algorithm**:
```
Score = 100.0 (baseline)
      × success_rate (0.0-1.0)
      ÷ (1 + latency_penalty)
      × gpu_bonus (1.5x if GPU available & needed)
      ÷ (1 + load_penalty)
      × priority_alignment
      × task_specialization
```
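
As a simplified illustration of how such a score could be computed (not SOLLOL's internal code; the field names are made up, and the priority/specialization factors are omitted for brevity):

```python
def score_node(node: dict, task_needs_gpu: bool) -> float:
    """Toy re-implementation of the scoring formula above."""
    score = 100.0                              # baseline
    score *= node["success_rate"]              # 0.0-1.0 from history
    score /= 1 + node["latency_ms"] / 100.0    # latency penalty
    if task_needs_gpu and node["has_gpu"]:
        score *= 1.5                           # GPU bonus
    score /= 1 + node["load"]                  # load penalty
    return score

nodes = [
    {"name": "A", "success_rate": 0.99, "latency_ms": 120, "load": 0.2, "has_gpu": True},
    {"name": "B", "success_rate": 0.95, "latency_ms": 200, "load": 0.6, "has_gpu": True},
    {"name": "C", "success_rate": 0.99, "latency_ms": 80,  "load": 0.1, "has_gpu": False},
]
best = max(nodes, key=lambda n: score_node(n, task_needs_gpu=True))
print(best["name"])  # "A" wins here: healthy, lightly loaded, and has a GPU
```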

---

## 📦 Installation

### Quick Install (PyPI)
```bash
pip install sollol
```

### From Source
```bash
git clone https://github.com/BenevolentJoker-JohnL/SOLLOL.git
cd SOLLOL
pip install -e .
```

---

## ⚡ Quick Start

### 1. Synchronous API (No async/await needed!)

**New in v0.3.6:** SOLLOL now provides a synchronous API for easier integration with non-async applications.

```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

# Auto-discover and connect to all Ollama nodes
pool = OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
# No async/await needed!
response = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    priority=Priority.HIGH,  # Semantic priority levels
    timeout=60  # Request timeout in seconds
)

print(response['message']['content'])
print(f"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}")
```

**Key features of the synchronous API** (see the thread-pool sketch below):
- ✅ No async/await syntax required
- ✅ Works with synchronous agent frameworks
- ✅ Same intelligent routing and features
- ✅ Runs async code in a background thread automatically
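
For example, parallel agents can be run from a plain thread pool with no asyncio in user code (a sketch; this assumes the pool handles concurrent calls from multiple threads):

```python
from concurrent.futures import ThreadPoolExecutor

from sollol.sync_wrapper import OllamaPool

pool = OllamaPool.auto_configure()

def run_agent(prompt: str) -> dict:
    # Each call is routed independently by SOLLOL
    return pool.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])

prompts = [f"Agent task {i}" for i in range(5)]
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(run_agent, prompts))

print(len(results), "agents completed")
```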

---

### 2. Async API (Original)

For async applications, use the original async API:

```python
from sollol import OllamaPool

# Auto-discover and connect to all Ollama nodes
pool = await OllamaPool.auto_configure()

# Make requests - SOLLOL routes intelligently
response = await pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response['message']['content'])
print(f"Routed to: {response['_sollol_routing']['host']}")
print(f"Task type: {response['_sollol_routing']['task_type']}")
```

---

### 3. Priority-Based Multi-Agent Execution

**New in v0.3.6:** Use semantic priority levels and role-based mapping.

```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority, get_priority_for_role

pool = OllamaPool.auto_configure()

# Define agents with different priorities
agents = [
    {"name": "Researcher", "role": "researcher"},  # Priority 8
    {"name": "Editor", "role": "editor"},          # Priority 6
    {"name": "Summarizer", "role": "summarizer"},  # Priority 5
]

for agent in agents:
    priority = get_priority_for_role(agent["role"])

    response = pool.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": f"Task for {agent['name']}"}],
        priority=priority
    )
    # User-facing agents get priority, background tasks wait
```

**Priority levels available:**
- `Priority.CRITICAL` (10) - Mission-critical
- `Priority.URGENT` (9) - Fast response needed
- `Priority.HIGH` (7) - Important tasks
- `Priority.NORMAL` (5) - Default
- `Priority.LOW` (3) - Background tasks
- `Priority.BATCH` (1) - Can wait

---

### 4. Model Sharding with llama.cpp (Large Models)

**Run models larger than your biggest GPU** by distributing layers across multiple machines.

#### When to Use Model Sharding

Use model sharding when (a decision sketch follows these lists):
- ✅ Model doesn't fit on your largest GPU (e.g., 70B models on 16GB GPUs)
- ✅ You have multiple machines with network connectivity
- ✅ You can tolerate slower inference for capability

Don't use sharding when:
- ❌ Model fits on a single GPU (use task distribution instead)
- ❌ You need maximum inference speed
- ❌ Network latency is high (>10ms between machines)
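
The guidance above can be summed up as a simple decision rule (illustrative only; the thresholds are assumptions, not values SOLLOL uses):

```python
def should_shard(model_size_gb: float, largest_gpu_vram_gb: float,
                 inter_node_latency_ms: float) -> bool:
    """Rough rule of thumb for choosing model sharding over task distribution."""
    if model_size_gb <= largest_gpu_vram_gb * 0.9:   # fits locally, with headroom for KV cache
        return False                                 # prefer single-node inference
    if inter_node_latency_ms > 10:                   # slow network erases the benefit
        return False
    return True                                      # trade speed for the ability to run it at all

print(should_shard(model_size_gb=40, largest_gpu_vram_gb=16, inter_node_latency_ms=2))  # True
```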

#### Quick Start: Auto-Setup (Easiest)

```python
from sollol.sync_wrapper import HybridRouter, OllamaPool

# SOLLOL handles all setup automatically
router = HybridRouter(
    ollama_pool=OllamaPool.auto_configure(),
    enable_distributed=True,  # Enable model sharding
    auto_setup_rpc=True,      # Auto-configure RPC backends
    num_rpc_backends=3        # Distribute across 3 machines
)

# Use large model that doesn't fit on one machine
response = router.route_request(
    model="llama3.1:70b",  # Automatically sharded across backends
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

print(response['message']['content'])
```

**What happens automatically:**
1. SOLLOL discovers available RPC backends on your network
2. Extracts the GGUF model from Ollama storage
3. Starts llama-server coordinator with optimal settings
4. Distributes model layers across backends
5. Routes your request to the coordinator

#### RPC Server Auto-Installation

**SOLLOL can automatically clone, build, and start llama.cpp RPC servers for you!**

**One-line installation:**

```python
from sollol.rpc_auto_setup import auto_setup_rpc_backends

# Automatically: clone → build → start RPC servers
backends = auto_setup_rpc_backends(num_backends=2)
# Output: [{'host': '127.0.0.1', 'port': 50052}, {'host': '127.0.0.1', 'port': 50053}]
```

**What this does:**
1. ✅ Scans network for existing RPC servers
2. ✅ If none found: clones llama.cpp to `~/llama.cpp`
3. ✅ Builds llama.cpp with RPC support (`cmake -DGGML_RPC=ON`)
4. ✅ Starts RPC servers on ports 50052-50053
5. ✅ Returns ready-to-use backend list

**CLI installation:**

```bash
# Full automated setup (clone + build + install systemd service)
python3 -m sollol.setup_llama_cpp --all

# Or step by step
python3 -m sollol.setup_llama_cpp --clone  # Clone llama.cpp
python3 -m sollol.setup_llama_cpp --build  # Build with RPC support
python3 -m sollol.setup_llama_cpp --start  # Start RPC server
```

**Docker IP Resolution:**

SOLLOL automatically resolves Docker container IPs to accessible host IPs:

```python
# If Docker container reports IP 172.17.0.5:11434
# SOLLOL automatically resolves to:
# → 127.0.0.1:11434 (published port mapping)
# → host IP (if accessible)
# → Docker host gateway

from sollol import is_docker_ip, resolve_docker_ip

# Check if IP is Docker internal
is_docker = is_docker_ip("172.17.0.5")  # True

# Resolve Docker IP to accessible IP
accessible_ip = resolve_docker_ip("172.17.0.5", port=11434)
# Returns: "127.0.0.1" or host IP
```

**Network Discovery with Docker Support:**

```python
from sollol import OllamaPool

# Auto-discover nodes (automatically resolves Docker IPs)
pool = OllamaPool.auto_configure()

# Manual control
from sollol.discovery import discover_ollama_nodes
nodes = discover_ollama_nodes(auto_resolve_docker=True)
```

**Multi-Node Production Setup:**

For distributed clusters, use systemd services on each node:

```bash
# On each RPC node
sudo systemctl enable llama-rpc@50052.service
sudo systemctl start llama-rpc@50052.service
```

See [SOLLOL_RPC_SETUP.md](https://github.com/BenevolentJoker-JohnL/FlockParser/blob/main/SOLLOL_RPC_SETUP.md) for complete installation guide.

#### Architecture: How It Works

```
┌────────────────────────────────────────────┐
│    Llama 3.1 70B Model (40GB total)        │
│           Distributed Sharding             │
└────────────────────────────────────────────┘
                        │
       ┌────────────────┼────────────────┐
       │                │                │
       ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  Machine 1   │ │  Machine 2   │ │  Machine 3   │
│ Layers 0-26  │ │ Layers 27-53 │ │ Layers 54-79 │
│   (~13GB)    │ │   (~13GB)    │ │   (~13GB)    │
│ RPC Backend  │ │ RPC Backend  │ │ RPC Backend  │
└──────────────┘ └──────────────┘ └──────────────┘
       ▲                ▲                ▲
       └────────────────┼────────────────┘
                        │
             ┌──────────┴──────────┐
             │ llama-server        │
             │ Coordinator         │
             │ (Port 18080)        │
             └─────────────────────┘
```
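
The layer split in the diagram is simple arithmetic; this sketch just illustrates it (SOLLOL and llama.cpp handle the split automatically):

```python
def split_layers(total_layers, num_backends):
    """Divide model layers as evenly as possible across backends."""
    base, extra = divmod(total_layers, num_backends)
    ranges, start = [], 0
    for i in range(num_backends):
        count = base + (1 if i < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges

print(split_layers(80, 3))  # [(0, 26), (27, 53), (54, 79)] -- matches the diagram above
```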

#### Manual Setup (Advanced)

For explicit control over RPC backends:

```python
from sollol.llama_cpp_coordinator import LlamaCppCoordinator
from sollol.rpc_registry import RPCBackendRegistry

# 1. Register RPC backends explicitly
registry = RPCBackendRegistry()
registry.add_backend("rpc_1", "grpc://10.9.66.45:50052")
registry.add_backend("rpc_2", "grpc://10.9.66.46:50052")
registry.add_backend("rpc_3", "grpc://10.9.66.47:50052")

# 2. Create coordinator
coordinator = LlamaCppCoordinator(
    coordinator_port=18080,
    rpc_backends=registry.get_all_backends(),
    context_size=4096,
    gpu_layers=-1  # Use all available GPU layers
)

# 3. Start and use
await coordinator.start(model_name="llama3.1:70b")
response = await coordinator.generate(
    prompt="Explain the theory of relativity",
    max_tokens=500
)
```

#### Performance Expectations

| Model Size | Single GPU | Sharded (3 nodes) | Trade-off |
|------------|-----------|-------------------|-----------|
| **13B** | ✅ 20 tok/s | ✅ 5 tok/s | -75% speed, works on 3× smaller GPUs |
| **70B** | ❌ OOM | ⚠️ 3-5 tok/s (est.) | Enables model that won't run otherwise |

**Trade-offs:**
- 🐌 **Startup**: 2-5 minutes (model distribution + loading)
- 🐌 **Inference**: ~4x slower than local (network overhead)
- ✅ **Capability**: Run models that won't fit on single GPU

**Learn More:**
- 📖 [Complete llama.cpp Guide](docs/llama_cpp_guide.md) - Setup, optimization, troubleshooting
- 💻 [Working Examples](examples/llama_cpp_distributed.py) - 5 complete examples including conversation, batch processing, error handling

---

### 5. Batch Processing API

**New in v0.7.0:** RESTful API for asynchronous batch job management.

Submit large-scale batch operations (thousands of embeddings, bulk inference) and track progress via job IDs:

```python
import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["Document 1", "Document 2", ...],  # Can be thousands
    "metadata": {"source": "knowledge_base"}  # Optional metadata
})

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

# Poll for job status
import time
while True:
    status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}").json()

    progress = status["progress"]["percent"]
    print(f"Progress: {progress}%")

    if status["status"] == "completed":
        break
    time.sleep(1)

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}").json()
embeddings = results["results"]  # List of embedding vectors
print(f"Processed {len(embeddings)} documents in {status['duration_seconds']}s")
```

**Available Batch Endpoints:**
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status
- `GET /api/batch/results/{job_id}` - Get job results
- `GET /api/batch/jobs?limit=100` - List recent jobs
- `DELETE /api/batch/jobs/{job_id}` - Cancel job

**Use cases:**
- Embedding large document collections (thousands of documents)
- Bulk inference for batch predictions
- Background processing without blocking
- Long-running operations with progress tracking

---

### 6. SOLLOL Detection

**New in v0.3.6:** Detect if SOLLOL is running vs native Ollama.

```python
import requests

def is_sollol(url="http://localhost:11434"):
    """Check whether SOLLOL (rather than native Ollama) is running at the given URL."""

    # Method 1: Check the X-Powered-By header
    response = requests.get(url)
    if response.headers.get("X-Powered-By") == "SOLLOL":
        return True

    # Method 2: Check the health endpoint (native Ollama may not return JSON here)
    try:
        data = requests.get(f"{url}/api/health").json()
    except ValueError:
        return False

    return data.get("service") == "SOLLOL"

# Use it
if is_sollol("http://localhost:11434"):
    print("βœ“ SOLLOL detected - using intelligent routing")
else:
    print("Native Ollama detected")
```

**Why this matters** (a fallback sketch follows this list):
- Enables graceful fallback in client applications
- Makes SOLLOL a true drop-in replacement
- Clients can auto-detect and use SOLLOL features when available
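
Building on `is_sollol()`, a client can attach SOLLOL-specific fields (like the `priority` shown in "How Routing Works") only when SOLLOL is detected; a sketch:

```python
import requests

def chat(url: str, messages: list, priority: int = 5) -> dict:
    payload = {"model": "llama3.2", "messages": messages}
    if is_sollol(url):
        payload["priority"] = priority   # SOLLOL-specific field; omitted for native Ollama
    return requests.post(f"{url}/api/chat", json=payload).json()

response = chat("http://localhost:11434", [{"role": "user", "content": "Hello!"}], priority=8)
print(response["message"]["content"])
```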

---

### 7. Production Gateway

```python
from sollol import SOLLOL, SOLLOLConfig

# Full production setup
config = SOLLOLConfig(
    ray_workers=4,
    dask_workers=2,
    hosts=["gpu-1:11434", "gpu-2:11434", "cpu-1:11434"],
    gateway_port=8000,
    metrics_port=9090
)

sollol = SOLLOL(config)
sollol.start()  # Blocks and runs gateway

# Access via HTTP:
# curl http://localhost:8000/api/chat -d '{...}'
# curl http://localhost:8000/api/stats
# curl http://localhost:8000/api/dashboard
```
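
Because the gateway speaks plain HTTP, any language or tool can call it directly; a sketch mirroring the `curl` examples above (the `priority` field follows the routing example earlier in this README):

```python
import requests

resp = requests.post("http://localhost:8000/api/chat", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "priority": 8,
})
print(resp.json()["message"]["content"])

# Cluster-wide statistics from the gateway
print(requests.get("http://localhost:8000/api/stats").json())
```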

---

## 🎓 Use Cases

### 1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)

**Problem**: Running 10 agents sequentially takes 10x longer than necessary.

**Solution**: SOLLOL distributes agents across nodes in parallel.

```python
# Before: Sequential execution on one node
# After: Parallel execution with SOLLOL
pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=agent_prompts[i])
    for i in range(10)
])
# Speedup depends on number of available nodes and their capacity
```

### 2. Large Model Inference

**Problem**: Your model doesn't fit in available VRAM.

**Solution**: SOLLOL can shard models across multiple machines via llama.cpp.

```python
# Distribute model across multiple nodes
# Note: Verified with 13B models; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
# Trade-off: Slower startup/inference but enables running larger models
```

### 3. Mixed Workloads

**Problem**: Different tasks need different resources.

**Solution**: SOLLOL routes each task to the optimal node.

```python
pool = OllamaPool.auto_configure()

# Heavy generation → GPU node
chat = pool.chat(model="llama3.1:70b", messages=[...])

# Fast embeddings → CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])

# SOLLOL automatically routes each to the best available node
```

### 4. High Availability Production

**Problem**: Node failures break your service.

**Solution**: SOLLOL auto-fails over and recovers.

```python
# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy
```
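
Conceptually, the failover loop looks something like the sketch below (illustrative only, not SOLLOL's internal code; `send_to_node` is a hypothetical stand-in for the actual HTTP call):

```python
import time

def route_with_failover(nodes, request, send_to_node, state, recheck_after=30.0):
    """Try healthy nodes first; mark failures as degraded and re-check them later."""
    now = time.time()
    for node in nodes:
        info = state.setdefault(node, {"degraded": False, "recheck_at": 0.0})
        if info["degraded"] and now < info["recheck_at"]:
            continue                            # skip degraded node until its re-check time
        try:
            result = send_to_node(node, request)
            info["degraded"] = False            # node served the request: restore it
            return result
        except ConnectionError:
            info["degraded"] = True             # mark degraded and fall through to next node
            info["recheck_at"] = now + recheck_after
    raise RuntimeError("no healthy node could serve the request")
```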

---

## 📊 Performance & Benchmarks

### Validation Status

**What's Been Validated ✅**
- Single-node baseline performance measured
- Code exists and is reviewable (75+ modules)
- Tests pass in CI (57 tests, coverage tracked)
- Architecture implements intelligent routing

**What Needs Validation ⚠️**
- Comparative benchmarks (SOLLOL vs round-robin)
- Multi-node performance improvements
- Real-world latency/throughput gains

📖 **See [BENCHMARKING.md](BENCHMARKING.md) for complete validation roadmap and how to run comparative tests.**

---

### Measured Baseline Performance

**Single Ollama Node** (llama3.2-3B, 50 requests, concurrency=5):
- ✅ **Success Rate:** 100%
- ⚡ **Throughput:** 0.51 req/s
- 📈 **Average Latency:** 5,659 ms
- 📈 **P95 Latency:** 11,299 ms
- 📈 **P99 Latency:** 12,259 ms

**Hardware:** Single Ollama instance with 75+ models loaded
**Data:** See [`benchmarks/results/`](benchmarks/results/) for raw JSON

**Run Your Own:**
```bash
# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50

# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60
```

---

### Projected Performance (Unvalidated)

**Note:** These are architectural projections, not measured results. Requires multi-node cluster setup for validation.

**Theory:** With N nodes and parallelizable workload:
- Task distribution can approach N× parallelization (limited by request rate)
- Intelligent routing should reduce tail latencies vs random selection
- Resource-aware placement reduces contention and failures

**Reality:** Requires multi-node cluster validation. See [BENCHMARKING.md](BENCHMARKING.md) for test procedure and [CODE_WALKTHROUGH.md](CODE_WALKTHROUGH.md) for implementation details.
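
As a back-of-envelope illustration of the N× claim, scaled from the measured single-node baseline above (pure arithmetic, not a benchmark result; the 10% overhead figure is an assumption):

```python
baseline_rps = 0.51          # measured single-node throughput (see above)
routing_overhead = 0.10      # assumed fraction lost to routing/queueing

for n_nodes in (1, 2, 4, 8):
    ideal = baseline_rps * n_nodes                   # perfect-parallelism upper bound
    projected = ideal * (1 - routing_overhead)       # crude adjusted projection
    print(f"{n_nodes} nodes: <= {ideal:.2f} req/s ideal, ~{projected:.2f} req/s projected")
```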

### Model Sharding Performance

| Model | Single 24GB GPU | SOLLOL (3×16GB) | Status |
|-------|----------------|-----------------|-----------|
| **13B** | ✅ ~20 tok/s | ✅ ~5 tok/s | ✅ Verified working |
| **70B** | ❌ OOM | ⚠️ Estimated ~3-5 tok/s | ⚠️ Not extensively tested |

**When to use sharding**: When model doesn't fit on your largest GPU. You trade speed for capability.

**Performance trade-offs**: Distributed inference is 2-5 minutes slower to start and ~4x slower for inference compared to local. Use only when necessary.

### Overhead

- **Routing decision**: ~5-10ms (tested with 5-10 nodes)
- **Network overhead**: Varies by network (typically 5-20ms)
- **Total added latency**: ~20-50ms
- **Benefit**: Better resource utilization + automatic failover

---

## 🛠️ Advanced Configuration

### Custom Routing Strategy

```python
from sollol import OllamaPool

pool = OllamaPool(
    nodes=[
        {"host": "gpu-1.local", "port": 11434, "priority": 10},  # Prefer this
        {"host": "gpu-2.local", "port": 11434, "priority": 5},
        {"host": "cpu-1.local", "port": 11434, "priority": 1},   # Last resort
    ],
    enable_intelligent_routing=True,
    enable_hedging=True,  # Duplicate critical requests
    max_queue_size=100
)
```

### Priority-Based Scheduling

```python
# Critical user-facing request
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=10  # Highest priority
)

# Background batch job
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=1  # Lowest priority
)

# SOLLOL ensures high-priority requests jump the queue
```

### Observability & Monitoring

```python
# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")

# Per-node breakdown
for host, metrics in stats['hosts'].items():
    print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")
```

```bash
# Prometheus metrics endpoint
curl http://localhost:9090/metrics

# sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
# sollol_latency_seconds{host="gpu-1:11434"} 0.234
# sollol_success_rate{host="gpu-1:11434"} 0.98
```

---

## 🔌 Integration Examples

### SynapticLlamas Integration

```python
from sollol import SOLLOL, SOLLOLConfig
from synaptic_llamas import AgentOrchestrator

# Setup SOLLOL for multi-agent orchestration
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start(blocking=False)

# SynapticLlamas now uses SOLLOL for intelligent routing
orchestrator = AgentOrchestrator(
    llm_endpoint="http://localhost:8000/api/chat"
)

# All agents automatically distributed and optimized
orchestrator.run_parallel_agents([...])
```

### LangChain Integration

```python
from langchain.llms import Ollama
from sollol import OllamaPool

# Use SOLLOL as LangChain backend
pool = OllamaPool.auto_configure()

llm = Ollama(
    base_url="http://localhost:8000",
    model="llama3.2"
)

# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")
```

---

## 📚 Documentation

- **[Architecture Guide](ARCHITECTURE.md)** - Deep dive into system design
- **[Batch Processing API](BATCH_API.md)** - Complete guide to batch job management (NEW in v0.7.0)
  - API endpoints and examples
  - Job lifecycle and progress tracking
  - Best practices and error handling
- **[llama.cpp Distributed Inference Guide](docs/llama_cpp_guide.md)** - Complete guide to model sharding
  - Setup and configuration
  - Performance optimization
  - Troubleshooting common issues
  - Advanced topics (custom layer distribution, monitoring, etc.)
- **[Integration Examples](examples/integration/)** - Practical integration patterns
  - [Synchronous Agent Integration](examples/integration/sync_agents.py)
  - [Priority Configuration](examples/integration/priority_mapping.py)
  - [Load Balancer Wrapper](examples/integration/load_balancer_wrapper.py)
- **[llama.cpp Distributed Examples](examples/llama_cpp_distributed.py)** - Model sharding examples
  - Auto-setup and manual configuration
  - Multi-turn conversations with monitoring
  - Batch processing with multiple models
  - Error handling and recovery patterns
- **[Deployment Guide](docs/deployment.md)** - Production deployment patterns
- **[API Reference](docs/api.md)** - Complete API documentation
- **[Performance Tuning](docs/performance.md)** - Optimization guide
- **[SynapticLlamas Learnings](SYNAPTICLLAMAS_LEARNINGS.md)** - Features from production use

---

## 🆕 What's New in v0.7.0

### 📦 Batch Processing API
Complete RESTful API for asynchronous batch job management. Submit large-scale batch operations (embeddings, bulk inference) and track progress via job IDs.

```python
import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["doc1", "doc2", ...],  # Thousands of documents
})
job_id = response.json()["job_id"]

# Check status
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}")
print(status.json()["progress"]["percent"])  # 100.0

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}")
embeddings = results.json()["results"]
```

**Batch API Endpoints:**
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status with progress tracking
- `GET /api/batch/results/{job_id}` - Retrieve job results and errors
- `DELETE /api/batch/jobs/{job_id}` - Cancel running jobs
- `GET /api/batch/jobs?limit=100` - List recent jobs

**Features** (a polling sketch follows this list):
- UUID-based job tracking with 5 states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
- Automatic TTL-based cleanup (1 hour default)
- Progress tracking: completed_items, failed_items, percentage
- Duration calculation and metadata storage
- Async job execution via Dask distributed processing
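
A small polling helper that waits for any terminal state and cancels the job if it takes too long (a sketch; endpoint paths follow the Batch API section above, the timeout is arbitrary, and status strings are compared case-insensitively since the examples above use lowercase values):

```python
import time

import requests

def wait_for_job(base_url: str, job_id: str, timeout_s: float = 600.0, poll_s: float = 1.0) -> dict:
    """Poll a batch job until it reaches a terminal state or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{base_url}/api/batch/jobs/{job_id}").json()
        if status["status"].lower() in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_s)
    requests.delete(f"{base_url}/api/batch/jobs/{job_id}")   # give up: cancel the job
    raise TimeoutError(f"batch job {job_id} did not finish within {timeout_s}s")
```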

### Previous Features (v0.3.6+)

**Synchronous API** - No async/await required:
```python
from sollol.sync_wrapper import OllamaPool
pool = OllamaPool.auto_configure()
response = pool.chat(...)  # Synchronous call
```

**Priority Helpers** - Semantic priority levels:
```python
from sollol.priority_helpers import Priority
priority = Priority.HIGH  # 7
```

**SOLLOL Detection:**
- `X-Powered-By: SOLLOL` header on all responses
- `/api/health` endpoint returns `{"service": "SOLLOL", "version": "0.7.0"}`

---

## 🆚 Comparison

### SOLLOL vs. Simple Load Balancers

| Feature | nginx/HAProxy | SOLLOL |
|---------|--------------|---------|
| Routing | Round-robin/random | Context-aware, adapts from history |
| Resource awareness | None | GPU/CPU/memory-aware |
| Failover | Manual config | Automatic detection & recovery |
| Model sharding | ❌ | ✅ llama.cpp integration |
| Task prioritization | ❌ | ✅ Priority queue |
| Observability | Basic | Rich metrics + dashboard |
| Setup | Complex config | Auto-discover |

### SOLLOL vs. Kubernetes

| Feature | Kubernetes | SOLLOL |
|---------|-----------|---------|
| **Complexity** | High - requires cluster setup | Low - pip install |
| **AI-specific** | Generic container orchestration | Purpose-built for LLMs |
| **Intelligence** | None | Task-aware routing |
| **Model sharding** | Manual | Automatic |
| **Best for** | Large-scale production | AI-focused teams |

**Use both!** Deploy SOLLOL on Kubernetes for ultimate scalability.

---

## 🤝 Contributing

We welcome contributions! Areas we'd love help with:

- ML-based routing predictions
- Additional monitoring integrations
- Cloud provider integrations
- Performance optimizations
- Documentation improvements

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## 📜 License

MIT License - see [LICENSE](LICENSE) file for details.

---

## 🙏 Credits

Created by [BenevolentJoker-JohnL](https://github.com/BenevolentJoker-JohnL)

Part of the [SynapticLlamas](https://github.com/BenevolentJoker-JohnL/SynapticLlamas) ecosystem.

Built with: Ray, Dask, FastAPI, llama.cpp, Ollama

---

## 🎯 What Makes SOLLOL Different?

1. **Combines task distribution AND model sharding** in one system
2. **Context-aware routing** that adapts based on performance metrics
3. **Auto-discovery** of nodes with minimal configuration
4. **Built-in failover** and priority queuing
5. **Purpose-built for Ollama clusters** (understands GPU requirements, task types)

**Limitations to know**:
- Model sharding verified with 13B models; larger models not extensively tested
- Performance benefits depend on network latency and workload patterns
- Not a drop-in replacement for single-node setups in all scenarios

---

<div align="center">

**Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.**

[Get Started](#quick-start) • [View on GitHub](https://github.com/BenevolentJoker-JohnL/SOLLOL) • [Report Issue](https://github.com/BenevolentJoker-JohnL/SOLLOL/issues)

</div>

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/BenevolentJoker-JohnL/SOLLOL",
    "name": "sollol",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "ai, llm, distributed, ollama, load-balancing, inference, llama-cpp",
    "author": "BenevolentJoker-JohnL",
    "author_email": "BenevolentJoker-JohnL <benevolentjoker@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/22/84/507a9031a7ba403dc4424528d607acc401a75f0a07f5a48e8a02bfdd1c09/sollol-0.8.1.tar.gz",
    "platform": null,
    "description": "# SOLLOL - Production-Ready Orchestration for Local LLM Clusters\n\n<div align=\"center\">\n\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Tests](https://github.com/BenevolentJoker-JohnL/SOLLOL/actions/workflows/tests.yml/badge.svg)](https://github.com/BenevolentJoker-JohnL/SOLLOL/actions/workflows/tests.yml)\n[![codecov](https://codecov.io/gh/BenevolentJoker-JohnL/SOLLOL/branch/main/graph/badge.svg)](https://codecov.io/gh/BenevolentJoker-JohnL/SOLLOL)\n\n**Open-source orchestration layer that combines intelligent task routing with distributed model inference for local LLM clusters.**\n\n[Quick Start](#quick-start) \u2022 [Features](#why-sollol) \u2022 [Architecture](#architecture) \u2022 [Documentation](#documentation) \u2022 [Examples](#examples)\n\n</div>\n\n---\n\n## \ud83c\udfaf What is SOLLOL?\n\nSOLLOL (Super Ollama Load balancer & Orchestration Layer) transforms your collection of Ollama nodes into an **intelligent AI cluster** with adaptive routing and automatic failover\u2014all running on your own hardware.\n\n### The Problem\n\nYou have multiple machines with GPUs running Ollama, but:\n- \u274c Manual node selection for each request\n- \u274c No way to run models larger than your biggest GPU\n- \u274c Can't distribute multi-agent workloads efficiently\n- \u274c No automatic failover or load balancing\n- \u274c Zero visibility into cluster performance\n\n### The SOLLOL Solution\n\nSOLLOL provides:\n- \u2705 **Intelligent routing** that learns which nodes work best for each task\n- \u2705 **Model sharding** to run 70B+ models across multiple machines\n- \u2705 **Parallel agent execution** for multi-agent frameworks\n- \u2705 **Auto-discovery** of all nodes and capabilities\n- \u2705 **Built-in observability** with real-time metrics\n- \u2705 **Zero-config deployment** - just point and go\n\n---\n\n## \ud83d\ude80 Why SOLLOL?\n\n### 1. **Two Distribution Modes in One System**\n\nSOLLOL combines both task distribution and model sharding:\n\n#### \ud83d\udcca Task Distribution (Horizontal Scaling)\nDistribute **multiple requests** across your cluster in parallel:\n```python\n# Run 10 agents simultaneously across 5 nodes\npool = OllamaPool.auto_configure()\nresponses = await asyncio.gather(*[\n    pool.chat(model=\"llama3.2\", messages=[...])\n    for _ in range(10)\n])\n# Parallel execution across available nodes\n```\n\n#### \ud83e\udde9 Model Sharding (Vertical Scaling)\nRun **single large models** that don't fit on one machine:\n```python\n# Run larger models across multiple nodes\n# Note: Verified with 13B across 2-3 nodes; larger models not extensively tested\nrouter = HybridRouter(\n    enable_distributed=True,\n    num_rpc_backends=4\n)\nresponse = await router.route_request(\n    model=\"llama3:70b\",  # Sharded automatically\n    messages=[...]\n)\n```\n\n**Use them together!** Small models use task distribution, large models use sharding.\n\n---\n\n### 2. 
**Intelligent, Not Just Balanced**\n\nSOLLOL doesn't just distribute requests randomly\u2014it **learns** and **optimizes**:\n\n| Feature | Simple Load Balancer | SOLLOL |\n|---------|---------------------|---------|\n| **Routing** | Round-robin | Context-aware scoring |\n| **Learning** | None | Adapts from performance history |\n| **Resource Awareness** | None | GPU/CPU/memory-aware |\n| **Task Optimization** | None | Routes by task type complexity |\n| **Failover** | Manual | Automatic with health checks |\n| **Priority** | FIFO | Priority queue with fairness |\n\n**Example**: SOLLOL automatically routes:\n- Heavy generation tasks \u2192 GPU nodes with 24GB VRAM\n- Fast embeddings \u2192 CPU nodes or smaller GPUs\n- Critical requests \u2192 Fastest, most reliable nodes\n- Batch processing \u2192 Lower priority, distributed load\n\n---\n\n### 3. **Production-Ready from Day One**\n\n```python\nfrom sollol import SOLLOL, SOLLOLConfig\n\n# Literally 3 lines to production\nconfig = SOLLOLConfig.auto_discover()\nsollol = SOLLOL(config)\nsollol.start()  # \u2705 Gateway running on :8000\n```\n\n**Out of the box**:\n- Auto-discovery of Ollama nodes\n- Health monitoring and failover\n- Prometheus metrics\n- Web dashboard\n- Connection pooling\n- Request hedging\n- Priority queuing\n\n---\n\n## \ud83c\udfd7\ufe0f Architecture\n\n### High-Level Overview\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                  Your Application                       \u2502\n\u2502         (SynapticLlamas, custom agents, etc.)          
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                       \u2502\n                       \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                 SOLLOL Gateway (:8000)                  \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502         Intelligent Routing Engine               \u2502  \u2502\n\u2502  \u2502  \u2022 Analyzes: task type, complexity, resources    \u2502  \u2502\n\u2502  \u2502  \u2022 Scores: all nodes based on context            \u2502  \u2502\n\u2502  \u2502  \u2022 Learns: from performance history              \u2502  \u2502\n\u2502  \u2502  \u2022 Routes: to optimal node                       \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502          Priority Queue + Failover               \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n         \u2502                         \u2502\n         \u25bc                         \u25bc\n  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510          \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n  \u2502 Task Mode   \u2502          \u2502  Shard Mode  \u2502\n  \u2502 Ray Cluster \u2502          \u2502  llama.cpp   \u2502\n  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2518          
\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n         \u2502                         \u2502\n         \u25bc                         \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502              Your Heterogeneous Cluster                 \u2502\n\u2502  GPU (24GB) \u2502 GPU (16GB) \u2502 CPU (64c) \u2502 GPU (8GB) \u2502...  \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n### How Routing Works\n\n```python\n# 1. Request arrives\nPOST /api/chat {\n  \"model\": \"llama3.2\",\n  \"messages\": [{\"role\": \"user\", \"content\": \"Complex analysis task...\"}],\n  \"priority\": 8\n}\n\n# 2. SOLLOL analyzes\ntask_type = \"generation\"       # Auto-detected\ncomplexity = \"high\"             # Token count analysis\nrequires_gpu = True             # Based on task\nestimated_duration = 3.2s       # From history\n\n# 3. SOLLOL scores all nodes\nNode A (GPU 24GB, load: 0.2, latency: 120ms) \u2192 Score: 185.3 \u2713 WINNER\nNode B (GPU 8GB,  load: 0.6, latency: 200ms) \u2192 Score: 92.1\nNode C (CPU only, load: 0.1, latency: 80ms)  \u2192 Score: 41.2\n\n# 4. Routes to Node A, monitors execution, learns for next time\n```\n\n**Scoring Algorithm**:\n```\nScore = 100.0 (baseline)\n      \u00d7 success_rate (0.0-1.0)\n      \u00f7 (1 + latency_penalty)\n      \u00d7 gpu_bonus (1.5x if GPU available & needed)\n      \u00f7 (1 + load_penalty)\n      \u00d7 priority_alignment\n      \u00d7 task_specialization\n```\n\n---\n\n## \ud83d\udce6 Installation\n\n### Quick Install (PyPI)\n```bash\npip install sollol\n```\n\n### From Source\n```bash\ngit clone https://github.com/BenevolentJoker-JohnL/SOLLOL.git\ncd SOLLOL\npip install -e .\n```\n\n---\n\n## \u26a1 Quick Start\n\n### 1. Synchronous API (No async/await needed!)\n\n**New in v0.3.6:** SOLLOL now provides a synchronous API for easier integration with non-async applications.\n\n```python\nfrom sollol.sync_wrapper import OllamaPool\nfrom sollol.priority_helpers import Priority\n\n# Auto-discover and connect to all Ollama nodes\npool = OllamaPool.auto_configure()\n\n# Make requests - SOLLOL routes intelligently\n# No async/await needed!\nresponse = pool.chat(\n    model=\"llama3.2\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n    priority=Priority.HIGH,  # Semantic priority levels\n    timeout=60  # Request timeout in seconds\n)\n\nprint(response['message']['content'])\nprint(f\"Routed to: {response.get('_sollol_routing', {}).get('host', 'unknown')}\")\n```\n\n**Key features of synchronous API:**\n- \u2705 No async/await syntax required\n- \u2705 Works with synchronous agent frameworks\n- \u2705 Same intelligent routing and features\n- \u2705 Runs async code in background thread automatically\n\n---\n\n### 2. 
Async API (Original)\n\nFor async applications, use the original async API:\n\n```python\nfrom sollol import OllamaPool\n\n# Auto-discover and connect to all Ollama nodes\npool = await OllamaPool.auto_configure()\n\n# Make requests - SOLLOL routes intelligently\nresponse = await pool.chat(\n    model=\"llama3.2\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}]\n)\n\nprint(response['message']['content'])\nprint(f\"Routed to: {response['_sollol_routing']['host']}\")\nprint(f\"Task type: {response['_sollol_routing']['task_type']}\")\n```\n\n---\n\n### 3. Priority-Based Multi-Agent Execution\n\n**New in v0.3.6:** Use semantic priority levels and role-based mapping.\n\n```python\nfrom sollol.sync_wrapper import OllamaPool\nfrom sollol.priority_helpers import Priority, get_priority_for_role\n\npool = OllamaPool.auto_configure()\n\n# Define agents with different priorities\nagents = [\n    {\"name\": \"Researcher\", \"role\": \"researcher\"},  # Priority 8\n    {\"name\": \"Editor\", \"role\": \"editor\"},          # Priority 6\n    {\"name\": \"Summarizer\", \"role\": \"summarizer\"},  # Priority 5\n]\n\nfor agent in agents:\n    priority = get_priority_for_role(agent[\"role\"])\n\n    response = pool.chat(\n        model=\"llama3.2\",\n        messages=[{\"role\": \"user\", \"content\": f\"Task for {agent['name']}\"}],\n        priority=priority\n    )\n    # User-facing agents get priority, background tasks wait\n```\n\n**Priority levels available:**\n- `Priority.CRITICAL` (10) - Mission-critical\n- `Priority.URGENT` (9) - Fast response needed\n- `Priority.HIGH` (7) - Important tasks\n- `Priority.NORMAL` (5) - Default\n- `Priority.LOW` (3) - Background tasks\n- `Priority.BATCH` (1) - Can wait\n\n---\n\n### 4. Model Sharding with llama.cpp (Large Models)\n\n**Run models larger than your biggest GPU** by distributing layers across multiple machines.\n\n#### When to Use Model Sharding\n\nUse model sharding when:\n- \u2705 Model doesn't fit on your largest GPU (e.g., 70B models on 16GB GPUs)\n- \u2705 You have multiple machines with network connectivity\n- \u2705 You can tolerate slower inference for capability\n\nDon't use sharding when:\n- \u274c Model fits on a single GPU (use task distribution instead)\n- \u274c You need maximum inference speed\n- \u274c Network latency is high (>10ms between machines)\n\n#### Quick Start: Auto-Setup (Easiest)\n\n```python\nfrom sollol.sync_wrapper import HybridRouter, OllamaPool\n\n# SOLLOL handles all setup automatically\nrouter = HybridRouter(\n    ollama_pool=OllamaPool.auto_configure(),\n    enable_distributed=True,  # Enable model sharding\n    auto_setup_rpc=True,      # Auto-configure RPC backends\n    num_rpc_backends=3        # Distribute across 3 machines\n)\n\n# Use large model that doesn't fit on one machine\nresponse = router.route_request(\n    model=\"llama3.1:70b\",  # Automatically sharded across backends\n    messages=[{\"role\": \"user\", \"content\": \"Explain quantum computing\"}]\n)\n\nprint(response['message']['content'])\n```\n\n**What happens automatically:**\n1. SOLLOL discovers available RPC backends on your network\n2. Extracts the GGUF model from Ollama storage\n3. Starts llama-server coordinator with optimal settings\n4. Distributes model layers across backends\n5. 
Routes your request to the coordinator\n\n#### RPC Server Auto-Installation\n\n**SOLLOL can automatically clone, build, and start llama.cpp RPC servers for you!**\n\n**One-line installation:**\n\n```python\nfrom sollol.rpc_auto_setup import auto_setup_rpc_backends\n\n# Automatically: clone \u2192 build \u2192 start RPC servers\nbackends = auto_setup_rpc_backends(num_backends=2)\n# Output: [{'host': '127.0.0.1', 'port': 50052}, {'host': '127.0.0.1', 'port': 50053}]\n```\n\n**What this does:**\n1. \u2705 Scans network for existing RPC servers\n2. \u2705 If none found: clones llama.cpp to `~/llama.cpp`\n3. \u2705 Builds llama.cpp with RPC support (`cmake -DGGML_RPC=ON`)\n4. \u2705 Starts RPC servers on ports 50052-50053\n5. \u2705 Returns ready-to-use backend list\n\n**CLI installation:**\n\n```bash\n# Full automated setup (clone + build + install systemd service)\npython3 -m sollol.setup_llama_cpp --all\n\n# Or step by step\npython3 -m sollol.setup_llama_cpp --clone  # Clone llama.cpp\npython3 -m sollol.setup_llama_cpp --build  # Build with RPC support\npython3 -m sollol.setup_llama_cpp --start  # Start RPC server\n```\n\n**Docker IP Resolution:**\n\nSOLLOL automatically resolves Docker container IPs to accessible host IPs:\n\n```python\n# If Docker container reports IP 172.17.0.5:11434\n# SOLLOL automatically resolves to:\n# \u2192 127.0.0.1:11434 (published port mapping)\n# \u2192 host IP (if accessible)\n# \u2192 Docker host gateway\n\nfrom sollol import is_docker_ip, resolve_docker_ip\n\n# Check if IP is Docker internal\nis_docker = is_docker_ip(\"172.17.0.5\")  # True\n\n# Resolve Docker IP to accessible IP\naccessible_ip = resolve_docker_ip(\"172.17.0.5\", port=11434)\n# Returns: \"127.0.0.1\" or host IP\n```\n\n**Network Discovery with Docker Support:**\n\n```python\nfrom sollol import OllamaPool\n\n# Auto-discover nodes (automatically resolves Docker IPs)\npool = OllamaPool.auto_configure()\n\n# Manual control\nfrom sollol.discovery import discover_ollama_nodes\nnodes = discover_ollama_nodes(auto_resolve_docker=True)\n```\n\n**Multi-Node Production Setup:**\n\nFor distributed clusters, use systemd services on each node:\n\n```bash\n# On each RPC node\nsudo systemctl enable llama-rpc@50052.service\nsudo systemctl start llama-rpc@50052.service\n```\n\nSee [SOLLOL_RPC_SETUP.md](https://github.com/BenevolentJoker-JohnL/FlockParser/blob/main/SOLLOL_RPC_SETUP.md) for complete installation guide.\n\n#### Architecture: How It Works\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502    Llama 3.1 70B Model (40GB total)        \u2502\n\u2502           Distributed Sharding             \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                    \u2502\n       \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n       \u2502            \u2502            \u2502\n       \u25bc            \u25bc            \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502  Machine 1   \u2502 \u2502  Machine 2   \u2502 \u2502  Machine 3   \u2502\n\u2502 Layers 0-26  \u2502 \u2502 Layers 27-53 \u2502 \u2502 Layers 54-79 \u2502\n\u2502   (~13GB)    \u2502 \u2502   (~13GB)    \u2502 \u2502   (~13GB)    \u2502\n\u2502 RPC Backend  \u2502 \u2502 RPC Backend  \u2502 \u2502 RPC Backend  \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n       \u25b2            \u25b2            \u25b2\n       \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n                    \u2502\n         \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n         \u2502 llama-server        \u2502\n         \u2502 Coordinator         \u2502\n         \u2502 (Port 18080)        \u2502\n         \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n#### Manual Setup (Advanced)\n\nFor explicit control over RPC backends:\n\n```python\nfrom sollol.llama_cpp_coordinator import LlamaCppCoordinator\nfrom sollol.rpc_registry import RPCBackendRegistry\n\n# 1. Register RPC backends explicitly\nregistry = RPCBackendRegistry()\nregistry.add_backend(\"rpc_1\", \"grpc://10.9.66.45:50052\")\nregistry.add_backend(\"rpc_2\", \"grpc://10.9.66.46:50052\")\nregistry.add_backend(\"rpc_3\", \"grpc://10.9.66.47:50052\")\n\n# 2. Create coordinator\ncoordinator = LlamaCppCoordinator(\n    coordinator_port=18080,\n    rpc_backends=registry.get_all_backends(),\n    context_size=4096,\n    gpu_layers=-1  # Use all available GPU layers\n)\n\n# 3. Start and use\nawait coordinator.start(model_name=\"llama3.1:70b\")\nresponse = await coordinator.generate(\n    prompt=\"Explain the theory of relativity\",\n    max_tokens=500\n)\n```\n\n#### Performance Expectations\n\n| Model Size | Single GPU | Sharded (3 nodes) | Trade-off |\n|------------|-----------|-------------------|-----------|\n| **13B** | \u2705 20 tok/s | \u2705 5 tok/s | -75% speed, works on 3\u00d7smaller GPUs |\n| **70B** | \u274c OOM | \u26a0\ufe0f 3-5 tok/s (est.) | Enables model that won't run otherwise |\n\n**Trade-offs:**\n- \ud83d\udc0c **Startup**: 2-5 minutes (model distribution + loading)\n- \ud83d\udc0c **Inference**: ~4x slower than local (network overhead)\n- \u2705 **Capability**: Run models that won't fit on single GPU\n\n**Learn More:**\n- \ud83d\udcd6 [Complete llama.cpp Guide](docs/llama_cpp_guide.md) - Setup, optimization, troubleshooting\n- \ud83d\udcbb [Working Examples](examples/llama_cpp_distributed.py) - 5 complete examples including conversation, batch processing, error handling\n\n---\n\n### 5. 
**Learn More:**
- 📖 [Complete llama.cpp Guide](docs/llama_cpp_guide.md) - Setup, optimization, troubleshooting
- 💻 [Working Examples](examples/llama_cpp_distributed.py) - 5 complete examples including conversation, batch processing, error handling

---

### 5. Batch Processing API

**New in v0.7.0:** RESTful API for asynchronous batch job management.

Submit large-scale batch operations (thousands of embeddings, bulk inference) and track progress via job IDs:

```python
import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["Document 1", "Document 2", ...],  # Can be thousands
    "metadata": {"source": "knowledge_base"}  # Optional metadata
})

job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")

# Poll for job status
import time
while True:
    status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}").json()

    progress = status["progress"]["percent"]
    print(f"Progress: {progress}%")

    if status["status"] in ("completed", "failed", "cancelled"):
        break  # Stop on any terminal state, not just success
    time.sleep(1)

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}").json()
embeddings = results["results"]  # List of embedding vectors
print(f"Processed {len(embeddings)} documents in {status['duration_seconds']}s")
```

**Available Batch Endpoints:**
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status
- `GET /api/batch/results/{job_id}` - Get job results
- `GET /api/batch/jobs?limit=100` - List recent jobs
- `DELETE /api/batch/jobs/{job_id}` - Cancel job

**Use cases:**
- Embedding large document collections (thousands of documents)
- Bulk inference for batch predictions
- Background processing without blocking
- Long-running operations with progress tracking

---

### 6. SOLLOL Detection

**New in v0.3.6:** Detect whether SOLLOL or native Ollama is running at an endpoint.

```python
import requests

def is_sollol(url="http://localhost:11434"):
    """Check if SOLLOL is running at the given URL."""
    try:
        # Method 1: Check X-Powered-By header
        response = requests.get(url, timeout=2)
        if response.headers.get("X-Powered-By") == "SOLLOL":
            return True

        # Method 2: Check health endpoint
        response = requests.get(f"{url}/api/health", timeout=2)
        data = response.json()
        if data.get("service") == "SOLLOL":
            return True
    except (requests.RequestException, ValueError):
        pass  # Unreachable or non-JSON response: treat as "not SOLLOL"

    return False

# Use it
if is_sollol("http://localhost:11434"):
    print("✓ SOLLOL detected - using intelligent routing")
else:
    print("Native Ollama detected")
```

**Why this matters:**
- Enables graceful fallback in client applications (see the sketch below)
- Makes SOLLOL a true drop-in replacement
- Clients can auto-detect and use SOLLOL features when available
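As a concrete illustration of graceful fallback, the sketch below opts into a SOLLOL-specific option only when detection succeeds. It reuses the `is_sollol()` helper above; whether the HTTP gateway accepts a top-level `priority` field in the request body is an assumption here (the Python API documented later does take a `priority` argument), so treat this as a pattern rather than a guaranteed payload format:

```python
import requests

def chat(messages, model="llama3.2", url="http://localhost:11434"):
    """Send a chat request, enabling SOLLOL-only options when SOLLOL is detected."""
    payload = {"model": model, "messages": messages, "stream": False}

    if is_sollol(url):  # helper defined above
        # Assumption: SOLLOL's HTTP gateway honors a 'priority' hint in the payload.
        payload["priority"] = 7

    response = requests.post(f"{url}/api/chat", json=payload, timeout=120)
    response.raise_for_status()
    return response.json()

reply = chat([{"role": "user", "content": "Hello!"}])
print(reply["message"]["content"])
```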
---

### 7. Production Gateway

```python
from sollol import SOLLOL, SOLLOLConfig

# Full production setup
config = SOLLOLConfig(
    ray_workers=4,
    dask_workers=2,
    hosts=["gpu-1:11434", "gpu-2:11434", "cpu-1:11434"],
    gateway_port=8000,
    metrics_port=9090
)

sollol = SOLLOL(config)
sollol.start()  # Blocks and runs gateway

# Access via HTTP:
# curl http://localhost:8000/api/chat -d '{...}'
# curl http://localhost:8000/api/stats
# curl http://localhost:8000/api/dashboard
```

---

## 🎓 Use Cases

### 1. Multi-Agent AI Systems (SynapticLlamas, CrewAI, AutoGPT)

**Problem**: Running 10 agents sequentially takes 10x longer than necessary.

**Solution**: SOLLOL distributes agents across nodes in parallel.

```python
# Before: Sequential execution on one node
# After: Parallel execution with SOLLOL
pool = OllamaPool.auto_configure()
agents = await asyncio.gather(*[
    pool.chat(model="llama3.2", messages=agent_prompts[i])
    for i in range(10)
])
# Speedup depends on the number of available nodes and their capacity
```

### 2. Large Model Inference

**Problem**: Your model doesn't fit in available VRAM.

**Solution**: SOLLOL can shard models across multiple machines via llama.cpp.

```python
# Distribute model across multiple nodes
# Note: Verified with 13B models; larger models not extensively tested
router = HybridRouter(
    enable_distributed=True,
    num_rpc_backends=4
)
# Trade-off: Slower startup/inference but enables running larger models
```

### 3. Mixed Workloads

**Problem**: Different tasks need different resources.

**Solution**: SOLLOL routes each task to the optimal node.

```python
pool = OllamaPool.auto_configure()

# Heavy generation → GPU node
chat = pool.chat(model="llama3.1:70b", messages=[...])

# Fast embeddings → CPU node
embeddings = pool.embed(model="nomic-embed-text", input=[...])

# SOLLOL automatically routes each to the best available node
```

### 4. High Availability Production

**Problem**: Node failures break your service.

**Solution**: SOLLOL automatically fails over and recovers.

```python
# Node A fails mid-request
# ✅ SOLLOL automatically:
# 1. Detects the failure
# 2. Retries on Node B
# 3. Marks Node A as degraded
# 4. Periodically re-checks Node A
# 5. Restores Node A when healthy
```
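SOLLOL performs these failover steps internally; the sketch below just makes the same pattern explicit at the application level for readers who want to reason about it. The endpoint URLs are placeholders, and this illustrates the retry-then-failover idea rather than SOLLOL's internal implementation:

```python
import requests

# Placeholder node/gateway URLs for illustration
ENDPOINTS = ["http://node-a:11434", "http://node-b:11434"]

def chat_with_failover(payload, endpoints=ENDPOINTS, retries_per_node=1):
    """Try each endpoint in order, moving on when a node fails."""
    last_error = None
    for url in endpoints:
        for _ in range(retries_per_node):
            try:
                response = requests.post(f"{url}/api/chat", json=payload, timeout=60)
                response.raise_for_status()
                return response.json()
            except requests.RequestException as exc:
                last_error = exc  # Node degraded or unreachable; try the next one
    raise RuntimeError(f"All endpoints failed: {last_error}")

# result = chat_with_failover({"model": "llama3.2", "messages": [...], "stream": False})
```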
---

## 📊 Performance & Benchmarks

### Validation Status

**What's Been Validated ✅**
- Single-node baseline performance measured
- Code exists and is reviewable (75+ modules)
- Tests pass in CI (57 tests, coverage tracked)
- Architecture implements intelligent routing

**What Needs Validation ⚠️**
- Comparative benchmarks (SOLLOL vs. round-robin)
- Multi-node performance improvements
- Real-world latency/throughput gains

📖 **See [BENCHMARKING.md](BENCHMARKING.md) for the complete validation roadmap and how to run comparative tests.**

---

### Measured Baseline Performance

**Single Ollama Node** (llama3.2-3B, 50 requests, concurrency=5):
- ✅ **Success Rate:** 100%
- ⚡ **Throughput:** 0.51 req/s
- 📈 **Average Latency:** 5,659 ms
- 📈 **P95 Latency:** 11,299 ms
- 📈 **P99 Latency:** 12,259 ms

**Hardware:** Single Ollama instance with 75+ models loaded
**Data:** See [`benchmarks/results/`](benchmarks/results/) for raw JSON

**Run Your Own:**
```bash
# Baseline test (no cluster needed)
python benchmarks/simple_ollama_benchmark.py llama3.2 50

# Comparative test (requires docker-compose)
docker-compose up -d
python benchmarks/run_benchmarks.py --sollol-url http://localhost:8000 --duration 60
```

---

### Projected Performance (Unvalidated)

**Note:** These are architectural projections, not measured results; they require a multi-node cluster to validate.

**Theory:** With N nodes and a parallelizable workload:
- Task distribution can approach N× parallelization (limited by request rate)
- Intelligent routing should reduce tail latencies vs. random selection
- Resource-aware placement reduces contention and failures

**Reality:** Multi-node validation is still pending. See [BENCHMARKING.md](BENCHMARKING.md) for the test procedure and [CODE_WALKTHROUGH.md](CODE_WALKTHROUGH.md) for implementation details.

### Model Sharding Performance

| Model | Single 24GB GPU | SOLLOL (3×16GB) | Status |
|-------|----------------|-----------------|--------|
| **13B** | ✅ ~20 tok/s | ✅ ~5 tok/s | ✅ Verified working |
| **70B** | ❌ OOM | ⚠️ Estimated ~3-5 tok/s | ⚠️ Not extensively tested |

**When to use sharding**: When the model doesn't fit on your largest GPU. You trade speed for capability.

**Performance trade-offs**: Distributed inference adds 2-5 minutes of startup time (model distribution + loading) and runs ~4x slower than local inference. Use it only when the model genuinely won't fit locally.

### Overhead

- **Routing decision**: ~5-10ms (tested with 5-10 nodes)
- **Network overhead**: Varies by network (typically 5-20ms)
- **Total added latency**: ~20-50ms
- **Benefit**: Better resource utilization + automatic failover

---

## 🛠️ Advanced Configuration

### Custom Routing Strategy

```python
from sollol import OllamaPool

pool = OllamaPool(
    nodes=[
        {"host": "gpu-1.local", "port": 11434, "priority": 10},  # Prefer this
        {"host": "gpu-2.local", "port": 11434, "priority": 5},
        {"host": "cpu-1.local", "port": 11434, "priority": 1},   # Last resort
    ],
    enable_intelligent_routing=True,
    enable_hedging=True,  # Duplicate critical requests
    max_queue_size=100
)
```

### Priority-Based Scheduling

```python
# Critical user-facing request
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=10  # Highest priority
)

# Background batch job
response = pool.chat(
    model="llama3.2",
    messages=[...],
    priority=1  # Lowest priority
)

# SOLLOL ensures high-priority requests jump the queue
```

### Observability & Monitoring

```python
# Get detailed stats
stats = pool.get_stats()
print(f"Total requests: {stats['total_requests']}")
print(f"Average latency: {stats['avg_latency_ms']}ms")
print(f"Success rate: {stats['success_rate']:.2%}")

# Per-node breakdown
for host, metrics in stats['hosts'].items():
    print(f"{host}: {metrics['latency_ms']}ms, {metrics['success_rate']:.2%}")
```

```bash
# Prometheus metrics endpoint
curl http://localhost:9090/metrics

# sollol_requests_total{host="gpu-1:11434",model="llama3.2"} 1234
# sollol_latency_seconds{host="gpu-1:11434"} 0.234
# sollol_success_rate{host="gpu-1:11434"} 0.98
```
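The per-node stats above are enough to drive simple alerting even without Prometheus. A small watchdog sketch built only on the `get_stats()` fields shown in this README; `pool` is assumed to be an `OllamaPool` from the earlier examples, and the threshold and interval are arbitrary choices:

```python
import time

def watch_cluster(pool, min_success_rate=0.95, interval_s=30):
    """Periodically flag nodes whose success rate drops below a threshold."""
    while True:
        stats = pool.get_stats()
        for host, metrics in stats["hosts"].items():
            if metrics["success_rate"] < min_success_rate:
                print(f"⚠️ {host}: success rate {metrics['success_rate']:.2%}, "
                      f"latency {metrics['latency_ms']}ms")
        time.sleep(interval_s)

# watch_cluster(pool)  # run in a background thread or a separate process
```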
---

## 🔌 Integration Examples

### SynapticLlamas Integration

```python
from sollol import SOLLOL, SOLLOLConfig
from synaptic_llamas import AgentOrchestrator

# Set up SOLLOL for multi-agent orchestration
config = SOLLOLConfig.auto_discover()
sollol = SOLLOL(config)
sollol.start(blocking=False)

# SynapticLlamas now uses SOLLOL for intelligent routing
orchestrator = AgentOrchestrator(
    llm_endpoint="http://localhost:8000/api/chat"
)

# All agents automatically distributed and optimized
orchestrator.run_parallel_agents([...])
```

### LangChain Integration

```python
from langchain.llms import Ollama

# Point LangChain at the SOLLOL gateway (see Production Gateway above)
llm = Ollama(
    base_url="http://localhost:8000",
    model="llama3.2"
)

# LangChain requests now go through SOLLOL
response = llm("What is quantum computing?")
```

---

## 📚 Documentation

- **[Architecture Guide](ARCHITECTURE.md)** - Deep dive into system design
- **[Batch Processing API](BATCH_API.md)** - Complete guide to batch job management (NEW in v0.7.0)
  - API endpoints and examples
  - Job lifecycle and progress tracking
  - Best practices and error handling
- **[llama.cpp Distributed Inference Guide](docs/llama_cpp_guide.md)** - Complete guide to model sharding
  - Setup and configuration
  - Performance optimization
  - Troubleshooting common issues
  - Advanced topics (custom layer distribution, monitoring, etc.)
- **[Integration Examples](examples/integration/)** - Practical integration patterns
  - [Synchronous Agent Integration](examples/integration/sync_agents.py)
  - [Priority Configuration](examples/integration/priority_mapping.py)
  - [Load Balancer Wrapper](examples/integration/load_balancer_wrapper.py)
- **[llama.cpp Distributed Examples](examples/llama_cpp_distributed.py)** - Model sharding examples
  - Auto-setup and manual configuration
  - Multi-turn conversations with monitoring
  - Batch processing with multiple models
  - Error handling and recovery patterns
- **[Deployment Guide](docs/deployment.md)** - Production deployment patterns
- **[API Reference](docs/api.md)** - Complete API documentation
- **[Performance Tuning](docs/performance.md)** - Optimization guide
- **[SynapticLlamas Learnings](SYNAPTICLLAMAS_LEARNINGS.md)** - Features from production use

---
## 🆕 What's New in v0.7.0

### 📦 Batch Processing API
Complete RESTful API for asynchronous batch job management. Submit large-scale batch operations (embeddings, bulk inference) and track progress via job IDs.

```python
import requests

# Submit batch embedding job (up to 10,000 documents)
response = requests.post("http://localhost:11434/api/batch/embed", json={
    "model": "nomic-embed-text",
    "documents": ["doc1", "doc2", ...],  # Thousands of documents
})
job_id = response.json()["job_id"]

# Check status
status = requests.get(f"http://localhost:11434/api/batch/jobs/{job_id}")
print(status.json()["progress"]["percent"])  # e.g., 100.0

# Get results
results = requests.get(f"http://localhost:11434/api/batch/results/{job_id}")
embeddings = results.json()["results"]
```

**Batch API Endpoints:**
- `POST /api/batch/embed` - Submit batch embedding job
- `GET /api/batch/jobs/{job_id}` - Get job status with progress tracking
- `GET /api/batch/results/{job_id}` - Retrieve job results and errors
- `DELETE /api/batch/jobs/{job_id}` - Cancel running jobs
- `GET /api/batch/jobs?limit=100` - List recent jobs

**Features:**
- UUID-based job tracking with 5 states (PENDING, RUNNING, COMPLETED, FAILED, CANCELLED)
- Automatic TTL-based cleanup (1 hour default)
- Progress tracking: completed_items, failed_items, percentage
- Duration calculation and metadata storage
- Async job execution via Dask distributed processing

### Previous Features (v0.3.6+)

**Synchronous API** - No async/await required:
```python
from sollol.sync_wrapper import OllamaPool
pool = OllamaPool.auto_configure()
response = pool.chat(...)  # Synchronous call
```

**Priority Helpers** - Semantic priority levels (usage sketch below):
```python
from sollol.priority_helpers import Priority
priority = Priority.HIGH  # 7
```

**SOLLOL Detection:**
- `X-Powered-By: SOLLOL` header on all responses
- `/api/health` endpoint returns `{"service": "SOLLOL", "version": "0.7.0"}`
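A short sketch of feeding the semantic priority levels above into the `priority` argument used in the Priority-Based Scheduling examples earlier. It assumes the synchronous wrapper's `chat()` accepts the same `priority` keyword as the pool API shown before; only `Priority.HIGH` (documented as 7) is used, since other members aren't listed in this README:

```python
from sollol.sync_wrapper import OllamaPool
from sollol.priority_helpers import Priority

pool = OllamaPool.auto_configure()

# User-facing request: a semantic level instead of a magic number
answer = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize this incident for the on-call engineer."}],
    priority=Priority.HIGH,  # documented as 7
)

# Background job: plain numeric low priority, as in the scheduling examples
digest = pool.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize yesterday's logs."}],
    priority=1,
)
```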
---

## 🆚 Comparison

### SOLLOL vs. Simple Load Balancers

| Feature | nginx/HAProxy | SOLLOL |
|---------|--------------|---------|
| Routing | Round-robin/random | Context-aware, adapts from history |
| Resource awareness | None | GPU/CPU/memory-aware |
| Failover | Manual config | Automatic detection & recovery |
| Model sharding | ❌ | ✅ llama.cpp integration |
| Task prioritization | ❌ | ✅ Priority queue |
| Observability | Basic | Rich metrics + dashboard |
| Setup | Complex config | Auto-discovery |

### SOLLOL vs. Kubernetes

| Feature | Kubernetes | SOLLOL |
|---------|-----------|---------|
| **Complexity** | High - requires cluster setup | Low - pip install |
| **AI-specific** | Generic container orchestration | Purpose-built for LLMs |
| **Intelligence** | None | Task-aware routing |
| **Model sharding** | Manual | Automatic |
| **Best for** | Large-scale production | AI-focused teams |

**Use both!** Deploy SOLLOL on Kubernetes to combine container orchestration with LLM-aware routing.

---

## 🤝 Contributing

We welcome contributions! Areas we'd love help with:

- ML-based routing predictions
- Additional monitoring integrations
- Cloud provider integrations
- Performance optimizations
- Documentation improvements

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

---

## 📜 License

MIT License - see [LICENSE](LICENSE) file for details.

---

## 🙏 Credits

Created by [BenevolentJoker-JohnL](https://github.com/BenevolentJoker-JohnL)

Part of the [SynapticLlamas](https://github.com/BenevolentJoker-JohnL/SynapticLlamas) ecosystem.

Built with: Ray, Dask, FastAPI, llama.cpp, Ollama

---

## 🎯 What Makes SOLLOL Different?

1. **Combines task distribution AND model sharding** in one system
2. **Context-aware routing** that adapts based on performance metrics
3. **Auto-discovery** of nodes with minimal configuration
4. **Built-in failover** and priority queuing
5. **Purpose-built for Ollama clusters** (understands GPU requirements, task types)

**Limitations to know**:
- Model sharding verified with 13B models; larger models not extensively tested
- Performance benefits depend on network latency and workload patterns
- Not a drop-in replacement for single-node setups in all scenarios

---

<div align="center">

**Stop manually managing your LLM cluster. Let SOLLOL optimize it for you.**

[Get Started](#quick-start) • [View on GitHub](https://github.com/BenevolentJoker-JohnL/SOLLOL) • [Report Issue](https://github.com/BenevolentJoker-JohnL/SOLLOL/issues)

</div>
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Intelligent Load Balancer for Ollama Clusters - 3 Distribution Modes: Ray Parallel + Dask Batch + llama.cpp Sharding",
    "version": "0.8.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/BenevolentJoker-JohnL/SOLLOL/issues",
        "Documentation": "https://github.com/BenevolentJoker-JohnL/SOLLOL/blob/main/README.md",
        "Homepage": "https://github.com/BenevolentJoker-JohnL/SOLLOL",
        "Repository": "https://github.com/BenevolentJoker-JohnL/SOLLOL"
    },
    "split_keywords": [
        "ai",
        " llm",
        " distributed",
        " ollama",
        " load-balancing",
        " inference",
        " llama-cpp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bf82baf6d96a7a56d72c01c62887ed476f22f466cc9d11a4f0b34033c9413d1d",
                "md5": "519235a3cc35632a35cb38bbd7ecb940",
                "sha256": "224393c81b218550aab12617d75846c687ef73b4fe7986b441e885f3490c8448"
            },
            "downloads": -1,
            "filename": "sollol-0.8.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "519235a3cc35632a35cb38bbd7ecb940",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 169086,
            "upload_time": "2025-10-07T01:23:20",
            "upload_time_iso_8601": "2025-10-07T01:23:20.612737Z",
            "url": "https://files.pythonhosted.org/packages/bf/82/baf6d96a7a56d72c01c62887ed476f22f466cc9d11a4f0b34033c9413d1d/sollol-0.8.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2284507a9031a7ba403dc4424528d607acc401a75f0a07f5a48e8a02bfdd1c09",
                "md5": "6c14f38f68e80c98bcf75a9327224f81",
                "sha256": "eaab7b1f314628511062ba44ca5224c52b44ad87849b40d5a534352ea19f8456"
            },
            "downloads": -1,
            "filename": "sollol-0.8.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6c14f38f68e80c98bcf75a9327224f81",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 352966,
            "upload_time": "2025-10-07T01:23:22",
            "upload_time_iso_8601": "2025-10-07T01:23:22.152657Z",
            "url": "https://files.pythonhosted.org/packages/22/84/507a9031a7ba403dc4424528d607acc401a75f0a07f5a48e8a02bfdd1c09/sollol-0.8.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-07 01:23:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "BenevolentJoker-JohnL",
    "github_project": "SOLLOL",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "sollol"
}
        