# LLMuxer
[PyPI](https://pypi.org/project/llmuxer/)
[Python 3.8+](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[CI](https://github.com/mihirahuja1/llmuxer/actions)
[Coverage](https://codecov.io/gh/mihirahuja1/llmuxer)
[Downloads](https://pepy.tech/project/llmuxer)
[Stars](https://github.com/mihirahuja1/llmuxer/stargazers)
**Find the cheapest LLM that meets your quality bar** *(Currently supports classification tasks only)*
## Quick Start
[Open in Colab](https://colab.research.google.com/github/mihirahuja1/llmuxer/blob/main/examples/quickstart.ipynb)
```python
import llmuxer
# Example: Classify sentiment with 90% accuracy requirement
examples = [
    {"input": "This product is amazing!", "label": "positive"},
    {"input": "Terrible service", "label": "negative"},
    {"input": "It's okay", "label": "neutral"},
]

result = llmuxer.optimize_cost(
    baseline="gpt-4",
    examples=examples,
    task="classification",  # Currently only classification is supported
    options=["positive", "negative", "neutral"],
    min_accuracy=0.9,  # Require 90% accuracy
)
print(result)
# Takes ~30-60 seconds for small datasets, ~10-15 minutes for 1k samples
```
### Example Output
```python
{
    "model": "anthropic/claude-3-haiku",
    "accuracy": 0.92,
    "cost_per_million": 0.25,
    "cost_savings": 0.975,  # 97.5% cheaper than GPT-4
    "baseline_cost_per_million": 10.0,
    "tokens_evaluated": 1500
}
```
## The Problem
You're using GPT-4 for classification. It works well but costs $20/million tokens. Could GPT-3.5 do just as well for $0.50? What about Claude Haiku at $0.25? Or Llama-3.1 at $0.06?
**LLMuxer automatically tests your classification task across 18 models to find the cheapest one that maintains your required accuracy.**
## How It Works
```
Your Dataset → LLMuxer → Tests 18 Models → Returns Cheapest That Works
                                ↓
                       Uses OpenRouter API
                       (unified interface)
```
LLMuxer:
1. Takes your baseline model (e.g., GPT-4) and test dataset
2. Evaluates cheaper alternatives via OpenRouter
3. Returns the cheapest model meeting your accuracy threshold
4. Shows detailed cost breakdown and savings
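In pseudocode, the selection logic looks roughly like the sketch below. This is illustrative only; `pick_cheapest`, its arguments, and the `evaluate` callback are not part of LLMuxer's public API.

```python
# Conceptual sketch of the selection loop -- illustrative only, not LLMuxer's
# actual internals.
from typing import Callable, Dict, List, Tuple

def pick_cheapest(
    candidates: List[Tuple[str, float]],   # (model_name, $ per million tokens)
    evaluate: Callable[[str], float],      # returns accuracy for a given model
    min_accuracy: float,
) -> Dict:
    # Walk candidates from cheapest to most expensive and return the first
    # one that clears the accuracy bar.
    for model, price in sorted(candidates, key=lambda c: c[1]):
        accuracy = evaluate(model)
        if accuracy >= min_accuracy:
            return {"model": model, "accuracy": accuracy, "cost_per_million": price}
    return {"error": "No model met the accuracy threshold"}
```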
## Installation
### Prerequisites
- Python 3.8+
- [OpenRouter API key](https://openrouter.ai/keys) (for model access)
### Install
```bash
pip install llmuxer
```
### Setup
```bash
export OPENROUTER_API_KEY="your-api-key-here"
```
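Alternatively, set the key from Python before calling the library. This assumes LLMuxer reads `OPENROUTER_API_KEY` from the environment, matching the shell setup above:

```python
import os

# Set the key for the current process only; prefer the shell export (or a
# secrets manager) for real deployments.
os.environ["OPENROUTER_API_KEY"] = "your-api-key-here"
```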
## Key Features
- **18 models tested** - OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek
- **Smart stopping** - Skips smaller models if larger ones fail
- **Cost breakdown** - See token counts and costs per model
- **Fast testing** - Use `sample_size` to test on subset first
- **Simple API** - One function does everything
- **Classification only** - Support for extraction, generation, and binary tasks coming in v0.2
## Benchmarks
### Tested Models
*Live pricing data from OpenRouter API (updated automatically):*
| Provider | Models | Price Range ($/M tokens) |
|----------|--------|--------------------------|
| OpenAI | gpt-4o-mini, gpt-3.5-turbo | $0.75 - $2.00 |
| Anthropic | claude-3-haiku | $1.50 |
| DeepSeek | deepseek-chat | $0.90 |
| Mistral | 3 models | $0.08 - $8.00 |
| Meta | llama-3.1-8b-instruct, llama-3.1-70b-instruct | $0.04 - $0.38 |
**Total: 9 models across 5 providers**
### Reproduce Our Benchmarks
```bash
# Test all 9 models on Banking77 dataset
python scripts/prepare_banking77.py
python examples/banking77_test.py
```
**Expected Results:** Most models achieve 85-92% accuracy on Banking77. Claude-3-haiku typically provides the best accuracy/cost ratio for classification tasks.
### Performance Benchmarks
**Fixed Dataset Results** *(50 job classification samples, tested 2025-08-10)*
| Metric | Baseline (GPT-4o) | Best Model (Claude-3-haiku) | Savings |
|--------|------------------|---------------------------|---------|
| **Accuracy** | ~95% (assumed) | **92.0%** | Quality maintained |
| **Cost/Million Tokens** | $12.50 | **$1.50** | **88.0% cheaper** |
| **Cost/Request*** | $0.001875 | **$0.000225** | **$0.00165 saved** |
| **Monthly (1K requests)** | $1.88 | **$0.23** | **$1.65 saved** |
*Conservative estimate: 150 tokens/request (100 input + 50 output)*
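The per-request and monthly numbers follow directly from the 150 tokens/request assumption; a quick sanity check:

```python
# Sanity-check the per-request figures in the table above.
TOKENS_PER_REQUEST = 150  # 100 input + 50 output (conservative estimate)

def cost_per_request(price_per_million_tokens: float) -> float:
    return price_per_million_tokens * TOKENS_PER_REQUEST / 1_000_000

baseline = cost_per_request(12.50)   # GPT-4o          -> 0.001875
candidate = cost_per_request(1.50)   # claude-3-haiku  -> 0.000225
print(f"saved per 1K requests: ${(baseline - candidate) * 1000:.2f}")  # ~$1.65
```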
**[📊 Full Benchmark Report](docs/benchmarks.md)** | **[🔄 Reproduction Guide](#reproduction)**
### Reproduction
```bash
# Install and setup
pip install llmuxer
export OPENROUTER_API_KEY="your-key"
# Run exact benchmark
./scripts/bench.sh
# Generates: benchmarks/bench_YYYYMMDD.json + docs/benchmarks.md
```
**Benchmark Notes:**
- Fixed dataset: `data/jobs_50.jsonl` (8 categories, 50 samples)
- Pinned models: 8 specific models with exact API versions
- Conservative estimates: 150 tokens/request assumption
- No cherry-picking: Single test run results
- Quality threshold: 85%+ accuracy required
## API Reference
### `optimize_cost()`
Find the cheapest model meeting your requirements for classification tasks.
**Parameters:**
- `baseline` (str): Your current model (e.g., "gpt-4")
- `examples` (list): Test examples with input and label
- `dataset` (str): Path to JSONL file (alternative to examples)
- `task` (str): Must be "classification" (other tasks coming soon)
- `options` (list): Valid output classes for classification
- `min_accuracy` (float): Minimum acceptable accuracy (0.0-1.0)
- `sample_size` (float): Fraction of dataset to test (0.0-1.0)
- `prompt` (str): Optional system prompt
**Returns:**
Dictionary with model name, accuracy, cost, and savings.
**Error Handling:**
- Returns `{"error": "message"}` if no model meets threshold
- Retries on API failures
- Validates dataset format
## Full Example: Banking Intent Classification
```python
import llmuxer
# Using the Banking77 dataset (77 intent categories)
baseline = "gpt-4"

result = llmuxer.optimize_cost(
    baseline=baseline,
    dataset="data/banking77.jsonl",  # Your prepared dataset
    task="classification",
    min_accuracy=0.8,
    sample_size=0.2,  # Test on 20% first for speed
)

if "error" in result:
    print(f"No model found: {result['error']}")
else:
    print(f"Switch from {baseline} to {result['model']}")
    print(f"Save {result['cost_savings']:.0%} on costs")
    print(f"Accuracy: {result['accuracy']:.1%}")
```
## Dataset Format
JSONL format with `input` and `label` fields:
```json
{"input": "What's my account balance?", "label": "balance_inquiry"}
{"input": "I lost my card", "label": "card_lost"}
```
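If your examples already live in Python, a few lines of standard-library code write them out in this format (the file path is just an example):

```python
import json

examples = [
    {"input": "What's my account balance?", "label": "balance_inquiry"},
    {"input": "I lost my card", "label": "card_lost"},
]

# One JSON object per line, as expected by the `dataset=` parameter.
with open("data/my_dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```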
## Performance Notes
### Timing Estimates
For a dataset with 1,000 samples:
| Model Type | Time per 1k samples | Token Speed |
|------------|-------------------|-------------|
| GPT-3.5-turbo | ~45-60 seconds | ~2,000 tokens/sec |
| Claude-3-haiku | ~30-45 seconds | ~2,500 tokens/sec |
| Gemini-1.5-flash | ~20-30 seconds | ~3,000 tokens/sec |
| Llama-3.1-8b | ~15-25 seconds | ~3,500 tokens/sec |
| **Total for 18 models** | **~10-15 minutes** | Sequential |
### Speed Considerations
- **Sequential Processing**: Currently tests one model at a time (parallel in v0.2)
- **Sample Size**: Use `sample_size=0.1` to test on 10% first for quick validation (see the example after this list)
- **Smart Stopping**: Saves 30-50% time by skipping smaller models when larger ones fail
- **Rate Limits**: Automatic handling with exponential backoff
- **Caching**: Not yet implemented; planned for v0.2 and expected to cut re-evaluation time by up to 90%
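A typical workflow is a quick pass with a small `sample_size` followed by a full run (a sketch using only the documented parameters):

```python
import llmuxer

# Quick pass on ~10% of the data to validate the setup cheaply.
preview = llmuxer.optimize_cost(
    baseline="gpt-4",
    dataset="data/banking77.jsonl",
    task="classification",
    min_accuracy=0.8,
    sample_size=0.1,
)

# If the preview clears the bar, rerun on the full dataset for the final pick.
if "error" not in preview:
    final = llmuxer.optimize_cost(
        baseline="gpt-4",
        dataset="data/banking77.jsonl",
        task="classification",
        min_accuracy=0.8,
    )
    print(final)
```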
## Links
- [PyPI Package](https://pypi.org/project/llmuxer/)
- [GitHub Repository](https://github.com/mihirahuja1/llmuxer)
- [OpenRouter API Keys](https://openrouter.ai/keys)
- [Full Documentation](https://github.com/mihirahuja1/llmuxer/tree/main/docs)
- [Roadmap](https://github.com/mihirahuja1/llmuxer/blob/main/ROADMAP.md)
## License
MIT - see [LICENSE](LICENSE) file.
## Support
- Issues: [GitHub Issues](https://github.com/mihirahuja1/llmuxer/issues)
- Email: mihirahuja09@gmail.com