# LLMuxer
[PyPI](https://pypi.org/project/llmuxer/)
[Python 3.8+](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
[CI](https://github.com/mihirahuja1/llmuxer/actions)
[Coverage](https://codecov.io/gh/mihirahuja1/llmuxer)
[Downloads](https://pepy.tech/project/llmuxer)
[Stars](https://github.com/mihirahuja1/llmuxer/stargazers)
**Find the cheapest LLM that meets your quality bar** *(Currently supports classification tasks only)*
## Quick Start
[Open in Colab](https://colab.research.google.com/github/mihirahuja1/llmuxer/blob/main/examples/quickstart.ipynb)
```python
import llmuxer
# Example: Classify sentiment with 90% accuracy requirement
examples = [
    {"input": "This product is amazing!", "label": "positive"},
    {"input": "Terrible service", "label": "negative"},
    {"input": "It's okay", "label": "neutral"},
]

result = llmuxer.optimize_cost(
    baseline="gpt-4",
    examples=examples,
    task="classification",  # Currently only classification is supported
    options=["positive", "negative", "neutral"],
    min_accuracy=0.9,  # Require 90% accuracy
)
print(result)
# Takes ~30-60 seconds for small datasets, ~10-15 minutes for 1k samples
```
### Example Output
```python
{
    "model": "anthropic/claude-3-haiku",
    "accuracy": 0.92,
    "cost_per_million": 0.25,
    "cost_savings": 0.975,  # 97.5% cheaper than GPT-4
    "baseline_cost_per_million": 10.0,
    "tokens_evaluated": 1500
}
```
## The Problem
You're using GPT-4 for classification. It works well but costs $20/million tokens. Could GPT-3.5 do just as well for $0.50? What about Claude Haiku at $0.25? Or Llama-3.1 at $0.06?
**LLMuxer automatically tests your classification task across 18 models to find the cheapest one that maintains your required accuracy.**
## How It Works
```
Your Dataset → LLMuxer → Tests 18 Models → Returns Cheapest That Works
                                ↓
                       Uses OpenRouter API
                       (unified interface)
```
LLMuxer:
1. Takes your baseline model (e.g., GPT-4) and test dataset
2. Evaluates cheaper alternatives via OpenRouter
3. Returns the cheapest model meeting your accuracy threshold
4. Shows detailed cost breakdown and savings
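In pseudocode, the selection logic looks roughly like the sketch below. This is illustrative only; `pick_cheapest`, its arguments, and the `evaluate` callback are not part of LLMuxer's public API.

```python
# Conceptual sketch of the selection loop -- illustrative only, not LLMuxer's
# actual internals.
from typing import Callable, Dict, List, Tuple

def pick_cheapest(
    candidates: List[Tuple[str, float]],   # (model_name, $ per million tokens)
    evaluate: Callable[[str], float],      # returns accuracy for a given model
    min_accuracy: float,
) -> Dict:
    # Walk candidates from cheapest to most expensive and return the first
    # one that clears the accuracy bar.
    for model, price in sorted(candidates, key=lambda c: c[1]):
        accuracy = evaluate(model)
        if accuracy >= min_accuracy:
            return {"model": model, "accuracy": accuracy, "cost_per_million": price}
    return {"error": "No model met the accuracy threshold"}
```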
## Installation
### Prerequisites
- Python 3.8+
- [OpenRouter API key](https://openrouter.ai/keys) (for model access)
### Install
```bash
pip install llmuxer
```
### Setup
```bash
export OPENROUTER_API_KEY="your-api-key-here"
```
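Alternatively, set the key from Python before calling the library. This assumes LLMuxer reads `OPENROUTER_API_KEY` from the environment, matching the shell setup above:

```python
import os

# Set the key for the current process only; prefer the shell export (or a
# secrets manager) for real deployments.
os.environ["OPENROUTER_API_KEY"] = "your-api-key-here"
```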
## Key Features
- **18 models tested** - OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek
- **Smart stopping** - Skips smaller models if larger ones fail
- **Cost breakdown** - See token counts and costs per model
- **Fast testing** - Use `sample_size` to test on subset first
- **Simple API** - One function does everything
- **Classification only** - Support for extraction, generation, and binary tasks coming in v0.2
## Benchmarks
### Tested Models
*Live pricing data from OpenRouter API (updated automatically):*
| Provider | Models | Price Range ($/M tokens) |
|----------|--------|--------------------------|
| OpenAI | gpt-4o-mini, gpt-3.5-turbo | $0.75 - $2.00 |
| Anthropic | claude-3-haiku | $1.50 |
| DeepSeek | deepseek-chat | $0.90 |
| Mistral | 3 models | $0.08 - $8.00 |
| Meta | llama-3.1-8b-instruct, llama-3.1-70b-instruct | $0.04 - $0.38 |
**Total: 9 models across 5 providers**
### Reproduce Our Benchmarks
```bash
# Test all 9 models on Banking77 dataset
python scripts/prepare_banking77.py
python examples/banking77_test.py
```
**Expected Results:** Most models achieve 85-92% accuracy on Banking77. Claude-3-haiku typically provides the best accuracy/cost ratio for classification tasks.
### Performance Benchmarks
**Fixed Dataset Results** *(50 job classification samples, tested 2025-08-10)*
| Metric | Baseline (GPT-4o) | Best Model (Claude-3-haiku) | Savings |
|--------|------------------|---------------------------|---------|
| **Accuracy** | ~95% (assumed) | **92.0%** | Quality maintained |
| **Cost/Million Tokens** | $12.50 | **$1.50** | **88.0% cheaper** |
| **Cost/Request*** | $0.001875 | **$0.000225** | **$0.00165 saved** |
| **Monthly (1K requests)** | $1.88 | **$0.23** | **$1.65 saved** |
*Conservative estimate: 150 tokens/request (100 input + 50 output)*
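The per-request and monthly numbers follow directly from the 150 tokens/request assumption; a quick sanity check:

```python
# Sanity-check the per-request figures in the table above.
TOKENS_PER_REQUEST = 150  # 100 input + 50 output (conservative estimate)

def cost_per_request(price_per_million_tokens: float) -> float:
    return price_per_million_tokens * TOKENS_PER_REQUEST / 1_000_000

baseline = cost_per_request(12.50)   # GPT-4o          -> 0.001875
candidate = cost_per_request(1.50)   # claude-3-haiku  -> 0.000225
print(f"saved per 1K requests: ${(baseline - candidate) * 1000:.2f}")  # ~$1.65
```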
**[📊 Full Benchmark Report](docs/benchmarks.md)** | **[🔄 Reproduction Guide](#reproduction)**
### Reproduction
```bash
# Install and setup
pip install llmuxer
export OPENROUTER_API_KEY="your-key"
# Run exact benchmark
./scripts/bench.sh
# Generates: benchmarks/bench_YYYYMMDD.json + docs/benchmarks.md
```
**Benchmark Notes:**
- Fixed dataset: `data/jobs_50.jsonl` (8 categories, 50 samples)
- Pinned models: 8 specific models with exact API versions
- Conservative estimates: 150 tokens/request assumption
- No cherry-picking: Single test run results
- Quality threshold: 85%+ accuracy required
## API Reference
### `optimize_cost()`
Find the cheapest model meeting your requirements for classification tasks.
**Parameters:**
- `baseline` (str): Your current model (e.g., "gpt-4")
- `examples` (list): Test examples with input and label
- `dataset` (str): Path to JSONL file (alternative to examples)
- `task` (str): Must be "classification" (other tasks coming soon)
- `options` (list): Valid output classes for classification
- `min_accuracy` (float): Minimum acceptable accuracy (0.0-1.0)
- `sample_size` (float): Fraction of dataset to test (0.0-1.0)
- `prompt` (str): Optional system prompt
**Returns:**
Dictionary with model name, accuracy, cost, and savings.
**Error Handling:**
- Returns `{"error": "message"}` if no model meets threshold
- Retries on API failures
- Validates dataset format
## Full Example: Banking Intent Classification
```python
import llmuxer
# Using the Banking77 dataset (77 intent categories)
baseline = "gpt-4"

result = llmuxer.optimize_cost(
    baseline=baseline,
    dataset="data/banking77.jsonl",  # Your prepared dataset
    task="classification",
    min_accuracy=0.8,
    sample_size=0.2,  # Test on 20% first for speed
)

if "error" in result:
    print(f"No model found: {result['error']}")
else:
    print(f"Switch from {baseline} to {result['model']}")
    print(f"Save {result['cost_savings']:.0%} on costs")
    print(f"Accuracy: {result['accuracy']:.1%}")
```
## Dataset Format
JSONL format with `input` and `label` fields:
```json
{"input": "What's my account balance?", "label": "balance_inquiry"}
{"input": "I lost my card", "label": "card_lost"}
```
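If your examples already live in Python, a few lines of standard-library code write them out in this format (the file path is just an example):

```python
import json

examples = [
    {"input": "What's my account balance?", "label": "balance_inquiry"},
    {"input": "I lost my card", "label": "card_lost"},
]

# One JSON object per line, as expected by the `dataset=` parameter.
with open("data/my_dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```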
## Performance Notes
### Timing Estimates
For a dataset with 1,000 samples:
| Model Type | Time per 1k samples | Token Speed |
|------------|-------------------|-------------|
| GPT-3.5-turbo | ~45-60 seconds | ~2,000 tokens/sec |
| Claude-3-haiku | ~30-45 seconds | ~2,500 tokens/sec |
| Gemini-1.5-flash | ~20-30 seconds | ~3,000 tokens/sec |
| Llama-3.1-8b | ~15-25 seconds | ~3,500 tokens/sec |
| **Total for 18 models** | **~10-15 minutes** | Sequential |
### Speed Considerations
- **Sequential Processing**: Currently tests one model at a time (parallel in v0.2)
- **Sample Size**: Use `sample_size=0.1` to test on 10% first for quick validation (see the example after this list)
- **Smart Stopping**: Saves 30-50% time by skipping smaller models when larger ones fail
- **Rate Limits**: Automatic handling with exponential backoff
- **Caching**: Not yet implemented; planned for v0.2 and expected to cut re-evaluation time by up to 90%
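A typical workflow is a quick pass with a small `sample_size` followed by a full run (a sketch using only the documented parameters):

```python
import llmuxer

# Quick pass on ~10% of the data to validate the setup cheaply.
preview = llmuxer.optimize_cost(
    baseline="gpt-4",
    dataset="data/banking77.jsonl",
    task="classification",
    min_accuracy=0.8,
    sample_size=0.1,
)

# If the preview clears the bar, rerun on the full dataset for the final pick.
if "error" not in preview:
    final = llmuxer.optimize_cost(
        baseline="gpt-4",
        dataset="data/banking77.jsonl",
        task="classification",
        min_accuracy=0.8,
    )
    print(final)
```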
## Links
- [PyPI Package](https://pypi.org/project/llmuxer/)
- [GitHub Repository](https://github.com/mihirahuja1/llmuxer)
- [OpenRouter API Keys](https://openrouter.ai/keys)
- [Full Documentation](https://github.com/mihirahuja1/llmuxer/tree/main/docs)
- [Roadmap](https://github.com/mihirahuja1/llmuxer/blob/main/ROADMAP.md)
## License
MIT - see [LICENSE](LICENSE) file.
## Support
- Issues: [GitHub Issues](https://github.com/mihirahuja1/llmuxer/issues)
- Email: mihirahuja09@gmail.com