# vLLM Semantic Router Benchmark Suite
[Python 3.8+](https://www.python.org/downloads/)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
A comprehensive benchmark suite for evaluating **semantic router** performance against **direct vLLM** across multiple reasoning datasets. Perfect for researchers and developers working on LLM routing, evaluation, and performance optimization.
## 🎯 Key Features
- **6 Major Reasoning Datasets**: MMLU-Pro, ARC, GPQA, TruthfulQA, CommonsenseQA, HellaSwag
- **Router vs vLLM Comparison**: Side-by-side performance evaluation
- **Multiple Evaluation Modes**: NR (neutral), XC (explicit CoT), NR_REASONING (auto-reasoning)
- **Research-Ready Output**: CSV files and publication-quality plots
- **Dataset-Agnostic Architecture**: Easy to extend with new datasets
- **CLI Tools**: Simple command-line interface for common operations
## 🚀 Quick Start
### Installation
```bash
pip install vllm-semantic-router-bench
```
### Basic Usage
```bash
# Quick test on MMLU dataset
vllm-semantic-router-bench test --dataset mmlu --samples 5
# Full comparison between router and vLLM
vllm-semantic-router-bench compare --dataset arc --samples 10
# List available datasets
vllm-semantic-router-bench list-datasets
# Run comprehensive multi-dataset benchmark
vllm-semantic-router-bench comprehensive
```
### Python API
```python
from vllm_semantic_router_bench import DatasetFactory
# Load a dataset
factory = DatasetFactory()
dataset = factory.create_dataset("mmlu")
questions, info = dataset.load_dataset(samples_per_category=10)
print(f"Loaded {len(questions)} questions from {info.name}")
print(f"Categories: {info.categories}")
```
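Each dataset object can also render its questions into prompts. Continuing the snippet above, a minimal sketch (only the `"plain"` style appears in this README, in the `DatasetInterface` skeleton further below; other style names may exist but aren't documented here):

```python
# Continuing from the snippet above: `dataset` and `questions` are in scope.
# format_prompt renders a single Question into a prompt string; "plain" is
# the default style shown in the DatasetInterface skeleton further below.
prompt = dataset.format_prompt(questions[0], style="plain")
print(prompt)
```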
## 📊 Supported Datasets
| Dataset | Domain | Categories | Difficulty | CoT Support |
|---------|--------|------------|------------|-------------|
| **MMLU-Pro** | Academic Knowledge | 57 subjects | Undergraduate | ✅ |
| **ARC** | Scientific Reasoning | Science | Grade School | ❌ |
| **GPQA** | Graduate Q&A | Graduate-level | Graduate | ❌ |
| **TruthfulQA** | Truthfulness | Truthfulness | Hard | ❌ |
| **CommonsenseQA** | Common Sense | Common Sense | Hard | ❌ |
| **HellaSwag** | Commonsense NLI | ~50 activities | Moderate | ❌ |
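The same registry is reachable from Python via the package's exported `list_available_datasets` helper. A small sketch (the helper's exact return type isn't documented here, so it simply prints whatever the helper yields):

```python
from vllm_semantic_router_bench import list_available_datasets

# Each printed name can be passed to DatasetFactory.create_dataset()
# or to the CLI's --dataset flag.
for name in list_available_datasets():
    print(name)
```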
## 🔧 Advanced Usage
### Custom Evaluation Script
```python
import subprocess

# Run a detailed benchmark with custom parameters
cmd = [
    "router-bench",  # Console script installed by this package
    "--dataset", "mmlu",
    "--samples-per-category", "20",
    "--run-router", "--router-models", "auto",
    "--run-vllm", "--vllm-models", "openai/gpt-oss-20b",
    "--vllm-exec-modes", "NR", "NR_REASONING",
    "--output-dir", "results/custom_test",
]

subprocess.run(cmd, check=True)
```
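Passing `check=True` makes a failed benchmark raise `subprocess.CalledProcessError` instead of failing silently; the same flags can equally be passed to `router-bench` directly on the command line.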
### Plotting Results
```bash
# Generate plots from benchmark results
bench-plot --router-dir results/router_mmlu \
           --vllm-dir results/vllm_mmlu \
           --output-dir results/plots \
           --dataset-name "MMLU-Pro"
```
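For ad-hoc analysis beyond the bundled plots, the per-question CSVs load directly into pandas. A sketch under stated assumptions: the column names `category` and `is_correct` are illustrative guesses, so check the header row of your own `detailed_results.csv` first:

```python
import pandas as pd

# Load the per-question results written by a benchmark run
df = pd.read_csv("results/router_mmlu/detailed_results.csv")

# Inspect the actual schema before aggregating
print(df.columns.tolist())

# Hypothetical aggregation, assuming per-category correctness columns exist
if {"category", "is_correct"}.issubset(df.columns):
    print(df.groupby("category")["is_correct"].mean())
```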
## 📈 Research Output
The benchmark generates research-ready outputs:
- **CSV Files**: Detailed per-question results and aggregated metrics
- **Master CSV**: Combined results across all test runs
- **Plots**: Accuracy and token usage comparisons
- **Summary Reports**: Markdown reports with key findings
### Example Output Structure
```
results/
├── research_results_master.csv     # Main research data
├── comparison_20250115_143022/
│   ├── router_mmlu/
│   │   └── detailed_results.csv
│   ├── vllm_mmlu/
│   │   └── detailed_results.csv
│   ├── plots/
│   │   ├── accuracy_comparison.png
│   │   └── token_usage_comparison.png
│   └── RESEARCH_SUMMARY.md
```
## 🛠️ Development
### Local Installation
```bash
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/bench
pip install -e ".[dev]"
```
### Adding New Datasets
1. Create a new dataset implementation in `dataset_implementations/`
2. Inherit from `DatasetInterface`
3. Register in `dataset_factory.py`
4. Add tests and documentation
```python
from vllm_semantic_router_bench import DatasetInterface, Question, DatasetInfo

class MyDataset(DatasetInterface):
    def load_dataset(self, **kwargs):
        """Return a (questions, info) pair: a list of Question objects
        plus a DatasetInfo describing the dataset."""
        ...

    def format_prompt(self, question, style="plain"):
        """Render a single Question into a prompt string."""
        ...
```
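How step 3 looks depends on the actual shape of `dataset_factory.py`, which isn't shown here. A plausible sketch, assuming the factory consults a name-to-class registry (the `DATASET_REGISTRY` name and import path are hypothetical):

```python
# In dataset_factory.py (structure hypothetical, for illustration only):
# register the new class under the name users will pass to --dataset.
from dataset_implementations.my_dataset import MyDataset  # hypothetical path

DATASET_REGISTRY["my-dataset"] = MyDataset
```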
## 📋 Requirements
- Python 3.8+
- OpenAI API access (for model evaluation)
- Hugging Face account (for dataset access)
- 4GB+ RAM (for larger datasets)
### Dependencies
- `openai>=1.0.0` - OpenAI API client
- `datasets>=2.14.0` - Hugging Face datasets
- `pandas>=1.5.0` - Data manipulation
- `matplotlib>=3.5.0` - Plotting
- `seaborn>=0.11.0` - Advanced plotting
- `tqdm>=4.64.0` - Progress bars
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
### Common Contributions
- Adding new datasets
- Improving evaluation metrics
- Enhancing visualization
- Performance optimizations
- Documentation improvements
## 📄 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## 🔗 Links
- **Documentation**: https://vllm-semantic-router.com
- **GitHub**: https://github.com/vllm-project/semantic-router
- **Issues**: https://github.com/vllm-project/semantic-router/issues
- **PyPI**: https://pypi.org/project/vllm-semantic-router-bench/
## 📞 Support
- **GitHub Issues**: Bug reports and feature requests
- **Documentation**: Comprehensive guides and API reference
- **Community**: Join our discussions and get help from other users
---
**Made with ❤️ by the vLLM Semantic Router Team**