# LLM Katan - Lightweight LLM Server for Testing
A lightweight LLM serving package using FastAPI and HuggingFace transformers,
designed for testing and development with real tiny models.
> **🎬 [See Live Demo](https://vllm-project.github.io/semantic-router/e2e-tests/llm-katan/terminal-demo.html)**
> Interactive terminal showing multi-instance setup in action!
## Features
- 🚀 **FastAPI-based**: High-performance async web server
- 🤗 **HuggingFace Integration**: Real model inference with transformers
- ⚡ **Tiny Models**: Ultra-lightweight models for fast testing (Qwen3-0.6B, etc.)
- 🔄 **Multi-Instance**: Run the same model on different ports under different names
- 🎯 **OpenAI Compatible**: Drop-in replacement for OpenAI API endpoints
- 📦 **PyPI Ready**: Easy installation and distribution
- 🛠️ **vLLM Support**: Optional vLLM backend for production-like performance
## Quick Start
### Installation
#### Option 1: PyPI
```bash
pip install llm-katan
```
#### Option 2: Docker
```bash
# Pull and run the latest Docker image
docker pull ghcr.io/vllm-project/semantic-router/llm-katan:latest
docker run -p 8000:8000 ghcr.io/vllm-project/semantic-router/llm-katan:latest
# Or with a custom served model name
docker run -p 8000:8000 ghcr.io/vllm-project/semantic-router/llm-katan:latest \
  llm-katan --model Qwen/Qwen3-0.6B --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```
### Setup
#### HuggingFace Token (Required)
LLM Katan uses HuggingFace transformers to download models.
You'll need a HuggingFace token for:
- Private models
- Avoiding rate limits
- Reliable model downloads
#### Option 1: Environment Variable
```bash
export HUGGINGFACE_HUB_TOKEN="your_token_here"
```
#### Option 2: Login via CLI
```bash
huggingface-cli login
```
#### Option 3: Token file in home directory
```bash
# Create ~/.cache/huggingface/token file with your token
echo "your_token_here" > ~/.cache/huggingface/token
```
**Get your token:**
Visit [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
### Basic Usage
```bash
# Start server with a tiny model
llm-katan --model Qwen/Qwen3-0.6B --port 8000
# Start with custom served model name
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# With vLLM backend (optional)
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --backend vllm
```
### Multi-Instance Testing
**🎬 [Live Demo](https://vllm-project.github.io/semantic-router/e2e-tests/llm-katan/terminal-demo.html)**
See this in action with animated terminals!
> *Note: If GitHub Pages isn't enabled, you can also
> [download and open the demo locally](./terminal-demo.html)*
<!-- markdownlint-disable MD033 -->
<details>
<summary>📺 Preview (click to expand)</summary>
<!-- markdownlint-enable MD033 -->
```bash
# Terminal 1: Installing and starting GPT-3.5-Turbo mock
$ pip install llm-katan
Successfully installed llm-katan-0.1.8
$ llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: gpt-3.5-turbo
✅ Server running on http://0.0.0.0:8000
# Terminal 2: Starting Claude-3-Haiku mock
$ llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"
🚀 Starting LLM Katan server with model: Qwen/Qwen3-0.6B
📛 Served model name: claude-3-haiku
✅ Server running on http://0.0.0.0:8001
# Terminal 3: Testing both endpoints
$ curl localhost:8000/v1/models | jq '.data[0].id'
"gpt-3.5-turbo"
$ curl localhost:8001/v1/models | jq '.data[0].id'
"claude-3-haiku"
# Same tiny model, different API names! 🎯
```
</details>
```bash
# Terminal 1: Mock GPT-3.5-Turbo
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --served-model-name "gpt-3.5-turbo"
# Terminal 2: Mock Claude-3-Haiku
llm-katan --model Qwen/Qwen3-0.6B --port 8001 --served-model-name "claude-3-haiku"
# Terminal 3: Test both endpoints
curl http://localhost:8000/v1/models # Returns "gpt-3.5-turbo"
curl http://localhost:8001/v1/models # Returns "claude-3-haiku"
```
**Perfect for testing multi-provider scenarios with one tiny model!**
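For example, a test script can drive both mock providers through the official `openai` Python client just by switching `base_url`. This is a minimal sketch, assuming the two servers started above are still running and that the local test server does not validate API keys (the key below is a placeholder):

```python
from openai import OpenAI

# Port-to-name mapping from the two terminals above.
mock_providers = {
    "gpt-3.5-turbo": "http://localhost:8000/v1",
    "claude-3-haiku": "http://localhost:8001/v1",
}

for name, base_url in mock_providers.items():
    # Placeholder key; assumes the local test server does not check it.
    client = OpenAI(base_url=base_url, api_key="sk-test")

    # Each instance reports its served model name via /v1/models.
    served = client.models.list().data[0].id
    print(f"{base_url} serves {served}")

    reply = client.chat.completions.create(
        model=name,
        messages=[{"role": "user", "content": "Say hello in one word."}],
        max_tokens=16,
    )
    print(reply.choices[0].message.content)
```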
## API Endpoints
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (OpenAI compatible)
### Example API Usage
```bash
# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2-0.5B-Instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 50,
"temperature": 0.7
}'
# Creative writing example
curl -X POST http://127.0.0.1:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"messages": [
{"role": "user", "content": "Write a short poem about coding"}
],
"max_tokens": 100,
"temperature": 0.8
}'
# Check available models
curl http://127.0.0.1:8000/v1/models
# Health check
curl http://127.0.0.1:8000/health
```
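Because the endpoints follow the OpenAI schema, the same requests also work through the `openai` Python client. Here is a minimal sketch mirroring the first curl call above, assuming a server is listening on port 8000 and does not enforce API keys:

```python
from openai import OpenAI

# Point a standard OpenAI client at the local llm-katan instance.
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="sk-test",  # placeholder; assumes no key validation on the test server
)

response = client.chat.completions.create(
    model="Qwen/Qwen2-0.5B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=50,
    temperature=0.7,
)

print(response.choices[0].message.content)
```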
## Use Cases
### Strengths
- **Fastest time-to-test**: 30 seconds from install to running
- **Minimal resource footprint**: Designed for tiny models and efficient testing
- **No GPU required**: Runs on laptops, Macs, and any CPU-only environment
- **CI/CD integration friendly**: Lightweight and automation-ready
- **Multiple instances**: Run the same model under different names on different ports
### Ideal For
- **Automated testing pipelines**: Quick LLM endpoint setup for test suites (see the pytest sketch after this list)
- **Development environment mocking**: Real inference without production overhead
- **Quick prototyping**: Fast iteration with actual model behavior
- **Educational/learning scenarios**: Easy setup for AI development learning
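A sketch of the automated-testing use case with `pytest` and `requests` (both assumed to be installed separately; the fixture names, timeouts, and model choice are illustrative and not part of llm-katan):

```python
# test_llm_katan.py -- illustrative pytest sketch, not shipped with llm-katan
import subprocess
import time

import pytest
import requests

BASE_URL = "http://127.0.0.1:8000"


@pytest.fixture(scope="session")
def llm_server():
    """Start llm-katan for the test session and wait for /health."""
    proc = subprocess.Popen(
        ["llm-katan", "--model", "Qwen/Qwen3-0.6B", "--port", "8000"]
    )
    try:
        for _ in range(120):  # first run may also need to download the model
            try:
                if requests.get(f"{BASE_URL}/health", timeout=1).ok:
                    break
            except requests.ConnectionError:
                pass
            time.sleep(1)
        yield BASE_URL
    finally:
        proc.terminate()
        proc.wait()


def test_chat_completion(llm_server):
    resp = requests.post(
        f"{llm_server}/v1/chat/completions",
        json={
            "model": "Qwen/Qwen3-0.6B",
            "messages": [{"role": "user", "content": "Say hello"}],
            "max_tokens": 16,
        },
        timeout=120,
    )
    assert resp.status_code == 200
    assert resp.json()["choices"][0]["message"]["content"]
```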
### Not Ideal For
- **Production workloads**: Use Ollama or vLLM for production deployments
- **Large model serving**: Designed for tiny models (< 1B parameters)
- **Complex multi-agent workflows**: Use Semantic Kernel or similar frameworks
- **High-performance inference**: Use vLLM or specialized serving solutions
## Configuration
### Command Line Options
```bash
# All available options
llm-katan [OPTIONS]
Required:
  -m, --model TEXT                      Model name to load (e.g., 'Qwen/Qwen3-0.6B') [required]

Optional:
  -n, --name, --served-model-name TEXT
                                        Model name to serve via API (defaults to model name)
  -p, --port INTEGER                    Port to serve on (default: 8000)
  -h, --host TEXT                       Host to bind to (default: 0.0.0.0)
  -b, --backend [transformers|vllm]     Backend to use (default: transformers)
  --max, --max-tokens INTEGER           Maximum tokens to generate (default: 512)
  -t, --temperature FLOAT               Sampling temperature (default: 0.7)
  -d, --device [auto|cpu|cuda]          Device to use (default: auto)
  --log-level [debug|info|warning|error]
                                        Log level (default: INFO)
  --version                             Show version and exit
  --help                                Show help and exit
```
#### Advanced Usage Examples
```bash
# Custom generation settings
llm-katan --model Qwen/Qwen3-0.6B --max-tokens 1024 --temperature 0.9
# Force specific device
llm-katan --model Qwen/Qwen3-0.6B --device cpu --log-level debug
# Custom host and port
llm-katan --model Qwen/Qwen3-0.6B --host 127.0.0.1 --port 9000
# Multiple servers with different settings
llm-katan --model Qwen/Qwen3-0.6B --port 8000 --max-tokens 512 --temperature 0.1
llm-katan --model Qwen/Qwen3-0.6B --port 8001 \
--name "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --max-tokens 256 --temperature 0.9
```
### Environment Variables
- `LLM_KATAN_MODEL`: Default model to load
- `LLM_KATAN_PORT`: Default port (8000)
- `LLM_KATAN_BACKEND`: Backend type (transformers|vllm)
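A minimal sketch of launching a test instance from Python with the documented variables set; the assumption here is that `LLM_KATAN_MODEL`, `LLM_KATAN_PORT`, and `LLM_KATAN_BACKEND` supply defaults when the matching CLI flags are omitted:

```python
import os
import subprocess

# Documented variables; the values are examples only.
env = {
    **os.environ,
    "LLM_KATAN_MODEL": "Qwen/Qwen3-0.6B",
    "LLM_KATAN_PORT": "8001",
    "LLM_KATAN_BACKEND": "transformers",
}

# Launch llm-katan relying on the environment for its defaults.
server = subprocess.Popen(["llm-katan"], env=env)
```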
## Development
```bash
# Clone and install in development mode
git clone <repo>
cd e2e-tests/llm-katan
pip install -e .
# Run with development dependencies
pip install -e ".[dev]"
```
## License
Apache-2.0 License
## Contributing
Contributions welcome! Please see the main repository for guidelines.
---
*Part of the [semantic-router project ecosystem](https://vllm-semantic-router.com/)*