cascadeflow

- **Version:** 0.5.0 (PyPI, released 2025-11-07)
- **Summary:** Smart AI model cascading for cost optimization - Save 40-85% on LLM costs with 2-6x faster responses. Available for Python and TypeScript/JavaScript.
- **Requires:** Python >=3.9
- **License:** MIT
- **Dependencies:** pydantic, httpx, tiktoken, rich
- **Keywords:** ai, llm, cost-optimization, model-routing, cascade, inference, openai, anthropic, gpt, claude, machine-learning, groq, typescript, javascript, browser, edge-functions
            <div align="center">

<picture>
  <source media="(prefers-color-scheme: dark)" srcset=".github/assets/CF_logo_bright.svg">
  <source media="(prefers-color-scheme: light)" srcset=".github/assets/CF_logo_dark.svg">
  <img alt="cascadeflow Logo" src=".github/assets/CF_logo_dark.svg" width="533">
</picture>

# Smart AI model cascading for cost optimization

[![PyPI version](https://img.shields.io/pypi/v/cascadeflow?color=blue&label=Python)](https://pypi.org/project/cascadeflow/)
[![npm version](https://img.shields.io/npm/v/@cascadeflow/core?color=red&label=TypeScript)](https://www.npmjs.com/package/@cascadeflow/core)
[![n8n version](https://img.shields.io/npm/v/@cascadeflow/n8n-nodes-cascadeflow?color=orange&label=n8n)](https://www.npmjs.com/package/@cascadeflow/n8n-nodes-cascadeflow)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](./LICENSE)
[![Downloads](https://static.pepy.tech/badge/cascadeflow)](https://pepy.tech/project/cascadeflow)
[![Tests](https://github.com/lemony-ai/cascadeflow/actions/workflows/test.yml/badge.svg)](https://github.com/lemony-ai/cascadeflow/actions/workflows/test.yml)
[![Python Docs](https://img.shields.io/badge/docs-Python-blue)](./docs/)
[![TypeScript Docs](https://img.shields.io/badge/docs-TypeScript-red)](./docs/)
[![X Follow](https://img.shields.io/twitter/follow/saschabuehrle?style=social)](https://x.com/saschabuehrle)
[![GitHub Stars](https://img.shields.io/github/stars/lemony-ai/cascadeflow?style=social)](https://github.com/lemony-ai/cascadeflow)

**[<img src=".github/assets/CF_python_color.svg" width="22" height="22" alt="Python" style="vertical-align: middle;"/> Python](#-python) â€ĸ [<img src=".github/assets/CF_ts_color.svg" width="22" height="22" alt="TypeScript" style="vertical-align: middle;"/> TypeScript](#-typescript) â€ĸ [<img src=".github/assets/CF_n8n_color.svg" width="22" height="22" alt="n8n" style="vertical-align: middle;"/> n8n](#-n8n-integration) â€ĸ [📖 Docs](./docs/) â€ĸ [💡 Examples](#examples)**

</div>

---

**Stop Bleeding Money on AI Calls. Cut Costs 30-65% in 3 Lines of Code.**

40-70% of text prompts and 20-60% of agent calls don't need expensive flagship models. You're overpaying every single day.

*cascadeflow fixes this with intelligent model cascading, available in Python and TypeScript.*

```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

---

## Why cascadeflow?

cascadeflow is an intelligent AI model cascading library that dynamically selects the optimal model for each query or tool call through speculative execution. It is built on research showing that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. For the remaining queries that need advanced reasoning, cascadeflow automatically escalates to flagship models.

### Use Cases

Use cascadeflow for:

- **Cost Optimization.** Reduce API costs by 40-85% through intelligent model cascading and speculative execution with automatic per-query cost tracking.
- **Cost Control and Transparency.** Built-in telemetry for query, model, and provider-level cost tracking with configurable budget limits and programmable spending caps.
- **Low Latency & Speed Optimization.** Sub-2ms framework overhead with fast provider routing (Groq sub-50ms). Cascade simple queries to fast models while reserving expensive models for complex reasoning, achieving 2-10x latency reduction overall (use the `PRESET_ULTRA_FAST` preset).
- **Multi-Provider Flexibility.** Unified API across **`OpenAI`, `Anthropic`, `Groq`, `Ollama`, `vLLM`, `Together`, and `Hugging Face`** with automatic provider detection and zero vendor lock-in. Optional **`LiteLLM`** integration for 100+ additional providers.
- **Edge & Local-Hosted AI Deployment.** Get the best of both worlds: handle most queries with local models (vLLM, Ollama), then automatically escalate complex queries to cloud providers only when needed (see the sketch below).

> **â„šī¸ Note:** SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. [Research paper](https://www.researchgate.net/publication/392371267_Small_Language_Models_are_the_Future_of_Agentic_AI)

---

## How cascadeflow Works

cascadeflow uses **speculative execution with quality validation**:

1. **Speculatively executes** small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
2. **Validates quality** of responses using configurable thresholds (completeness, confidence, correctness)
3. **Dynamically escalates** to larger models only when quality validation fails ($1.25-3.00/1M tokens)
4. **Learns patterns** to optimize future cascading decisions and domain specific routing

Zero configuration. Works with YOUR existing models (7 providers currently supported).

In practice, 60-70% of queries are handled by small, efficient models (an 8-20x cost difference) without requiring escalation.

**Result:** 40-85% cost reduction, 2-10x faster responses, zero quality loss.
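Conceptually, the whole pipeline reduces to a short loop. The sketch below is illustrative pseudocode, not the library's internals; `complete` and `validate` are hypothetical stand-ins for the provider call and the quality checks:

```python
# Illustrative cascade loop: speculative cheap attempts, escalate on failure.
async def cascade(query, models, validate):
    """Try models cheapest-first; return the first draft that passes validation."""
    for model in models[:-1]:
        draft = await model.complete(query)   # speculative execution on a cheap model (hypothetical API)
        if validate(draft):                   # completeness/confidence/correctness thresholds
            return draft                      # most queries stop here
    return await models[-1].complete(query)   # escalate to the flagship model
```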

```
┌─────────────────────────────────────────────────────────────┐
│                      cascadeflow Stack                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascade Agent                                        │  │
│  │                                                       │  │
│  │  Orchestrates the entire cascade execution            │  │
│  │  • Query routing & model selection                    │  │
│  │  • Drafter -> Verifier coordination                   │  │
│  │  • Cost tracking & telemetry                          │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Domain Pipeline                                      │  │
│  │                                                       │  │
│  │  Automatic domain classification                      │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)      │  │
│  │  • Optional ML semantic classification                │  │
│  │  • Domain-optimized pipelines & model selection       │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Quality Validation Engine                            │  │
│  │                                                       │  │
│  │  Multi-dimensional quality checks                     │  │
│  │  • Length validation (too short/verbose)              │  │
│  │  • Confidence scoring (logprobs analysis)             │  │
│  │  • Format validation (JSON, structured output)        │  │
│  │  • Semantic alignment (intent matching)               │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascading Engine (<2ms overhead)                     │  │
│  │                                                       │  │
│  │  Smart model escalation strategy                      │  │
│  │  • Try cheap models first (speculative execution)     │  │
│  │  • Validate quality instantly                         │  │
│  │  • Escalate only when needed                          │  │
│  │  • Automatic retry & fallback                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Provider Abstraction Layer                           │  │
│  │                                                       │  │
│  │  Unified interface for 7+ providers                   │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                 │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM            │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

---

## Quick Start

### <img src=".github/assets/CF_python_color.svg" width="24" height="24" alt="Python"/> Python

```bash
pip install cascadeflow[all]
```

```python
from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),         # Verifier model (~$5.62/1M tokens)
])

# Run query - automatically routes to optimal model
result = await agent.run("What's the capital of France?")

print(f"Answer: {result.content}")
print(f"Model used: {result.model_used}")
print(f"Cost: ${result.total_cost:.6f}")
```
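Note that `agent.run` is a coroutine, so outside of an async context (notebook, FastAPI handler, etc.) wrap the call in a standard `asyncio` entry point:

```python
import asyncio

async def main():
    result = await agent.run("What's the capital of France?")
    print(result.content)

asyncio.run(main())
```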

<details>
<summary><b>💡 Optional: Use ML-based Semantic Quality Validation</b></summary>

For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.

**Step 1:** Install the optional ML package:

```bash
pip install cascadeflow[ml]  # Adds semantic similarity via FastEmbed (~80MB model)
```

**Step 2:** Use semantic quality validation:

```python
from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize semantic checker (downloads model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7     # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."

result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")
```

**What you get:**
- 🎯 Semantic similarity scoring (query ↔ response alignment)
- 🛡️ Optional toxicity detection
- 🔄 Automatic model download and caching
- 🚀 Fast inference (~100ms per check)

**Full example:** See [semantic_quality_domain_detection.py](./examples/semantic_quality_domain_detection.py)

</details>

> **âš ī¸ GPT-5 Note:** GPT-5 streaming requires organization verification. Non-streaming works for all users. [Verify here](https://platform.openai.com/settings/organization/general) if needed (~15 min). Basic cascadeflow examples work without - GPT-5 is only called when needed (typically 20-30% of requests).

📖 **Learn more:** [Python Documentation](./docs/README.md) | [Quickstart Guide](./docs/guides/quickstart.md) | [Providers Guide](./docs/guides/providers.md)

### <img src=".github/assets/CF_ts_color.svg" width="24" height="24" alt="TypeScript"/> TypeScript

```bash
npm install @cascadeflow/core
```

```tsx
import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);
```

<details>
<summary><b>💡 Optional: ML-based Semantic Quality Validation</b></summary>

For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.

**Step 1:** Install the optional ML packages:

```bash
npm install @cascadeflow/ml @xenova/transformers
```

**Step 2:** Enable semantic validation in your cascade:

```tsx
import { CascadeAgent } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,                    // Traditional confidence threshold
    requireMinimumTokens: 5,            // Minimum response length
    useSemanticValidation: true,        // Enable ML validation
    semanticThreshold: 0.5,             // 50% minimum similarity
  },
});

// Responses now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');
```

**Step 3:** Or use semantic validation directly:

```tsx
import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();

if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );

  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}
```

**What you get:**
- 🎯 Query-response semantic alignment detection
- 🚫 Off-topic response filtering
- 📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
- ⚡ Fast CPU inference (~50-100ms with caching)
- 🔄 Request-scoped caching (50% latency reduction)
- 🌐 Works in Node.js, Browser, and Edge Functions

**Example:** [semantic-quality.ts](./packages/core/examples/nodejs/semantic-quality.ts)

</details>

📖 **Learn more:** [TypeScript Documentation](./packages/core/) | [Quickstart Guide](./docs/guides/quickstart-typescript.md) | [Node.js Examples](./packages/core/examples/nodejs/) | [Browser/Edge Guide](./docs/guides/browser_cascading.md)

### 🔄 Migration Example

**Migrate in 5 minutes from a direct provider integration to automatic cost savings with full cost control and transparency.**

#### Before (Standard Approach)

Cost: $0.000113, Latency: 850ms

```python
# Using an expensive model for everything
import openai  # assumes OPENAI_API_KEY is set in the environment

result = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}],
)
```

#### After (With cascadeflow)

Cost: $0.000007, Latency: 234ms

```python
from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig(name="gpt-4o-mini", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])

result = await agent.run("What's 2+2?")
```

> **đŸ”Ĩ Saved:** $0.000106 (94% reduction), 3.6x faster
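(For the math: savings = ($0.000113 - $0.000007) / $0.000113 ≈ 93.8%, rounded to 94%; speedup = 850ms / 234ms ≈ 3.6x.)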

📊 **Learn more:** [Cost Tracking Guide](./docs/guides/cost_tracking.md) | [Production Best Practices](./docs/guides/production.md) | [Performance Optimization](./docs/guides/performance.md)

---

## <img src=".github/assets/CF_n8n_color.svg" width="24" height="24" alt="n8n"/> n8n Integration

Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!

### Installation

1. Open n8n
2. Go to **Settings** → **Community Nodes**
3. Search for: `@cascadeflow/n8n-nodes-cascadeflow`
4. Click **Install**

### Quick Start

CascadeFlow is a **Language Model sub-node** that connects two AI Chat Model nodes (drafter + verifier) and intelligently cascades between them:

**Setup:**
1. Add two **AI Chat Model nodes** (cheap drafter + powerful verifier)
2. Add **CascadeFlow node** and connect both models
3. Connect CascadeFlow to **Basic LLM Chain** or **Chain** nodes
4. Check **Logs tab** to see cascade decisions in real-time!

**Result:** 40-85% cost savings in your n8n workflows!

**Features:**

- ✅ Works with any AI Chat Model node (OpenAI, Anthropic, Ollama, Azure, etc.)
- ✅ Mix providers (e.g., Ollama drafter + GPT-4o verifier)
- ✅ Real-time flow visualization in Logs tab
- ✅ Detailed metrics: confidence scores, latency, cost savings



🔌 **Learn more:** [n8n Integration Guide](./packages/integrations/n8n/) | [n8n Documentation](./docs/guides/n8n_integration.md)

---

## Resources

### Examples

**<img src=".github/assets/CF_python_color.svg" width="20" height="20" alt="Python" style="vertical-align: middle;"/> Python Examples:**

<details open>
<summary><b>Basic Examples</b> - Get started quickly</summary>

| Example | Description | Link |
|---------|-------------|------|
| **Basic Usage** | Simple cascade setup with OpenAI models | [View](./examples/basic_usage.py) |
| **Preset Usage** | Use built-in presets for quick setup | [View](./docs/guides/presets.md) |
| **Multi-Provider** | Mix multiple AI providers in one cascade | [View](./examples/multi_provider.py) |
| **Reasoning Models**  | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | [View](./examples/reasoning_models.py) |
| **Tool Execution** | Function calling and tool usage | [View](./examples/tool_execution.py) |
| **Streaming Text** | Stream responses from cascade agents | [View](./examples/streaming_text.py) |
| **Cost Tracking** | Track and analyze costs across queries | [View](./examples/cost_tracking.py) |

</details>

<details>
<summary><b>Advanced Examples</b> - Production & customization</summary>

| Example | Description | Link |
|---------|-------------|------|
| **Production Patterns** | Best practices for production deployments | [View](./examples/production_patterns.py) |
| **FastAPI Integration** | Integrate cascades with FastAPI | [View](./examples/fastapi_integration.py) |
| **Streaming Tools** | Stream tool calls and responses | [View](./examples/streaming_tools.py) |
| **Batch Processing** | Process multiple queries efficiently | [View](./examples/batch_processing.py) |
| **Multi-Step Cascade** | Build complex multi-step cascades | [View](./examples/multi_step_cascade.py) |
| **Edge Device** | Run cascades on edge devices with local models | [View](./examples/edge_device.py) |
| **vLLM Example** | Use vLLM for local model deployment | [View](./examples/vllm_example.py) |
| **Multi-Instance Ollama** | Run draft/verifier on separate Ollama instances | [View](./examples/multi_instance_ollama.py) |
| **Multi-Instance vLLM** | Run draft/verifier on separate vLLM instances | [View](./examples/multi_instance_vllm.py) |
| **Custom Cascade** | Build custom cascade strategies | [View](./examples/custom_cascade.py) |
| **Custom Validation** | Implement custom quality validators | [View](./examples/custom_validation.py) |
| **User Budget Tracking** | Per-user budget enforcement and tracking | [View](./examples/user_budget_tracking.py) |
| **User Profile Usage** | User-specific routing and configurations | [View](./examples/user_profile_usage.py) |
| **Rate Limiting** | Implement rate limiting for cascades | [View](./examples/rate_limiting_usage.py) |
| **Guardrails** | Add safety and content guardrails | [View](./examples/guardrails_usage.py) |
| **Cost Forecasting** | Forecast costs and detect anomalies | [View](./examples/cost_forecasting_anomaly_detection.py) |
| **Semantic Quality Detection** | ML-based domain and quality detection | [View](./examples/semantic_quality_domain_detection.py) |
| **Profile Database Integration** | Integrate user profiles with databases | [View](./examples/profile_database_integration.py) |

</details>

**<img src=".github/assets/CF_ts_color.svg" width="20" height="20" alt="TypeScript" style="vertical-align: middle;"/> TypeScript Examples:**

<details open>
<summary><b>Basic Examples</b> - Get started quickly</summary>

| Example | Description | Link |
|---------|-------------|------|
| **Basic Usage** | Simple cascade setup (Node.js) | [View](./packages/core/examples/nodejs/basic-usage.ts) |
| **Tool Calling** | Function calling with tools (Node.js) | [View](./packages/core/examples/nodejs/tool-calling.ts) |
| **Multi-Provider** | Mix providers in TypeScript (Node.js) | [View](./packages/core/examples/nodejs/multi-provider.ts) |
| **Reasoning Models**  | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | [View](./packages/core/examples/nodejs/reasoning-models.ts) |
| **Cost Tracking** | Track and analyze costs across queries | [View](./packages/core/examples/nodejs/cost-tracking.ts) |
| **Semantic Quality**  | ML-based semantic validation with embeddings | [View](./packages/core/examples/nodejs/semantic-quality.ts) |
| **Streaming** | Stream responses in TypeScript | [View](./packages/core/examples/streaming.ts) |

</details>

<details>
<summary><b>Advanced Examples</b> - Production & edge deployment</summary>

| Example | Description | Link |
|---------|-------------|------|
| **Production Patterns** | Production best practices (Node.js) | [View](./packages/core/examples/nodejs/production-patterns.ts) |
| **Multi-Instance Ollama** | Run draft/verifier on separate Ollama instances | [View](./packages/core/examples/nodejs/multi-instance-ollama.ts) |
| **Multi-Instance vLLM** | Run draft/verifier on separate vLLM instances | [View](./packages/core/examples/nodejs/multi-instance-vllm.ts) |
| **Browser/Edge** | Vercel Edge runtime example | [View](./packages/core/examples/browser/vercel-edge/) |

</details>

📂 **[View All Python Examples →](./examples/)** | **[View All TypeScript Examples →](./packages/core/examples/)**

### Documentation

<details open>
<summary><b>Getting Started</b> - Core concepts and basics</summary>

| Guide | Description | Link |
|-------|-------------|------|
| **Quickstart** | Get started with cascadeflow in 5 minutes | [Read](./docs/guides/quickstart.md) |
| **Providers Guide** | Configure and use different AI providers | [Read](./docs/guides/providers.md) |
| **Presets Guide** | Using and creating custom presets | [Read](./docs/guides/presets.md) |
| **Streaming Guide** | Stream responses from cascade agents | [Read](./docs/guides/streaming.md) |
| **Tools Guide** | Function calling and tool usage | [Read](./docs/guides/tools.md) |
| **Cost Tracking** | Track and analyze API costs | [Read](./docs/guides/cost_tracking.md) |

</details>

<details>
<summary><b>Advanced Topics</b> - Production, customization & integrations</summary>

| Guide | Description | Link |
|-------|-------------|------|
| **Production Guide** | Best practices for production deployments | [Read](./docs/guides/production.md) |
| **Performance Guide** | Optimize cascade performance and latency | [Read](./docs/guides/performance.md) |
| **Custom Cascade** | Build custom cascade strategies | [Read](./docs/guides/custom_cascade.md) |
| **Custom Validation** | Implement custom quality validators | [Read](./docs/guides/custom_validation.md) |
| **Edge Device** | Deploy cascades on edge devices | [Read](./docs/guides/edge_device.md) |
| **Browser Cascading** | Run cascades in the browser/edge | [Read](./docs/guides/browser_cascading.md) |
| **FastAPI Integration** | Integrate with FastAPI applications | [Read](./docs/guides/fastapi.md) |
| **n8n Integration** | Use cascadeflow in n8n workflows | [Read](./docs/guides/n8n_integration.md) |

</details>

📚 **[View All Documentation →](./docs/)**

---

## Features

| **Feature** | **Benefit** |
| --- | --- |
| 🎯 **Speculative Cascading** | Tries cheap models first, escalates intelligently |
| 💰 **40-85% Cost Savings** | Research-backed, proven in production |
| ⚡ **2-10x Faster** | Small models respond in <50ms vs 500-2000ms |
| ⚡ **Low Latency** | Sub-2ms framework overhead, negligible performance impact |
| 🔄 **Mix Any Providers** | OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional) |
| 👤 **User Profile System** | Per-user budgets, tier-aware routing, enforcement callbacks |
| ✅ **Quality Validation** | Automatic checks + semantic similarity (optional ML, ~80MB, CPU) |
| 🎨 **Cascading Policies** | Domain-specific pipelines, multi-step validation strategies |
| 🧠 **Domain Understanding** | Auto-detects code/medical/legal/math/structured data, routes to specialists |
| 🤖 **Drafter/Validator Pattern** | 20-60% savings for agent/tool systems |
| 🔧 **Tool Calling Support** | Universal format, works across all providers |
| 📊 **Cost Tracking** | Built-in analytics + OpenTelemetry export (vendor-neutral) |
| 🚀 **3-Line Integration** | Zero architecture changes needed |
| 🏭 **Production Ready** | Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection |

---

## License

MIT License; see the [LICENSE](https://github.com/lemony-ai/cascadeflow/blob/main/LICENSE) file.

Free for commercial use. Attribution appreciated but not required.

---

## Contributing

We â¤ī¸ contributions!

📝 [**Contributing Guide**](./CONTRIBUTING.md) - Python & TypeScript development setup

---

## Roadmap

- **Cascade Profiler** - Analyzes your AI API logs to calculate cost savings potential and generate optimized cascadeflow configurations automatically
- **User Tier Management** - Cost controls and limits per user tier with advanced routing
- **Semantic Quality Validators** - Optional lightweight local quality scoring (200MB CPU model, no external API calls)
- **Code Complexity Detection** - Dynamic cascading based on task complexity analysis
- **Domain Aware Cascading** - Multi-stage pipelines tailored to specific domains
- **Benchmark Reports** - Automated performance and cost benchmarking

---

## Support

- 📖 [**GitHub Discussions**](https://github.com/lemony-ai/cascadeflow/discussions) - Searchable Q&A
- 🐛 [**GitHub Issues**](https://github.com/lemony-ai/cascadeflow/issues) - Bug reports & feature requests
- 📧 [**Email Support**](mailto:hello@lemony.ai) - Direct support

---

## Citation

If you use cascadeflow in your research or project, please cite:

```bibtex
@software{cascadeflow2025,
  author = {{Lemony Inc.} and Sascha Buehrle and {Contributors}},
  title = {cascadeflow: Smart AI model cascading for cost optimization},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}
```

**Ready to cut your AI costs by 40-85%?**

```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

[Read the Docs](./docs/) â€ĸ [View Python Examples](./examples/) â€ĸ [View TypeScript Examples](./packages/core/examples/) â€ĸ [Join Discussions](https://github.com/lemony-ai/cascadeflow/discussions)

---

## About

**Built with â¤ī¸ by [Lemony Inc.](https://lemony.ai/) and the cascadeflow Community**

One cascade. Hundreds of specialists.

New York | Zurich

**⭐ Star us on GitHub if cascadeflow helps you save money!**

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "cascadeflow",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "\"Lemony Inc.\" <hello@lemony.ai>",
    "keywords": "ai, llm, cost-optimization, model-routing, cascade, inference, openai, anthropic, gpt, claude, machine-learning, groq, typescript, javascript, browser, edge-functions",
    "author": null,
    "author_email": "\"Lemony Inc.\" <hello@lemony.ai>",
    "download_url": "https://files.pythonhosted.org/packages/b6/85/1a89e5cfdd6ec94ea8d228760aa31d1e96851947408f7e24684c7cc28f17/cascadeflow-0.5.0.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n\n<picture>\n  <source media=\"(prefers-color-scheme: dark)\" srcset=\".github/assets/CF_logo_bright.svg\">\n  <source media=\"(prefers-color-scheme: light)\" srcset=\".github/assets/CF_logo_dark.svg\">\n  <img alt=\"cascadeflow Logo\" src=\".github/assets/CF_logo_dark.svg\" width=\"533\">\n</picture>\n\n# Smart AI model cascading for cost optimization\n\n[![PyPI version](https://img.shields.io/pypi/v/cascadeflow?color=blue&label=Python)](https://pypi.org/project/cascadeflow/)\n[![npm version](https://img.shields.io/npm/v/@cascadeflow/core?color=red&label=TypeScript)](https://www.npmjs.com/package/@cascadeflow/core)\n[![n8n version](https://img.shields.io/npm/v/@cascadeflow/n8n-nodes-cascadeflow?color=orange&label=n8n)](https://www.npmjs.com/package/@cascadeflow/n8n-nodes-cascadeflow)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](./LICENSE)\n[![Downloads](https://static.pepy.tech/badge/cascadeflow)](https://pepy.tech/project/cascadeflow)\n[![Tests](https://github.com/lemony-ai/cascadeflow/actions/workflows/test.yml/badge.svg)](https://github.com/lemony-ai/cascadeflow/actions/workflows/test.yml)\n[![Python Docs](https://img.shields.io/badge/docs-Python-blue)](./docs/)\n[![TypeScript Docs](https://img.shields.io/badge/docs-TypeScript-red)](./docs/)\n[![X Follow](https://img.shields.io/twitter/follow/saschabuehrle?style=social)](https://x.com/saschabuehrle)\n[![GitHub Stars](https://img.shields.io/github/stars/lemony-ai/cascadeflow?style=social)](https://github.com/lemony-ai/cascadeflow)\n\n**[<img src=\".github/assets/CF_python_color.svg\" width=\"22\" height=\"22\" alt=\"Python\" style=\"vertical-align: middle;\"/> Python](#-python) \u2022 [<img src=\".github/assets/CF_ts_color.svg\" width=\"22\" height=\"22\" alt=\"TypeScript\" style=\"vertical-align: middle;\"/> TypeScript](#-typescript) \u2022 [<img src=\".github/assets/CF_n8n_color.svg\" width=\"22\" height=\"22\" alt=\"n8n\" style=\"vertical-align: middle;\"/> n8n](#-n8n-integration) \u2022 [\ud83d\udcd6 Docs](./docs/) \u2022 [\ud83d\udca1 Examples](#examples)**\n\n</div>\n\n---\n\n**Stop Bleeding Money on AI Calls. Cut Costs 30-65% in 3 Lines of Code.**\n\n40-70% of text prompts and 20-60% of agent calls don't need expensive flagship models. You're overpaying every single day.\n\n*cascadeflow fixes this with intelligent model cascading, available in Python and TypeScript.*\n\n```python\npip install cascadeflow\n```\n\n```tsx\nnpm install @cascadeflow/core\n```\n\n---\n\n## Why cascadeflow?\n\ncascadeflow is an intelligent AI model cascading library that dynamically selects the optimal model for each query or tool call through speculative execution. It's based on the research that 40-70% of queries don't require slow, expensive flagship models, and domain-specific smaller models often outperform large general-purpose models on specialized tasks. For the remaining queries that need advanced reasoning, cascadeflow automatically escalates to flagship models if needed.\n\n### Use Cases\n\nUse cascadeflow for:\n\n- **Cost Optimization.** Reduce API costs by 40-85% through intelligent model cascading and speculative execution with automatic per-query cost tracking.\n- **Cost Control and Transparency.** Built-in telemetry for query, model, and provider-level cost tracking with configurable budget limits and programmable spending caps.\n- **Low Latency & Speed Optimization**. Sub-2ms framework overhead with fast provider routing (Groq sub-50ms). 
Cascade simple queries to fast models while reserving expensive models for complex reasoning, achieving 2-10x latency reduction overall. (use preset `PRESET_ULTRA_FAST`)\n- **Multi-Provider Flexibility.** Unified API across **`OpenAI`, `Anthropic`, `Groq`, `Ollama`, `vLLM`, `Together`, and `Hugging Face`** with automatic provider detection and zero vendor lock-in. Optional **`LiteLLM`** integration for 100+ additional providers.\n- **Edge & Local-Hosted AI Deployment.** Use best of both worlds: handle most queries with local models (vLLM, Ollama), then automatically escalate complex queries to cloud providers only when needed.\n\n> **\u2139\ufe0f Note:** SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. [Research paper](https://www.researchgate.net/publication/392371267_Small_Language_Models_are_the_Future_of_Agentic_AI)\n\n---\n\n## How cascadeflow Works\n\ncascadeflow uses **speculative execution with quality validation**:\n\n1. **Speculatively executes** small, fast models first - optimistic execution ($0.15-0.30/1M tokens)\n2. **Validates quality** of responses using configurable thresholds (completeness, confidence, correctness)\n3. **Dynamically escalates** to larger models only when quality validation fails ($1.25-3.00/1M tokens)\n4. **Learns patterns** to optimize future cascading decisions and domain specific routing\n\nZero configuration. Works with YOUR existing models (7 Providers currently supported).\n\nIn practice, 60-70% of queries are handled by small, efficient models (8-20x cost difference) without requiring escalation\n\n**Result:** 40-85% cost reduction, 2-10x faster responses, zero quality loss.\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                      cascadeflow Stack                      \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502                                                             \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502  Cascade Agent                                        \u2502  \u2502\n\u2502  \u2502                                                       \u2502  \u2502\n\u2502  \u2502  Orchestrates the entire cascade execution            \u2502  \u2502\n\u2502  \u2502  \u2022 Query routing & model selection                    \u2502  \u2502\n\u2502  \u2502  \u2022 Drafter -> Verifier coordination                   \u2502  \u2502\n\u2502  \u2502  \u2022 Cost tracking & telemetry                          \u2502  \u2502\n\u2502  
\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2502                          \u2193                                  \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502  Domain Pipeline                                      \u2502  \u2502\n\u2502  \u2502                                                       \u2502  \u2502\n\u2502  \u2502  Automatic domain classification                      \u2502  \u2502\n\u2502  \u2502  \u2022 Rule-based detection (CODE, MATH, DATA, etc.)      \u2502  \u2502\n\u2502  \u2502  \u2022 Optional ML semantic classification                \u2502  \u2502\n\u2502  \u2502  \u2022 Domain-optimized pipelines & model selection       \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2502                          \u2193                                  \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502  Quality Validation Engine                            \u2502  \u2502\n\u2502  \u2502                                                       \u2502  \u2502\n\u2502  \u2502  Multi-dimensional quality checks                     \u2502  \u2502\n\u2502  \u2502  \u2022 Length validation (too short/verbose)              \u2502  \u2502\n\u2502  \u2502  \u2022 Confidence scoring (logprobs analysis)             \u2502  \u2502\n\u2502  \u2502  \u2022 Format validation (JSON, structured output)        \u2502  \u2502\n\u2502  \u2502  \u2022 Semantic alignment (intent matching)               \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2502                          \u2193                                  \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502  
Cascading Engine (<2ms overhead)                     \u2502  \u2502\n\u2502  \u2502                                                       \u2502  \u2502\n\u2502  \u2502  Smart model escalation strategy                      \u2502  \u2502\n\u2502  \u2502  \u2022 Try cheap models first (speculative execution)     \u2502  \u2502\n\u2502  \u2502  \u2022 Validate quality instantly                         \u2502  \u2502\n\u2502  \u2502  \u2022 Escalate only when needed                          \u2502  \u2502\n\u2502  \u2502  \u2022 Automatic retry & fallback                         \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2502                          \u2193                                  \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510  \u2502\n\u2502  \u2502  Provider Abstraction Layer                           \u2502  \u2502\n\u2502  \u2502                                                       \u2502  \u2502\n\u2502  \u2502  Unified interface for 7+ providers                   \u2502  \u2502\n\u2502  \u2502  \u2022 OpenAI \u2022 Anthropic \u2022 Groq \u2022 Ollama                 \u2502  \u2502\n\u2502  \u2502  \u2022 Together \u2022 vLLM \u2022 HuggingFace \u2022 LiteLLM            \u2502  \u2502\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518  \u2502\n\u2502                                                             \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n---\n\n## Quick Start\n\n### <img src=\".github/assets/CF_python_color.svg\" width=\"24\" height=\"24\" alt=\"Python\"/> Python\n\n```python\npip install cascadeflow[all]\n```\n\n```python\nfrom cascadeflow import CascadeAgent, ModelConfig\n\n# Define your cascade - try cheap model first, escalate if needed\nagent = CascadeAgent(models=[\n    ModelConfig(name=\"gpt-4o-mini\", provider=\"openai\", cost=0.000375),  # Draft model (~$0.375/1M tokens)\n    ModelConfig(name=\"gpt-5\", provider=\"openai\", cost=0.00562),         # Verifier model (~$5.62/1M tokens)\n])\n\n# Run query - automatically routes to optimal model\nresult = await agent.run(\"What's the capital of France?\")\n\nprint(f\"Answer: {result.content}\")\nprint(f\"Model used: {result.model_used}\")\nprint(f\"Cost: ${result.total_cost:.6f}\")\n```\n\n<details>\n<summary><b>\ud83d\udca1 Optional: 
Use ML-based Semantic Quality Validation</b></summary>\n\nFor advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.\n\n**Step 1:** Install the optional ML package:\n\n```bash\npip install cascadeflow[ml]  # Adds semantic similarity via FastEmbed (~80MB model)\n```\n\n**Step 2:** Use semantic quality validation:\n\n```python\nfrom cascadeflow.quality.semantic import SemanticQualityChecker\n\n# Initialize semantic checker (downloads model on first use)\nchecker = SemanticQualityChecker(\n    similarity_threshold=0.5,  # Minimum similarity score (0-1)\n    toxicity_threshold=0.7     # Maximum toxicity score (0-1)\n)\n\n# Validate query-response alignment\nquery = \"Explain Python decorators\"\nresponse = \"Decorators are a way to modify functions using @syntax...\"\n\nresult = checker.validate(query, response, check_toxicity=True)\n\nprint(f\"Similarity: {result.similarity:.2%}\")\nprint(f\"Passed: {result.passed}\")\nprint(f\"Toxic: {result.is_toxic}\")\n```\n\n**What you get:**\n- \ud83c\udfaf Semantic similarity scoring (query \u2194 response alignment)\n- \ud83d\udee1\ufe0f Optional toxicity detection\n- \ud83d\udd04 Automatic model download and caching\n- \ud83d\ude80 Fast inference (~100ms per check)\n\n**Full example:** See [semantic_quality_domain_detection.py](./examples/semantic_quality_domain_detection.py)\n\n</details>\n\n> **\u26a0\ufe0f GPT-5 Note:** GPT-5 streaming requires organization verification. Non-streaming works for all users. [Verify here](https://platform.openai.com/settings/organization/general) if needed (~15 min). Basic cascadeflow examples work without - GPT-5 is only called when needed (typically 20-30% of requests).\n\n\ud83d\udcd6 **Learn more:** [Python Documentation](./docs/README.md) | [Quickstart Guide](./docs/guides/quickstart.md) | [Providers Guide](./docs/guides/providers.md)\n\n### <img src=\".github/assets/CF_ts_color.svg\" width=\"24\" height=\"24\" alt=\"TypeScript\"/> TypeScript\n\n```bash\nnpm install @cascadeflow/core\n```\n\n```tsx\nimport { CascadeAgent, ModelConfig } from '@cascadeflow/core';\n\n// Same API as Python!\nconst agent = new CascadeAgent({\n  models: [\n    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },\n    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },\n  ],\n});\n\nconst result = await agent.run('What is TypeScript?');\nconsole.log(`Model: ${result.modelUsed}`);\nconsole.log(`Cost: $${result.totalCost}`);\nconsole.log(`Saved: ${result.savingsPercentage}%`);\n```\n\n<details>\n<summary><b>\ud83d\udca1 Optional: ML-based Semantic Quality Validation</b></summary>\n\nFor advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.\n\n**Step 1:** Install the optional ML packages:\n\n```bash\nnpm install @cascadeflow/ml @xenova/transformers\n```\n\n**Step 2:** Enable semantic validation in your cascade:\n\n```tsx\nimport { CascadeAgent, SemanticQualityChecker } from '@cascadeflow/core';\n\nconst agent = new CascadeAgent({\n  models: [\n    { name: 'gpt-4o-mini', provider: 'openai', cost: 0.000375 },\n    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },\n  ],\n  quality: {\n    threshold: 0.40,                    // Traditional confidence threshold\n    requireMinimumTokens: 5,            // Minimum response length\n    useSemanticValidation: true,        // Enable ML validation\n    semanticThreshold: 0.5,             // 50% minimum similarity\n  },\n});\n\n// Responses now validated for 
semantic alignment\nconst result = await agent.run('Explain TypeScript generics');\n```\n\n**Step 3:** Or use semantic validation directly:\n\n```tsx\nimport { SemanticQualityChecker } from '@cascadeflow/core';\n\nconst checker = new SemanticQualityChecker();\n\nif (await checker.isAvailable()) {\n  const result = await checker.checkSimilarity(\n    'What is TypeScript?',\n    'TypeScript is a typed superset of JavaScript.'\n  );\n\n  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);\n  console.log(`Passed: ${result.passed}`);\n}\n```\n\n**What you get:**\n- \ud83c\udfaf Query-response semantic alignment detection\n- \ud83d\udeab Off-topic response filtering\n- \ud83d\udce6 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)\n- \u26a1 Fast CPU inference (~50-100ms with caching)\n- \ud83d\udd04 Request-scoped caching (50% latency reduction)\n- \ud83c\udf10 Works in Node.js, Browser, and Edge Functions\n\n**Example:** [semantic-quality.ts](./packages/core/examples/nodejs/semantic-quality.ts)\n\n</details>\n\n\ud83d\udcd6 **Learn more:** [TypeScript Documentation](./packages/core/) | [Quickstart Guide](./docs/guides/quickstart-typescript.md) | [Node.js Examples](./packages/core/examples/nodejs/) | [Browser/Edge Guide](./docs/guides/browser_cascading.md)\n\n### \ud83d\udd04 Migration Example\n\n**Migrate in 5min from direct Provider implementation to cost savings and full cost control and transparency.**\n\n#### Before (Standard Approach)\n\nCost: $0.000113, Latency: 850ms\n\n```python\n# Using expensive model for everything\nresult = openai.chat.completions.create(\n    model=\"gpt-4o\",\n    messages=[{\"role\": \"user\", \"content\": \"What's 2+2?\"}]\n)\n```\n\n#### After (With cascadeflow)\n\nCost: $0.000007, Latency: 234ms\n\n```python\nagent = CascadeAgent(models=[\n    ModelConfig(name=\"gpt-4o-mini\", provider=\"openai\", cost=0.000375),\n    ModelConfig(name=\"gpt-4o\", provider=\"openai\", cost=0.00625),\n])\n\nresult = await agent.run(\"What's 2+2?\")\n```\n\n> **\ud83d\udd25 Saved:** $0.000106 (94% reduction), 3.6x faster\n\n\ud83d\udcca **Learn more:** [Cost Tracking Guide](./docs/guides/cost_tracking.md) | [Production Best Practices](./docs/guides/production.md) | [Performance Optimization](./docs/guides/performance.md)\n\n---\n\n## <img src=\".github/assets/CF_n8n_color.svg\" width=\"24\" height=\"24\" alt=\"n8n\"/> n8n Integration\n\nUse cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!\n\n### Installation\n\n1. Open n8n\n2. Go to **Settings** \u2192 **Community Nodes**\n3. Search for: `@cascadeflow/n8n-nodes-cascadeflow`\n4. Click **Install**\n\n### Quick Start\n\nCascadeFlow is a **Language Model sub-node** that connects two AI Chat Model nodes (drafter + verifier) and intelligently cascades between them:\n\n**Setup:**\n1. Add two **AI Chat Model nodes** (cheap drafter + powerful verifier)\n2. Add **CascadeFlow node** and connect both models\n3. Connect CascadeFlow to **Basic LLM Chain** or **Chain** nodes\n4. 
Check **Logs tab** to see cascade decisions in real-time!\n\n**Result:** 40-85% cost savings in your n8n workflows!\n\n**Features:**\n\n- \u2705 Works with any AI Chat Model node (OpenAI, Anthropic, Ollama, Azure, etc.)\n- \u2705 Mix providers (e.g., Ollama drafter + GPT-4o verifier)\n- \u2705 Real-time flow visualization in Logs tab\n- \u2705 Detailed metrics: confidence scores, latency, cost savings\n\n\n\n\ud83d\udd0c **Learn more:** [n8n Integration Guide](./packages/integrations/n8n/) | [n8n Documentation](./docs/guides/n8n_integration.md)\n\n---\n\n## Resources\n\n### Examples\n\n**<img src=\".github/assets/CF_python_color.svg\" width=\"20\" height=\"20\" alt=\"Python\" style=\"vertical-align: middle;\"/> Python Examples:**\n\n<details open>\n<summary><b>Basic Examples</b> - Get started quickly</summary>\n\n| Example | Description | Link |\n|---------|-------------|------|\n| **Basic Usage** | Simple cascade setup with OpenAI models | [View](./examples/basic_usage.py) |\n| **Preset Usage** | Use built-in presets for quick setup | [View](./docs/guides/presets.md) |\n| **Multi-Provider** | Mix multiple AI providers in one cascade | [View](./examples/multi_provider.py) |\n| **Reasoning Models**  | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | [View](./examples/reasoning_models.py) |\n| **Tool Execution** | Function calling and tool usage | [View](./examples/tool_execution.py) |\n| **Streaming Text** | Stream responses from cascade agents | [View](./examples/streaming_text.py) |\n| **Cost Tracking** | Track and analyze costs across queries | [View](./examples/cost_tracking.py) |\n\n</details>\n\n<details>\n<summary><b>Advanced Examples</b> - Production & customization</summary>\n\n| Example | Description | Link |\n|---------|-------------|------|\n| **Production Patterns** | Best practices for production deployments | [View](./examples/production_patterns.py) |\n| **FastAPI Integration** | Integrate cascades with FastAPI | [View](./examples/fastapi_integration.py) |\n| **Streaming Tools** | Stream tool calls and responses | [View](./examples/streaming_tools.py) |\n| **Batch Processing** | Process multiple queries efficiently | [View](./examples/batch_processing.py) |\n| **Multi-Step Cascade** | Build complex multi-step cascades | [View](./examples/multi_step_cascade.py) |\n| **Edge Device** | Run cascades on edge devices with local models | [View](./examples/edge_device.py) |\n| **vLLM Example** | Use vLLM for local model deployment | [View](./examples/vllm_example.py) |\n| **Multi-Instance Ollama** | Run draft/verifier on separate Ollama instances | [View](./examples/multi_instance_ollama.py) |\n| **Multi-Instance vLLM** | Run draft/verifier on separate vLLM instances | [View](./examples/multi_instance_vllm.py) |\n| **Custom Cascade** | Build custom cascade strategies | [View](./examples/custom_cascade.py) |\n| **Custom Validation** | Implement custom quality validators | [View](./examples/custom_validation.py) |\n| **User Budget Tracking** | Per-user budget enforcement and tracking | [View](./examples/user_budget_tracking.py) |\n| **User Profile Usage** | User-specific routing and configurations | [View](./examples/user_profile_usage.py) |\n| **Rate Limiting** | Implement rate limiting for cascades | [View](./examples/rate_limiting_usage.py) |\n| **Guardrails** | Add safety and content guardrails | [View](./examples/guardrails_usage.py) |\n| **Cost Forecasting** | Forecast costs and detect anomalies | [View](./examples/cost_forecasting_anomaly_detection.py) |\n| **Semantic 
Quality Detection** | ML-based domain and quality detection | [View](./examples/semantic_quality_domain_detection.py) |\n| **Profile Database Integration** | Integrate user profiles with databases | [View](./examples/profile_database_integration.py) |\n\n</details>\n\n**<img src=\".github/assets/CF_ts_color.svg\" width=\"20\" height=\"20\" alt=\"TypeScript\" style=\"vertical-align: middle;\"/> TypeScript Examples:**\n\n<details open>\n<summary><b>Basic Examples</b> - Get started quickly</summary>\n\n| Example | Description | Link |\n|---------|-------------|------|\n| **Basic Usage** | Simple cascade setup (Node.js) | [View](./packages/core/examples/nodejs/basic-usage.ts) |\n| **Tool Calling** | Function calling with tools (Node.js) | [View](./packages/core/examples/nodejs/tool-calling.ts) |\n| **Multi-Provider** | Mix providers in TypeScript (Node.js) | [View](./packages/core/examples/nodejs/multi-provider.ts) |\n| **Reasoning Models**  | Use reasoning models (o1/o3, Claude 3.7, DeepSeek-R1) | [View](./packages/core/examples/nodejs/reasoning-models.ts) |\n| **Cost Tracking** | Track and analyze costs across queries | [View](./packages/core/examples/nodejs/cost-tracking.ts) |\n| **Semantic Quality**  | ML-based semantic validation with embeddings | [View](./packages/core/examples/nodejs/semantic-quality.ts) |\n| **Streaming** | Stream responses in TypeScript | [View](./packages/core/examples/streaming.ts) |\n\n</details>\n\n<details>\n<summary><b>Advanced Examples</b> - Production & edge deployment</summary>\n\n| Example | Description | Link |\n|---------|-------------|------|\n| **Production Patterns** | Production best practices (Node.js) | [View](./packages/core/examples/nodejs/production-patterns.ts) |\n| **Multi-Instance Ollama** | Run draft/verifier on separate Ollama instances | [View](./packages/core/examples/nodejs/multi-instance-ollama.ts) |\n| **Multi-Instance vLLM** | Run draft/verifier on separate vLLM instances | [View](./packages/core/examples/nodejs/multi-instance-vllm.ts) |\n| **Browser/Edge** | Vercel Edge runtime example | [View](./packages/core/examples/browser/vercel-edge/) |\n\n</details>\n\n\ud83d\udcc2 **[View All Python Examples \u2192](./examples/)** | **[View All TypeScript Examples \u2192](./packages/core/examples/)**\n\n### Documentation\n\n<details open>\n<summary><b>Getting Started</b> - Core concepts and basics</summary>\n\n| Guide | Description | Link |\n|-------|-------------|------|\n| **Quickstart** | Get started with cascadeflow in 5 minutes | [Read](./docs/guides/quickstart.md) |\n| **Providers Guide** | Configure and use different AI providers | [Read](./docs/guides/providers.md) |\n| **Presets Guide** | Using and creating custom presets | [Read](./docs/guides/presets.md) |\n| **Streaming Guide** | Stream responses from cascade agents | [Read](./docs/guides/streaming.md) |\n| **Tools Guide** | Function calling and tool usage | [Read](./docs/guides/tools.md) |\n| **Cost Tracking** | Track and analyze API costs | [Read](./docs/guides/cost_tracking.md) |\n\n</details>\n\n<details>\n<summary><b>Advanced Topics</b> - Production, customization & integrations</summary>\n\n| Guide | Description | Link |\n|-------|-------------|------|\n| **Production Guide** | Best practices for production deployments | [Read](./docs/guides/production.md) |\n| **Performance Guide** | Optimize cascade performance and latency | [Read](./docs/guides/performance.md) |\n| **Custom Cascade** | Build custom cascade strategies | [Read](./docs/guides/custom_cascade.md) |\n| **Custom 
Validation** | Implement custom quality validators | [Read](./docs/guides/custom_validation.md) |\n| **Edge Device** | Deploy cascades on edge devices | [Read](./docs/guides/edge_device.md) |\n| **Browser Cascading** | Run cascades in the browser/edge | [Read](./docs/guides/browser_cascading.md) |\n| **FastAPI Integration** | Integrate with FastAPI applications | [Read](./docs/guides/fastapi.md) |\n| **n8n Integration** | Use cascadeflow in n8n workflows | [Read](./docs/guides/n8n_integration.md) |\n\n</details>\n\n\ud83d\udcda **[View All Documentation \u2192](./docs/)**\n\n---\n\n## Features\n\n| **Feature** | **Benefit**                                                                                                                            |\n| --- |----------------------------------------------------------------------------------------------------------------------------------------|\n| \ud83c\udfaf **Speculative Cascading** | Tries cheap models first, escalates intelligently                                                                                      |\n| \ud83d\udcb0 **40-85% Cost Savings** | Research-backed, proven in production                                                                                                  |\n| \u26a1 **2-10x Faster** | Small models respond in <50ms vs 500-2000ms                                                                                            |\n| \u26a1 **Low Latency**  | Sub-2ms framework overhead, negligible performance impact                                                                              |\n| \ud83d\udd04 **Mix Any Providers**  | OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional)                                                                   |\n| \ud83d\udc64 **User Profile System**  | Per-user budgets, tier-aware routing, enforcement callbacks                                                                            |\n| \u2705 **Quality Validation**  | Automatic checks + semantic similarity (optional ML, ~80MB, CPU)                                                                       |\n| \ud83c\udfa8 **Cascading Policies**  | Domain-specific pipelines, multi-step validation strategies                                                                            |\n| \ud83e\udde0 **Domain Understanding**  | Auto-detects code/medical/legal/math/structured data, routes to specialists                                                            |\n| \ud83e\udd16 **Drafter/Validator Pattern** | 20-60% savings for agent/tool systems                                                                                                  |\n| \ud83d\udd27 **Tool Calling Support**  | Universal format, works across all providers                                                                                           |\n| \ud83d\udcca **Cost Tracking**  | Built-in analytics + OpenTelemetry export (vendor-neutral)                                                                             |\n| \ud83d\ude80 **3-Line Integration** | Zero architecture changes needed                                                                                                       |\n| \ud83c\udfed **Production Ready**  | Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection |\n\n---\n\n## License\n\nMIT \u00a9  see [LICENSE](https://github.com/lemony-ai/cascadeflow/blob/main/LICENSE) file.\n\nFree for commercial use. 

---

## License

MIT © Lemony Inc. See the [LICENSE](https://github.com/lemony-ai/cascadeflow/blob/main/LICENSE) file.

Free for commercial use. Attribution appreciated but not required.

---

## Contributing

We ❤️ contributions!

📝 [**Contributing Guide**](./CONTRIBUTING.md) - Python & TypeScript development setup

---

## Roadmap

- **Cascade Profiler** - Analyzes your AI API logs to estimate cost-savings potential and automatically generate optimized cascadeflow configurations
- **User Tier Management** - Cost controls and limits per user tier with advanced routing
- **Semantic Quality Validators** - Optional lightweight local quality scoring (200MB CPU model, no external API calls)
- **Code Complexity Detection** - Dynamic cascading based on task complexity analysis
- **Domain-Aware Cascading** - Multi-stage pipelines tailored to specific domains
- **Benchmark Reports** - Automated performance and cost benchmarking

---

## Support

- 📖 [**GitHub Discussions**](https://github.com/lemony-ai/cascadeflow/discussions) - Searchable Q&A
- 🐛 [**GitHub Issues**](https://github.com/lemony-ai/cascadeflow/issues) - Bug reports & feature requests
- 📧 [**Email Support**](mailto:hello@lemony.ai) - Direct support

---

## Citation

If you use cascadeflow in your research or project, please cite:

```bibtex
@software{cascadeflow2025,
  author = {{Lemony Inc.} and Buehrle, Sascha and Contributors},
  title = {cascadeflow: Smart AI model cascading for cost optimization},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}
```

**Ready to cut your AI costs by 40-85%?**

```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

[Read the Docs](./docs/) • [View Python Examples](./examples/) • [View TypeScript Examples](./packages/core/examples/) • [Join Discussions](https://github.com/lemony-ai/cascadeflow/discussions)

---

## About

**Built with ❤️ by [Lemony Inc.](https://lemony.ai/) and the cascadeflow Community**

One cascade. Hundreds of specialists.

New York | Zurich

**⭐ Star us on GitHub if cascadeflow helps you save money!**

---

Raw PyPI metadata for the 0.5.0 release:

```json
{
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Smart AI model cascading for cost optimization - Save 40-85% on LLM costs with 2-6x faster responses. Available for Python and TypeScript/JavaScript.",
    "version": "0.5.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/lemony-ai/cascadeflow/issues",
        "Changelog": "https://github.com/lemony-ai/cascadeflow/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/lemony-ai/cascadeflow",
        "Homepage": "https://lemony.ai",
        "Repository": "https://github.com/lemony-ai/cascadeflow"
    },
    "split_keywords": [
        "ai",
        " llm",
        " cost-optimization",
        " model-routing",
        " cascade",
        " inference",
        " openai",
        " anthropic",
        " gpt",
        " claude",
        " machine-learning",
        " groq",
        " typescript",
        " javascript",
        " browser",
        " edge-functions"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f988d47c44a068743e2b764dce9ec0cf62b5a89fca4816196121672156c7dcfc",
                "md5": "2fb8c64c5f965bdc84f24d990731fd97",
                "sha256": "60ad7400ed92e63bcb5402f76260ea6a5beee29a40c884b4ce75a51fb4739d24"
            },
            "downloads": -1,
            "filename": "cascadeflow-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2fb8c64c5f965bdc84f24d990731fd97",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 331524,
            "upload_time": "2025-11-07T18:44:25",
            "upload_time_iso_8601": "2025-11-07T18:44:25.785311Z",
            "url": "https://files.pythonhosted.org/packages/f9/88/d47c44a068743e2b764dce9ec0cf62b5a89fca4816196121672156c7dcfc/cascadeflow-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b6851a89e5cfdd6ec94ea8d228760aa31d1e96851947408f7e24684c7cc28f17",
                "md5": "bc949b78c2bbd1ce5fa85507c7b651e1",
                "sha256": "4282b6be32dcf5f7002957e4577ac5a1df12289c7e45e30c2d999b8337f16ded"
            },
            "downloads": -1,
            "filename": "cascadeflow-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "bc949b78c2bbd1ce5fa85507c7b651e1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 359993,
            "upload_time": "2025-11-07T18:44:27",
            "upload_time_iso_8601": "2025-11-07T18:44:27.497180Z",
            "url": "https://files.pythonhosted.org/packages/b6/85/1a89e5cfdd6ec94ea8d228760aa31d1e96851947408f7e24684c7cc28f17/cascadeflow-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-11-07 18:44:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lemony-ai",
    "github_project": "cascadeflow",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pydantic",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "httpx",
            "specs": [
                [
                    ">=",
                    "0.25.0"
                ]
            ]
        },
        {
            "name": "tiktoken",
            "specs": [
                [
                    ">=",
                    "0.5.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    ">=",
                    "13.0.0"
                ]
            ]
        }
    ],
    "lcname": "cascadeflow"
}
```
        
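
The payload above follows the public PyPI JSON API schema, so the release files can be verified programmatically. Below is a short sketch using `httpx` (itself one of cascadeflow's dependencies); the endpoint, the `urls` entries, and the `sha256` digests are exactly the fields shown above, and the nested `requirements` specs map to plain pip specifiers such as `pydantic>=2.0.0`.

```python
# Download each release file listed by PyPI's JSON API and check its
# sha256 digest against the published value shown in the metadata above.
import hashlib
import httpx

meta = httpx.get("https://pypi.org/pypi/cascadeflow/0.5.0/json").json()
for file_info in meta["urls"]:
    data = httpx.get(file_info["url"], follow_redirects=True).content
    ok = hashlib.sha256(data).hexdigest() == file_info["digests"]["sha256"]
    print(f"{file_info['filename']}: {'sha256 OK' if ok else 'MISMATCH'}")
```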