# plsno429

PyPI metadata:

- **Version**: 0.2.2
- **Summary**: A tiny Python library that politely says "pls no 429" by auto-handling OpenAI rate limits.
- **Uploaded**: 2025-08-24 23:38:15
- **Requires Python**: >=3.12, <4.0
- **Keywords**: 429-error, api-client, artificial-intelligence, async, backoff, http-client, openai, rate-limiting, retry, throttling
# OpenAI Throttling Library (plsno429)

A simple Python library that automatically handles 429 rate limit errors for OpenAI API calls with multiple throttling algorithms.

## Features

- 🚀 **Simple Usage**: Just add one decorator
- 🔄 **Retry-After Header Support**: Uses OpenAI's exact retry timing
- 📚 **Multiple HTTP Libraries**: Works with requests, httpx, and OpenAI SDK
- ⚡ **Sync/Async Support**: Both synchronous and asynchronous functions
- 🎲 **Request Distribution**: Jitter prevents concurrent request collisions
- 🧠 **Multiple Throttling Algorithms**: Choose the best strategy for your use case
- 📊 **TPM Limit Awareness**: Minute-level throttling to avoid TPM limits
- 🔧 **Adaptive Learning**: Learns from 429 patterns and adjusts automatically
- 📦 **Minimal Dependencies**: Only uses standard library

## Installation

```bash
uv add plsno429
```

## Quick Start

### Basic Usage (All Libraries)

```python
from plsno429 import throttle_requests, throttle_httpx, throttle_httpx_async, throttle_openai, throttle_openai_async

# Simple retry-based throttling
@throttle_requests()
def simple_call():
    ...  # your API call here

# Adaptive throttling with TPM awareness
@throttle_requests(algorithm="adaptive", tpm_limit=90000)
def smart_call():
    ...  # your API call with intelligent throttling
```

## Throttling Algorithms

### 1. Simple Retry (`algorithm="retry"`) - Default

**How it works**: A reactive approach that waits for 429 errors, then retries with exponential backoff. 

The algorithm doubles the delay after each failed attempt: 1s → 2s → 4s → 8s, etc. If the API provides a `Retry-After` header, it uses that exact timing instead of exponential backoff. This is the simplest and most reliable approach for basic use cases.

**When to use**: 
- Simple applications with occasional API calls
- When you want predictable, well-tested behavior
- Low to medium request volumes
- Getting started with rate limiting

```python
@throttle_requests(
    algorithm="retry",
    max_retries=3,        # Stop after 3 failed attempts
    base_delay=1.0,       # Start with 1 second delay
    max_delay=60.0,       # Never wait more than 60 seconds
    backoff_multiplier=2.0  # Double delay each time
)
```

**Pros**: Simple, reliable, works with any API, respects server hints  
**Cons**: Reactive only (waits for errors), may waste time on repeated failures
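
For intuition, the delay calculation described above can be sketched in a few lines. This is an illustrative sketch using the documented parameter names, not plsno429's internal code:

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=60.0,
                  backoff_multiplier=2.0, retry_after=None, jitter=False):
    """Delay in seconds before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # a server-provided Retry-After hint takes precedence
        delay = float(retry_after)
    else:
        # exponential backoff: 1s -> 2s -> 4s -> 8s ..., capped at max_delay
        delay = min(base_delay * backoff_multiplier ** attempt, max_delay)
    if jitter:
        # a small random offset spreads concurrent clients apart
        delay += random.uniform(0.0, delay * 0.1)
    return delay
```

With the defaults this reproduces the 1s → 2s → 4s → 8s sequence, capping at 60 seconds.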

### 2. Token Bucket (`algorithm="token_bucket"`)

**How it works**: Implements a classic token bucket for smooth rate limiting. Imagine a bucket that holds tokens: each API request consumes some, and the bucket refills at a steady rate.

The algorithm smooths out sudden spikes while still allowing short bursts up to the bucket's capacity. If a request needs 100 tokens but only 50 are available, it calculates exactly how long to wait for the remaining 50 tokens to be added to the bucket.

**When to use**:
- Steady, predictable request rates
- Applications that need to handle bursts gracefully  
- When you know your token consumption patterns
- Production apps with consistent load

```python
@throttle_requests(
    algorithm="token_bucket",
    tpm_limit=90000,      # Total TPM limit
    burst_size=1000,      # Max tokens in bucket (allows bursts)
    refill_rate=1500,     # Add 1500 tokens per second
    token_estimate_func=custom_estimator  # Custom token counting
)
```

**Pros**: Smooth rate limiting, allows bursts, predictable behavior  
**Cons**: Requires accurate token estimation, more complex than retry
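
As an illustration of the mechanism (not the library's actual class), a token bucket fits in a few lines:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at refill_rate tokens/sec up to burst_size."""

    def __init__(self, burst_size=1000, refill_rate=1500, now=time.monotonic):
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.refill_rate = refill_rate
        self.now = now          # injectable clock, handy for testing
        self.last = now()

    def _refill(self):
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.refill_rate)
        self.last = t

    def acquire(self, needed):
        """Return 0.0 if tokens were consumed, else seconds to wait for them."""
        self._refill()
        if self.tokens >= needed:
            self.tokens -= needed
            return 0.0
        return (needed - self.tokens) / self.refill_rate
```

The caller sleeps for the returned duration and retries, which is exactly the "calculate how long to wait" step described above.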

### 3. Adaptive Learning (`algorithm="adaptive"`)

**How it works**: The smartest algorithm that learns from your actual usage patterns and API behavior. It tracks success rates, response times, and 429 error patterns to predict optimal delays.

The algorithm adjusts its behavior based on:
- **Time-of-day patterns**: Different delays for peak vs. off-peak hours
- **Success rate feedback**: Increases delays if seeing many failures
- **429 error clustering**: Detects when errors come in waves
- **Server response analysis**: Learns from `Retry-After` headers

**When to use**:
- Production applications with varying loads
- Long-running applications that can learn over time
- Complex usage patterns (batch processing + real-time)
- When you want "set it and forget it" behavior

```python
@throttle_requests(
    algorithm="adaptive",
    tpm_limit=90000,
    learning_window=100,    # Analyze last 100 requests
    adaptation_rate=0.1,    # How quickly to adapt (0.0-1.0) 
    min_delay=0.1,         # Never go below 100ms
    max_delay=300.0        # Never wait more than 5 minutes
)
```

**Pros**: Self-optimizing, learns patterns, prevents 429s proactively, handles varying loads  
**Cons**: Complex behavior, needs warm-up period, harder to debug
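
The exact learning logic is internal to the library, but one plausible reading of `adaptation_rate` is an exponential-moving-average update on the current delay, sketched here with hypothetical names:

```python
def adapt_delay(current_delay, got_429, adaptation_rate=0.1,
                min_delay=0.1, max_delay=300.0):
    """Hypothetical update rule: back off on a 429, decay on success,
    smoothed by adaptation_rate, clamped to [min_delay, max_delay]."""
    target = current_delay * 2.0 if got_429 else current_delay * 0.5
    new = (1 - adaptation_rate) * current_delay + adaptation_rate * target
    return max(min_delay, min(max_delay, new))
```

A higher `adaptation_rate` moves the delay toward the target faster; `min_delay` and `max_delay` bound the result exactly as in the configuration above.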

### 4. Sliding Window (`algorithm="sliding_window"`)

**How it works**: Maintains a precise sliding time window of all requests. Unlike simple counters that reset every minute, this tracks the exact timestamp of each request and continuously slides the window forward.

For example, with a 60-second window allowing 1500 requests: at 2:30:45, it counts all requests from 2:29:45 to 2:30:45. This provides the most accurate rate limiting possible.

**When to use**:
- High-volume applications with strict rate limits
- When you need precise control over request timing
- Applications that can't afford to exceed rate limits
- Compliance-critical environments

```python
@throttle_requests(
    algorithm="sliding_window",
    window_size=60,       # 60-second sliding window
    max_requests=1500,    # Max 1500 requests per window
    cleanup_interval=10,  # Clean up old entries every 10s
    tpm_limit=90000      # Also respect TPM limits
)
```

**Pros**: Most precise rate limiting, prevents violations, handles bursts well  
**Cons**: Higher memory usage, more CPU overhead for large volumes
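
The window bookkeeping described above can be sketched with a deque of timestamps. This is illustrative only, not the library's implementation:

```python
from collections import deque

class SlidingWindow:
    """Track request timestamps; admit if fewer than max_requests in the window."""

    def __init__(self, window_size=60, max_requests=1500):
        self.window = window_size
        self.max_requests = max_requests
        self.stamps = deque()

    def try_acquire(self, now):
        """Return 0.0 if the request is admitted, else seconds until a slot frees."""
        # drop timestamps that have slid out of the window
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.max_requests:
            self.stamps.append(now)
            return 0.0
        # wait until the oldest request in the window expires
        return self.stamps[0] + self.window - now
```

Memory grows with `max_requests`, which is the trade-off noted in the cons above.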

### 5. Circuit Breaker (`algorithm="circuit_breaker"`)

**How it works**: Implements the circuit breaker pattern to prevent cascading failures. Like an electrical circuit breaker, it has three states:

- **Closed**: Normal operation, requests pass through
- **Open**: Too many failures detected, blocks all requests temporarily  
- **Half-Open**: Testing if the service has recovered

This prevents your application from overwhelming a failing service and allows graceful degradation.

**When to use**:
- Mission-critical applications
- Services that must handle downstream failures gracefully
- When you need to prevent cascading failures
- Applications with multiple API dependencies

```python
@throttle_requests(
    algorithm="circuit_breaker",
    failure_threshold=5,    # Open after 5 consecutive failures
    recovery_timeout=300,   # Wait 5 minutes before testing recovery
    half_open_max_calls=3   # Test with max 3 calls in half-open state
)
```

**Pros**: Prevents cascade failures, fast failure detection, graceful degradation  
**Cons**: May be overly conservative, can block valid requests, complex state management
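
A minimal sketch of the three-state machine, using the documented parameter names (the library's actual internals may differ):

```python
class CircuitBreaker:
    """CLOSED -> OPEN after failure_threshold consecutive failures;
    OPEN -> HALF_OPEN after recovery_timeout; HALF_OPEN closes on success."""

    def __init__(self, failure_threshold=5, recovery_timeout=300,
                 half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.half_open_calls = 0

    def allow(self, now):
        """Should this request be allowed through?"""
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"       # start testing recovery
                self.half_open_calls = 0
            else:
                return False                    # still cooling down
        if self.state == "half_open":
            if self.half_open_calls >= self.half_open_max_calls:
                return False
            self.half_open_calls += 1
        return True

    def record(self, success, now):
        """Feed the outcome of a request back into the state machine."""
        if success:
            self.state = "closed"
            self.failures = 0
        else:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = now
                self.failures = 0
```

A blocked request here corresponds to the `CircuitBreakerOpen` exception shown in Example 3 below.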

## Configuration Options

### Common Options (All Algorithms)

| Parameter | Default | Description |
|-----------|---------|-------------|
| `algorithm` | "retry" | Throttling algorithm to use |
| `jitter` | True | Add random delay to distribute requests |
| `tpm_limit` | 90000 | Tokens per minute limit |
| `safety_margin` | 0.9 | Stop at 90% of TPM limit |
| `max_wait_minutes` | 5 | Maximum wait time in minutes |

### Algorithm-Specific Options

#### Retry Algorithm
| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_retries` | 3 | Maximum number of retry attempts |
| `base_delay` | 1.0 | Base delay in seconds |
| `max_delay` | 60.0 | Maximum delay in seconds |
| `backoff_multiplier` | 2.0 | Exponential backoff multiplier |

#### Token Bucket Algorithm
| Parameter | Default | Description |
|-----------|---------|-------------|
| `burst_size` | 1000 | Maximum burst tokens |
| `refill_rate` | 1500 | Tokens per second refill rate |
| `token_estimate_func` | None | Custom token estimation function |

#### Adaptive Algorithm
| Parameter | Default | Description |
|-----------|---------|-------------|
| `learning_window` | 100 | Number of requests to analyze |
| `adaptation_rate` | 0.1 | How quickly to adapt (0.0-1.0) |
| `min_delay` | 0.1 | Minimum delay between requests |
| `max_delay` | 300.0 | Maximum adaptive delay |

#### Sliding Window Algorithm
| Parameter | Default | Description |
|-----------|---------|-------------|
| `window_size` | 60 | Time window in seconds |
| `max_requests` | 1500 | Max requests per window |
| `cleanup_interval` | 10 | Cleanup old entries interval |

#### Circuit Breaker Algorithm
| Parameter | Default | Description |
|-----------|---------|-------------|
| `failure_threshold` | 5 | Failures before opening circuit |
| `recovery_timeout` | 300 | Seconds before attempting recovery |
| `half_open_max_calls` | 3 | Max calls in half-open state |

## Advanced Features

### TPM-Aware Throttling
Automatically tracks tokens per minute and waits until the next minute boundary when approaching limits.

### Pattern Learning
Adaptive algorithm learns from:
- Time-of-day patterns
- Consecutive 429 error patterns  
- Success/failure ratios
- Response time trends

### Minute-Level Recovery
When TPM limits are hit, automatically waits until the next minute boundary instead of arbitrary delays.
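
Computing the wait until the next minute boundary is simple arithmetic; a sketch (the helper name is hypothetical):

```python
def seconds_to_next_minute(now):
    """Seconds to sleep so the next request lands in a fresh TPM window.
    Returns 0.0 when `now` is already exactly on a minute boundary."""
    return (60.0 - now % 60.0) % 60.0
```

So 30 seconds into a minute, the throttle sleeps 30 seconds rather than some arbitrary backoff.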

### Multi-Model Support
Different TPM limits for different OpenAI models:

```python
@throttle_requests(
    algorithm="adaptive",
    model_limits={
        "gpt-4": 90000,
        "gpt-3.5-turbo": 90000,
        "text-embedding-ada-002": 1000000
    }
)
```

## How Different Algorithms Work

### Retry Algorithm Flow
```
Request comes in
├── Execute function immediately
├── Success? → Return result ✅
└── 429 error?
    ├── Check retry count < max_retries?
    │   ├── Yes: Parse Retry-After header or use exponential backoff
    │   ├── Wait calculated delay (with jitter)
    │   └── Retry request
    └── No: Raise exception ❌
```

### Token Bucket Flow
```
Request comes in
├── Refill bucket based on time elapsed
├── Estimate tokens needed for request
├── Enough tokens in bucket?
│   ├── Yes: Consume tokens → Execute request → Return result ✅
│   └── No: Calculate wait time for token availability
└── Wait for tokens → Retry token check
```

### Adaptive Flow
```
Request comes in
├── Analyze historical patterns (success rate, timing, 429s)
├── Calculate optimal delay based on learning
├── Time since last request > optimal delay?
│   ├── Yes: Execute immediately
│   └── No: Wait remaining time
├── Execute request
├── Record outcome (success/failure, tokens, delay)
└── Update learning model for future requests
```

### Sliding Window Flow
```
Request comes in
├── Clean up old requests outside window
├── Count current requests in sliding window
├── Current count < max_requests?
│   ├── Yes: Add timestamp to window → Execute → Return ✅
│   └── No: Calculate wait time until oldest request expires
└── Wait for window to slide → Retry count check
```

### Circuit Breaker Flow
```
Request comes in
├── Check circuit state
├── CLOSED (normal): Execute request
│   ├── Success: Reset failure counter
│   └── Failure: Increment counter → Threshold reached? → Open circuit
├── OPEN (blocking): Check recovery timeout
│   ├── Timeout passed: Transition to HALF_OPEN
│   └── Still cooling down: Block request ❌
└── HALF_OPEN (testing): Allow limited test requests
    ├── Success: Close circuit (recovery successful)
    └── Failure: Back to OPEN (still failing)
```

## Use Cases by Algorithm

### Retry Algorithm
- **Best for**: Simple applications, low request volume
- **Pros**: Simple, reliable, works with any API
- **Cons**: Reactive only, may waste time on retries

### Token Bucket Algorithm  
- **Best for**: Steady request rates, predictable workloads
- **Pros**: Smooth rate limiting, allows bursts
- **Cons**: Requires accurate token estimation

### Adaptive Algorithm
- **Best for**: Production applications, varying workloads
- **Pros**: Self-optimizing, learns patterns, prevents 429s
- **Cons**: More complex, requires warm-up period

### Sliding Window Algorithm
- **Best for**: High-volume applications with strict rate limits
- **Pros**: Precise rate limiting, prevents TPM violations
- **Cons**: Memory overhead for tracking requests

### Circuit Breaker Algorithm
- **Best for**: Critical applications, cascading failure prevention
- **Pros**: Prevents system overload, fast failure recovery
- **Cons**: May be overly conservative

## Performance Comparison

| Algorithm | Memory Usage | CPU Overhead | Learning Ability | Prevention vs Reaction |
|-----------|--------------|--------------|------------------|----------------------|
| Retry | Low | Very Low | None | Reactive |
| Token Bucket | Low | Low | None | Preventive |
| Adaptive | Medium | Medium | High | Predictive |
| Sliding Window | High | Medium | None | Preventive |
| Circuit Breaker | Low | Low | Basic | Protective |

## Real-World Examples

### Example 1: Data Processing Pipeline (Adaptive)
```python
from plsno429 import throttle_openai
import openai

@throttle_openai(
    algorithm="adaptive",
    tpm_limit=90000,
    learning_window=50,
    adaptation_rate=0.15,  # Learn relatively quickly
    min_delay=0.05,        # Very responsive
    max_delay=120.0        # Don't wait too long
)
def process_document_batch(documents):
    """Process a batch of documents with adaptive learning."""
    results = []
    for doc in documents:
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"Summarize: {doc}"}]
        )
        results.append(response.choices[0].message.content)
    return results
```

### Example 2: High-Volume API Service (Sliding Window)
```python
from plsno429 import throttle_requests
import asyncio
import httpx

@throttle_requests(
    algorithm="sliding_window",
    window_size=60,        # 1-minute window
    max_requests=1500,     # OpenAI's RPM limit
    tpm_limit=90000,       # Also respect TPM
    cleanup_interval=5     # Clean up frequently
)
async def translation_service(texts):
    """High-volume translation service with precise rate limiting."""
    async with httpx.AsyncClient() as client:
        tasks = []
        for text in texts:
            task = client.post(
                "https://api.openai.com/v1/chat/completions",
                json={
                    "model": "gpt-3.5-turbo",
                    "messages": [{"role": "user", "content": f"Translate: {text}"}]
                }
            )
            tasks.append(task)
        
        results = await asyncio.gather(*tasks)
        return [r.json() for r in results]
```

### Example 3: Mission-Critical Application (Circuit Breaker)
```python
from plsno429 import throttle_openai
from plsno429.exceptions import CircuitBreakerOpen
import logging
import openai

@throttle_openai(
    algorithm="circuit_breaker",
    failure_threshold=3,     # Open after 3 failures
    recovery_timeout=180,    # 3-minute recovery
    half_open_max_calls=2    # Test with 2 calls
)
def critical_ai_service(prompt):
    """Mission-critical service with graceful degradation."""
    try:
        return openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
    except CircuitBreakerOpen as e:
        logging.warning(f"AI service unavailable: {e}")
        # Fallback to cached responses or simpler processing
        return {"fallback": "Service temporarily unavailable"}
```

### Example 4: Multi-Model Application (Model-Specific Limits)
```python
from plsno429 import throttle_openai
import openai

def extract_model_from_request(**kwargs):
    """Extract model name from request parameters."""
    return kwargs.get('model', 'gpt-3.5-turbo')

@throttle_openai(
    algorithm="adaptive",
    model_func=extract_model_from_request,
    model_limits={
        "gpt-4": 40000,                    # Lower limit for expensive model
        "gpt-3.5-turbo": 90000,           # Standard limit
        "text-embedding-ada-002": 1000000  # Higher limit for embeddings
    },
    tpm_limit=90000  # Global fallback limit
)
def multi_model_ai_service(prompt, model="gpt-3.5-turbo", task_type="chat"):
    """Service supporting multiple models with different limits."""
    if task_type == "embedding":
        return openai.Embedding.create(
            model="text-embedding-ada-002",
            input=prompt
        )
    else:
        return openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
```

## Requirements

- Python 3.12+
- No external dependencies (uses only standard library)
- Compatible with requests, httpx, and OpenAI Python SDK

## Development

### Install dependencies

```bash
uv sync --group dev --group docs
```

### Run tests

```bash
uv run pytest
```

### Formatting and linting

```bash
uv run ruff format
uv run ruff check --fix .
```

### Build package

```bash
uv build
```

## License

Apache 2.0 License

## Contributing

Issues and pull requests are welcome!

## Notes

- Choose algorithm based on your specific use case and requirements
- Adaptive algorithm provides best long-term performance but needs time to learn
- All algorithms respect OpenAI's Retry-After headers when provided
- Consider your application's latency requirements when choosing algorithms
- Circuit breaker is recommended for mission-critical applications
            

       \"gpt-3.5-turbo\": 90000,           # Standard limit\n        \"text-embedding-ada-002\": 1000000  # Higher limit for embeddings\n    },\n    tpm_limit=90000  # Global fallback limit\n)\ndef multi_model_ai_service(prompt, model=\"gpt-3.5-turbo\", task_type=\"chat\"):\n    \"\"\"Service supporting multiple models with different limits.\"\"\"\n    if task_type == \"embedding\":\n        return openai.Embedding.create(\n            model=\"text-embedding-ada-002\",\n            input=prompt\n        )\n    else:\n        return openai.ChatCompletion.create(\n            model=model,\n            messages=[{\"role\": \"user\", \"content\": prompt}]\n        )\n```\n\n## Requirements\n\n- Python 3.12+\n- No external dependencies (uses only standard library)\n- Compatible with requests, httpx, and OpenAI Python SDK\n\n## Development\n\n### Install dependencies\n\n```bash\nuv sync --group dev --group docs\n```\n\n### Run tests\n\n```bash\nuv run pytest\n```\n\n### Formatting and linting\n\n```bash\nuv run ruff format\nuv run ruff check --fix .\n```\n\n### Build package\n\n```bash\nuv build\n```\n\n## License\n\nApache 2.0 License\n\n## Contributing\n\nIssues and pull requests are welcome!\n\n## Notes\n\n- Choose algorithm based on your specific use case and requirements\n- Adaptive algorithm provides best long-term performance but needs time to learn\n- All algorithms respect OpenAI's Retry-After headers when provided\n- Consider your application's latency requirements when choosing algorithms\n- Circuit breaker is recommended for mission-critical applications",