Name | skewsentry
Version | 0.1.1
Summary | Catch training ↔ serving feature skew before you ship to production
upload_time | 2025-08-28 22:38:27
home_page | None
maintainer | None
docs_url | None
author | None
requires_python | >=3.9
license | MIT License
Copyright (c) 2024 SkewSentry Authors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. |
keywords | ml, machine-learning, feature-engineering, data-validation, mlops, feature-store, training, serving, parity, testing
<div align="center">
<img src="assets/SkewSentry_Logo.png" alt="SkewSentry Logo" width="200">
# SkewSentry
**Catch training ↔ serving feature skew before you ship to production**
[Python 3.9+](https://www.python.org/downloads/)
[MIT License](https://opensource.org/licenses/MIT)
[Tested with pytest](https://pytest.org/)
*Prevent ML model failures with automated feature parity validation*
</div>
---
## 🚀 Why SkewSentry?
SkewSentry transforms fragile ML deployments into reliable production systems through automated feature parity validation.
### 💰 **Prevent Costly ML Failures**
- **Training/serving skew is a leading cause** of production ML failures
- **Months of silent degradation** before detection
- **Lost revenue and customer trust** from broken predictions
### ⚡ **Production-Ready Validation**
- **Pre-deployment detection** - Catch issues in CI before they ship
- **Configurable tolerances** - Handle expected differences intelligently
- **Multi-source support** - Python functions, HTTP APIs, any feature pipeline
- **Rich reporting** - HTML reports with detailed mismatch analysis
### 🔧 **Developer-First Design**
- **Zero configuration** - Works out of the box with intelligent defaults
- **CI integration** - Exit codes for automated validation gates
- **Multiple formats** - Text, JSON, and HTML reports for different use cases
## 📦 Installation
### Production
```bash
pip install skewsentry
```
### Development
```bash
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]"
```
## ⚡ Quickstart
### Basic Feature Parity Check
```bash
# Initialize spec from your data
skewsentry init features.yml --data validation.parquet --keys user_id timestamp

# Run parity check
skewsentry check \
  --spec features.yml \
  --offline training.pipeline:extract_features \
  --online serving.api:get_features \
  --data validation.parquet \
  --html report.html
# ✅ Exit 0: Features match within tolerance
# ❌ Exit 1: Parity violations detected (fails CI)
# 🚨 Exit 2: Configuration error
```
### Realistic Example: E-commerce Features
```yaml
# features.yml
version: 1
keys: ["user_id", "timestamp"]
features:
  - name: total_spend_7d
    dtype: float
    tolerance:
      abs: 0.01    # $0.01 absolute tolerance
      rel: 0.001   # 0.1% relative tolerance

  - name: order_count_30d
    dtype: int
    tolerance:
      abs: 1       # Allow 1 order difference
```
```python
# Offline pipeline (training) - assumes rows are sorted by timestamp
def extract_features(df):
    by_user = df.groupby("user_id").rolling("7D", on="timestamp")
    return df.assign(
        total_spend_7d=by_user["amount"].sum().reset_index(level=0, drop=True),
        order_count_30d=df.groupby("user_id").rolling("30D", on="timestamp")["amount"]
                          .count().reset_index(level=0, drop=True),
    )

# Online pipeline (serving) - subtle difference
def get_features(df):
    # closed="left" excludes the current row from the window - different windowing!
    by_user = df.groupby("user_id").rolling("7D", on="timestamp", closed="left")
    return df.assign(
        total_spend_7d=by_user["amount"].sum().reset_index(level=0, drop=True),
        order_count_30d=df.groupby("user_id").rolling("30D", on="timestamp")["amount"]
                          .count().reset_index(level=0, drop=True),
    )
```
**SkewSentry catches the windowing difference:**
```bash
❌ Feature parity violations detected:
- total_spend_7d: mismatch_rate=0.1200 rows=5000 mean_abs_diff=0.0845
```
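A toy reproduction of this class of bug (synthetic data invented for illustration; note that pandas' time-based windows already default to `closed="right"`, so a genuine skew needs something like `closed="left"` on one side):

```python
import pandas as pd

# Synthetic data: one user, one $10 purchase per day for ten days
df = pd.DataFrame({
    "user_id": ["u1"] * 10,
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "amount": [10.0] * 10,
})

spend = df.set_index("timestamp").groupby("user_id")["amount"]

# Training side: default time window (closed="right", includes the current row)
offline = spend.rolling("7D").sum().reset_index(level=0, drop=True)
# Serving side: closed="left" excludes the current row
online = spend.rolling("7D", closed="left").sum().reset_index(level=0, drop=True)

# NaN (the empty first window on the serving side) counts as a mismatch
mismatch_rate = (offline.fillna(-1.0) != online.fillna(-1.0)).mean()
print(f"mismatch_rate={mismatch_rate:.4f}")
```

Every early row differs by exactly one purchase, which is precisely the kind of silent, systematic skew a tolerance-based check surfaces.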
## 🏗️ Feature Adapters
SkewSentry works with any feature pipeline through adapters:
### Python Functions
```python
# Direct Python function integration
from skewsentry.adapters import PythonFunctionAdapter
adapter = PythonFunctionAdapter("mymodule:extract_features")
features = adapter.get_features(input_data)
```
### HTTP APIs
```python
# REST API integration with automatic batching
from skewsentry.adapters import HTTPAdapter
adapter = HTTPAdapter("http://api.example.com/features", timeout=30.0)
features = adapter.get_features(input_data)
```
## Usage
### Command Line Interface
#### Initialize Feature Spec
```bash
skewsentry init features.yml \
  --data sample_data.parquet \
  --keys user_id timestamp
```
#### Run Parity Check
```bash
skewsentry check \
  --spec features.yml \
  --offline module.offline:build_features \
  --online module.online:get_features \
  --data validation.parquet \
  --sample 10000 \
  --seed 42 \
  --html artifacts/report.html \
  --json artifacts/results.json
```
### Python API
```python
from skewsentry import FeatureSpec
from skewsentry.adapters.python import PythonFunctionAdapter
from skewsentry.adapters.http import HTTPAdapter
from skewsentry.runner import run_check
# Define feature comparison rules
spec = FeatureSpec.from_yaml("features.yml")
# Set up adapters for your pipelines
offline_adapter = PythonFunctionAdapter("training.pipeline:extract_features")
online_adapter = HTTPAdapter("https://api.myservice.com/features")
# Run comparison
report = run_check(
    spec=spec,
    data="validation_data.parquet",  # or DataFrame
    offline=offline_adapter,
    online=online_adapter,
    sample=5000,
    seed=42,
    html_out="report.html",
    json_out="results.json",
)

# Check results
if report.ok:
    print("✅ All features match within tolerance")
else:
    print("❌ Feature parity violations detected:")
    print(report.to_text(max_rows=10))

    # Fail CI/CD pipeline
    raise SystemExit(1)
```
## Feature Specification
SkewSentry uses YAML configuration to define feature comparison rules:
```yaml
version: 1
keys: ["user_id", "timestamp"]   # Row alignment keys
null_policy: "same"              # "same" | "allow_both_null"

features:
  # Numeric features with tolerance
  - name: spend_7d
    dtype: float
    nullable: true
    tolerance:
      abs: 0.01    # Absolute tolerance (optional)
      rel: 0.001   # Relative tolerance (optional)
    window:
      lookback_days: 7
      timestamp_col: "timestamp"
      closed: "right"

  # Categorical features with validation
  - name: country
    dtype: category
    categories: ["US", "UK", "DE", "FR"]  # Expected values
    nullable: false

  # Integer features with range validation
  - name: age
    dtype: int
    nullable: false
    range: [0, 120]  # [min, max] bounds

  # String features (exact match)
  - name: user_segment
    dtype: string
    nullable: true

  # DateTime features (exact match)
  - name: last_login
    dtype: datetime
    nullable: true
```
### Supported Data Types
| Type | Comparison | Tolerance | Notes |
|------|------------|-----------|-------|
| `int` | Numeric | ✅ abs/rel | Coerced to float for comparison |
| `float` | Numeric | ✅ abs/rel | NaN handling per null_policy |
| `bool` | Exact | ❌ | True/False only |
| `string` | Exact | ❌ | Case sensitive |
| `category` | Exact + Unknown detection | ❌ | Validates against expected categories |
| `datetime` | Exact | ❌ | Timezone aware |
### Tolerance Configuration
**Absolute Tolerance**: `|offline_value - online_value| ≤ abs_tolerance`
**Relative Tolerance**: `|offline_value - online_value| ≤ rel_tolerance × max(|offline_value|, |online_value|, ε)`
Either or both can be specified. If both are provided, the comparison passes if *either* tolerance is satisfied.
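The rule above can be sketched in a few lines (`within_tolerance` and `EPS` are hypothetical names for illustration, not SkewSentry's internal API):

```python
EPS = 1e-12  # the ε in the relative-tolerance formula; guards near-zero values

def within_tolerance(offline, online, abs_tol=None, rel_tol=None):
    """True if the pair passes the absolute OR the relative tolerance."""
    diff = abs(offline - online)
    if abs_tol is not None and diff <= abs_tol:
        return True
    if rel_tol is not None and diff <= rel_tol * max(abs(offline), abs(online), EPS):
        return True
    # No tolerance configured -> require exact equality
    return abs_tol is None and rel_tol is None and diff == 0

print(within_tolerance(100.0, 100.005, abs_tol=0.01))  # passes absolute check
print(within_tolerance(100.0, 100.05, rel_tol=0.001))  # 0.05 <= 0.001 * 100.05
print(within_tolerance(100.0, 100.05, abs_tol=0.01))   # fails: only abs configured
```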
## Adapters
SkewSentry supports multiple adapter types to connect with different feature pipeline architectures:
### Python Function Adapter
For in-process Python functions:
```python
from skewsentry.adapters.python import PythonFunctionAdapter
import pandas as pd

# Your feature function signature
def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """Extract features from input DataFrame.

    Args:
        df: Input DataFrame with raw data

    Returns:
        DataFrame with feature columns + key columns
    """
    return df[["user_id", "timestamp", "spend_7d", "country"]]

# Reference by module:function string
adapter = PythonFunctionAdapter("mypackage.features:extract_features")
```
### HTTP Adapter
For REST API endpoints:
```python
from skewsentry.adapters.http import HTTPAdapter
adapter = HTTPAdapter(
    url="https://features.myservice.com/batch",
    method="POST",
    headers={"Authorization": "Bearer token"},
    batch_size=1000,  # Records per request
    timeout=30.0,
    max_retries=3,
)
```
**Expected API Contract**:
- **Request**: JSON array of input records
- **Response**: JSON array of feature records (same order)
- **Status**: 200 for success, 4xx/5xx for errors
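The client side of that contract amounts to chunked POSTs whose responses are concatenated in order. A stdlib-only sketch (hypothetical helpers `batches` and `post_features`, not the real HTTPAdapter internals):

```python
import json
from urllib import request

def batches(records, batch_size):
    """Split the input into request-sized chunks, preserving order."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def post_features(url, records, batch_size=1000, timeout=30.0):
    """POST input records batch by batch; concatenate feature records in order."""
    out = []
    for batch in batches(records, batch_size):
        req = request.Request(
            url,
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # urlopen raises HTTPError on 4xx/5xx, matching the contract above
        with request.urlopen(req, timeout=timeout) as resp:
            out.extend(json.loads(resp.read()))
    return out
```

Because responses must come back in the same order as the request, concatenating batch responses preserves row alignment with the input.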
## Reporting
SkewSentry generates multiple report formats for different use cases:
### Text Report
```python
# Console-friendly summary
print(report.to_text(max_rows=10))
```
```
OK: False
Missing rows — offline: 0, online: 3
Per-feature mismatch rates:
- spend_7d: mismatch_rate=0.1200 rows=1000 mean_abs_diff=0.0845
- country: mismatch_rate=0.0000 rows=1000 mean_abs_diff=None
```
### JSON Report
```python
# Machine-readable results
report.to_json("results.json")
```
```json
{
"ok": false,
"keys": ["user_id", "timestamp"],
"missing_in_online": 3,
"missing_in_offline": 0,
"features": [
{
"name": "spend_7d",
"mismatch_rate": 0.12,
"num_rows": 1000,
"mean_abs_diff": 0.0845,
"unknown_categories": null
}
],
"failing_features": ["spend_7d"]
}
```
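The JSON report is convenient for scripts that want richer CI messages than a bare exit code. A sketch assuming the schema shown above (`summarize` is a hypothetical helper):

```python
import json

def summarize(results: dict) -> str:
    """Build a one-line CI summary from a SkewSentry JSON report."""
    if results["ok"]:
        return "parity OK"
    worst = max(results["features"], key=lambda f: f["mismatch_rate"])
    return (f"parity FAILED: {len(results['failing_features'])} feature(s), "
            f"worst={worst['name']} ({worst['mismatch_rate']:.2%})")

report = json.loads("""{"ok": false,
  "features": [{"name": "spend_7d", "mismatch_rate": 0.12}],
  "failing_features": ["spend_7d"]}""")
print(summarize(report))  # parity FAILED: 1 feature(s), worst=spend_7d (12.00%)
```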
### HTML Report
```python
# Rich visual report for stakeholders
report.to_html("report.html")
```
Interactive HTML report includes:
- Executive summary with pass/fail status
- Per-feature mismatch statistics
- Sample mismatched rows with differences highlighted
- Missing row analysis
- Feature distribution comparisons
## CI Integration
### GitHub Actions
```yaml
name: Feature Parity Check
on: [push, pull_request]

jobs:
  parity-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -e ".[dev]"

      - name: Run feature parity check
        run: |
          skewsentry check \
            --spec features.yml \
            --offline training.pipeline:extract_features \
            --online serving.api:get_features \
            --data tests/fixtures/validation.parquet \
            --html artifacts/parity-report.html \
            --json artifacts/parity-results.json

      - name: Upload report artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: parity-reports
          path: artifacts/
```
### Exit Codes
- **0**: All features match within specified tolerances ✅
- **1**: Feature parity violations detected ❌
- **2**: Configuration error or runtime failure 🚨
### Integration Patterns
**Pre-deployment Gate**:
```bash
# Block deployment if parity check fails
skewsentry check --spec features.yml --offline offline:fn --online online:fn --data validation.parquet
if [ $? -eq 1 ]; then
  echo "❌ Feature parity violations detected. Blocking deployment."
  exit 1
fi
```
**Model Registry Integration**:
```python
# Validate features before model registration
report = run_check(spec, data, offline_adapter, online_adapter)
if report.ok:
    model_registry.register_model(model, features=spec.features)
else:
    raise ValueError(f"Feature parity check failed: {report.failing_features}")
```
## Examples
### Real-World Bug Caught by SkewSentry
This is the exact type of production bug SkewSentry prevents:
```python
import math

# Training pipeline (offline) - Spark/Python
def extract_features(df):
    # Rolling 7-day sum with pandas semantics
    spend_7d = (
        df.groupby("user_id")["amount"]
          .rolling(7, min_periods=1)
          .sum()
          .round(2)
          .reset_index(level=0, drop=True)
    )
    return df.assign(spend_7d=spend_7d)

# Serving pipeline (online) - Java/Kafka Streams
# Translated to Python equivalent for illustration
def get_features(df):
    # Rolling 7-day sum with different window semantics
    spend_7d = (
        df.groupby("user_id")["amount"]
          .rolling(7, closed="left")
          .sum()
          .apply(lambda x: x if math.isnan(x) else math.floor(x * 100) / 100)
          .reset_index(level=0, drop=True)
    )
    return df.assign(spend_7d=spend_7d)
```
**The Differences**:
1. **Window boundaries**: `min_periods=1` vs `closed="left"`
2. **Rounding logic**: `round(2)` vs `floor() * 100 / 100`
**The Impact**: 12% of feature values differed by 0.01-0.15, causing model accuracy to drop from 94% to 89% in production.
**The Solution**: SkewSentry with `tolerance: {abs: 0.01}` caught this in CI:
```bash
❌ Feature parity violations detected:
- spend_7d: mismatch_rate=0.1200 rows=5000 mean_abs_diff=0.0845
```
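The rounding half of the bug is easy to reproduce in isolation: rounding and truncation disagree whenever a value sits above the midpoint of a cent (`round2` and `floor2` are hypothetical names mirroring the two pipelines above):

```python
import math

def round2(x):
    """Offline behavior: Python's round to 2 decimals."""
    return round(x, 2)

def floor2(x):
    """Online behavior: truncate downward at 2 decimals."""
    return math.floor(x * 100) / 100

for v in [10.991, 10.999]:
    print(v, round2(v), floor2(v), round2(v) == floor2(v))
# 10.999 rounds up to 11.0 but truncates to 10.99 -> a systematic 0.01 skew
```

An `abs: 0.01` tolerance is exactly tight enough to flag this class of difference.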
### Complete Example
See [`examples/python/`](examples/python/) for a runnable demonstration showing how SkewSentry catches windowing and rounding differences between offline and online pipelines.
## Development
### Setup
```bash
git clone https://github.com/your-org/skewsentry.git
cd skewsentry
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]"
```
### Testing
```bash
# Run all tests
uv run pytest
# With coverage (enforces 85%+)
uv run pytest --cov=skewsentry --cov-fail-under=85
# Run specific test categories
uv run pytest -k test_spec # Specification tests
uv run pytest -k test_adapter # Adapter tests
uv run pytest -m "e2e" # End-to-end integration tests
```
### Project Architecture
```
skewsentry/
├── __init__.py # Package exports
├── spec.py # FeatureSpec Pydantic models
├── inputs.py # Data loading and sampling
├── adapters/ # Pipeline adapters
│   ├── __init__.py
│   ├── base.py        # FeatureAdapter protocol
│   ├── python.py      # Python function adapter
│   └── http.py        # HTTP/REST API adapter
├── align.py # Row alignment by keys
├── compare.py # Feature comparison logic
├── runner.py # Pipeline orchestration
├── report.py # Report generation
├── cli.py # Command-line interface
├── errors.py # Exception classes
└── utils.py # Logging utilities
```
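The adapter seam in `base.py` can be pictured as a small structural protocol (a sketch assuming the `FeatureAdapter` name from the tree above; the real signature may differ):

```python
from typing import Protocol
import pandas as pd

class FeatureAdapter(Protocol):
    """Structural interface: anything with get_features(DataFrame) -> DataFrame."""
    def get_features(self, data: pd.DataFrame) -> pd.DataFrame: ...

class IdentityAdapter:
    """Toy adapter used here only to show the protocol is structural."""
    def get_features(self, data: pd.DataFrame) -> pd.DataFrame:
        return data.copy()

adapter: FeatureAdapter = IdentityAdapter()  # satisfies the protocol by shape
```

Because the protocol is structural, the Python-function and HTTP adapters need no common base class, only a matching `get_features` method.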
### Contributing
1. **Issues**: Report bugs or request features via GitHub Issues
2. **Pull Requests**: Fork, create feature branch, add tests, submit PR
3. **Testing**: All changes must include tests and maintain 85%+ coverage
4. **Documentation**: Update README and docstrings for new features
## Roadmap
### v0.2.0 - Enhanced Analysis
- [ ] Statistical significance testing (KS-test, chi-square)
- [ ] Feature drift detection over time
- [ ] SQL adapter for database sources
- [ ] Streaming data support
### v0.3.0 - Scale & Performance
- [ ] Spark/Dask backends for large datasets
- [ ] Distributed comparison for high-volume pipelines
- [ ] Advanced sampling strategies
- [ ] Performance benchmarking suite
### v0.4.0 - Production Features
- [ ] Web dashboard for monitoring
- [ ] Alert integrations (Slack, PagerDuty)
- [ ] Model performance correlation analysis
- [ ] Enterprise security features
---
**License**: MIT | **Python**: 3.9+ | **Maintained by**: Yasser El Haddar
*Prevent ML model failures before they reach production. Start validating your feature pipelines today.*
Raw data
{
"_id": null,
"home_page": null,
"name": "skewsentry",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "ml, machine-learning, feature-engineering, data-validation, mlops, feature-store, training, serving, parity, testing",
"author": null,
"author_email": "Yasser El Haddar <yasserelhaddar@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/95/e3/6f60c81c34e51affb880d51e53d105270b6c3ca5575582d44e52fcaf0dec/skewsentry-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"summary": "Catch training \u2194 serving feature skew before you ship to production",
"version": "0.1.1",
"project_urls": null,
"split_keywords": [
"ml",
" machine-learning",
" feature-engineering",
" data-validation",
" mlops",
" feature-store",
" training",
" serving",
" parity",
" testing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "da6e9bda141b3a8dccee4fe49b8e28a54e0a1141d7c5288ace5b59de303066b7",
"md5": "a965213e2663de85edec472fce4c71a9",
"sha256": "04d8091fb8a82ab8fff5315b3b509929a586311c15b155b07afff40245fb46f5"
},
"downloads": -1,
"filename": "skewsentry-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a965213e2663de85edec472fce4c71a9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 23995,
"upload_time": "2025-08-28T22:38:25",
"upload_time_iso_8601": "2025-08-28T22:38:25.793480Z",
"url": "https://files.pythonhosted.org/packages/da/6e/9bda141b3a8dccee4fe49b8e28a54e0a1141d7c5288ace5b59de303066b7/skewsentry-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "95e36f60c81c34e51affb880d51e53d105270b6c3ca5575582d44e52fcaf0dec",
"md5": "2db31f8c37131bfd5f0af4330e2f3d45",
"sha256": "5ba0d0e80b17772bbccbb9d05923b8f066e70a11e95763e11b94cf73055fb851"
},
"downloads": -1,
"filename": "skewsentry-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "2db31f8c37131bfd5f0af4330e2f3d45",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 25982,
"upload_time": "2025-08-28T22:38:27",
"upload_time_iso_8601": "2025-08-28T22:38:27.118597Z",
"url": "https://files.pythonhosted.org/packages/95/e3/6f60c81c34e51affb880d51e53d105270b6c3ca5575582d44e52fcaf0dec/skewsentry-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-28 22:38:27",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "skewsentry"
}