fastwoe

- **Name**: fastwoe
- **Version**: 0.1.3
- **Summary**: Fast Weight of Evidence (WOE) encoding and inference
- **Homepage**: https://github.com/xRiskLab/fastwoe
- **Upload time**: 2025-09-19 17:03:56
- **Requires Python**: >=3.9, <3.13
- **License**: MIT
- **Keywords**: feature-engineering, machine-learning, statistical-inference, weight-of-evidence, woe
# FastWoe: Fast Weight of Evidence (WOE) encoding and inference

[![CI](https://github.com/xRiskLab/fastwoe/workflows/CI/badge.svg)](https://github.com/xRiskLab/fastwoe/actions)
[![Compatibility](https://github.com/xRiskLab/fastwoe/workflows/Python%20Version%20Compatibility/badge.svg)](https://github.com/xRiskLab/fastwoe/actions)
[![PyPI version](https://img.shields.io/pypi/v/fastwoe.svg)](https://pypi.org/project/fastwoe/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![scikit-learn 1.3.0+](https://img.shields.io/badge/sklearn-1.3.0+-orange.svg)](https://scikit-learn.org/)
[![PyPI downloads](https://img.shields.io/pypi/dm/fastwoe.svg)](https://pypi.org/project/fastwoe/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

FastWoe is a Python library for efficient **Weight of Evidence (WOE)** encoding of categorical features and statistical inference. It's designed for machine learning practitioners seeking robust, interpretable feature engineering and likelihood-ratio-based inference for binary classification problems.

![FastWoe](https://github.com/xRiskLab/fastwoe/raw/main/ims/title.png)

## ๐ŸŒŸ Key Features

- **Fast WOE Encoding**: Leverages scikit-learn's `TargetEncoder` for efficient computation
- **Statistical Confidence Intervals**: Provides standard errors and confidence intervals for WOE values
- **Cardinality Control**: Built-in preprocessing to handle high-cardinality categorical features
- **Intelligent Numerical Binning**: Support for traditional binning, decision tree-based binning, and FAISS KMeans clustering
- **Binning Summaries**: Feature-level binning statistics including Gini score and Information Value (IV)
- **Compatible with scikit-learn**: Follows scikit-learn's preprocessing transformer interface
- **Uncertainty Quantification**: Combines Alan Turing's factor principle with Maximum Likelihood theory (see [paper](docs/woe_standard_errors.md))

## ๐ŸŽฒ What is Weight of Evidence?

![Weight of Evidence](https://github.com/xRiskLab/fastwoe/raw/main/ims/weight_of_evidence.png)

Weight of Evidence (WOE) is a statistical technique that:

- Transforms discrete features into logarithmic scores
- Measures the strength of relationship between feature categories and true labels
- Provides interpretable coefficients as weights in logistic regression models
- Handles missing values and rare categories gracefully

**Mathematical Definition:**
```
WOE = ln(P(Event|Category) / P(Non-Event|Category)) - ln(P(Event) / P(Non-Event))
```

Where WOE represents the log-odds difference between a category and the overall population.
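
Plugging toy counts into this definition gives a quick sanity check (a hand computation with made-up counts, not library code):

```python
import numpy as np

# Toy counts for one category vs. the whole sample (illustrative only)
events_cat, nonevents_cat = 30, 70     # events / non-events within the category
events_all, nonevents_all = 100, 400   # events / non-events overall

woe = np.log(events_cat / nonevents_cat) - np.log(events_all / nonevents_all)
print(f"WOE = {woe:.4f}")  # ln(30/70) - ln(100/400) ≈ 0.5390
```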

## ๐Ÿš€ Installation

> [!IMPORTANT]
> FastWoe requires Python 3.9+ and scikit-learn 1.3.0+ for TargetEncoder support.

### From PyPI (Recommended)
```bash
pip install fastwoe
```

๐Ÿ“ฆ **View on PyPI**: [https://pypi.org/project/fastwoe/](https://pypi.org/project/fastwoe/)

### Optional Dependencies

#### FAISS KMeans Binning
For FAISS KMeans clustering-based binning (see [Numerical Feature Binning](#numerical-feature-binning)):
```bash
pip install fastwoe[faiss]
```

For GPU acceleration support:
```bash
pip install faiss-gpu  # Requires CUDA
```

> **โš ๏ธ Important**: If you get `ImportError: FAISS is required for faiss_kmeans binning method`, you need to install the `[faiss]` extras. See [FAISS Troubleshooting Guide](FAISS_TROUBLESHOOTING.md) for detailed solutions.

> [!NOTE]
> **FAISS Compatibility**: The `fastwoe[faiss]` installation automatically installs `faiss-cpu>=1.12.0` which supports Python 3.7-3.12 and is compatible with both NumPy 1.x and 2.x. If you encounter import errors with FAISS, ensure you're using a compatible NumPy version. For NumPy 2.x, use `faiss-cpu>=1.12.0`; for older NumPy versions, any `faiss-cpu>=1.7.0` should work.

### From Source
```bash
git clone https://github.com/xRiskLab/fastwoe.git
cd fastwoe
pip install -e .
```

### Development Installation
```bash
git clone https://github.com/xRiskLab/fastwoe.git
cd fastwoe
pip install -e ".[dev]"
```

> [!TIP]
> For development work, we recommend using `uv` for faster package management:
> ```bash
> uv sync --dev
> ```

## ๐Ÿ“– Quick Start

![FastWoe](https://github.com/xRiskLab/fastwoe/raw/main/ims/fastwoe.png)

```python
import pandas as pd
import numpy as np
from fastwoe import FastWoe, WoePreprocessor

# Create sample data
data = pd.DataFrame({
    'category': ['A', 'B', 'C'] * 100 + ['D'] * 50,
    'high_card_cat': [f'cat_{i}' for i in np.random.randint(0, 50, 350)],
    'target': np.random.binomial(1, 0.3, 350)
})

# Step 1: Preprocess high-cardinality features (optional)
preprocessor = WoePreprocessor(max_categories=10, min_count=5)
X_preprocessed = preprocessor.fit_transform(
    data[['category', 'high_card_cat']],
    cat_features=['high_card_cat']  # Only preprocess this column
)

# Step 2: Apply WOE encoding
woe_encoder = FastWoe()
X_woe = woe_encoder.fit_transform(X_preprocessed, data['target'])

print("WOE-encoded features:")
print(X_woe.head())

# Step 3: Get detailed mappings with statistics
mapping = woe_encoder.get_mapping('category')
print("\nWOE Mapping for 'category':")
print(mapping[['category', 'count', 'event_rate', 'woe', 'woe_se']])
```

## ๐Ÿ”ง Advanced Usage

> [!CAUTION]
> When making inferences with the `predict_proba` and `predict_ci` methods, we make the (naive) assumption that the pieces of evidence are independent.
> The sum of WOE scores produces meaningful probabilistic outputs only if features are not strongly correlated and categories are not so granular that they contain very few observations.
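
Conceptually, under this independence assumption the probability output behaves like a naive Bayes update: the prior log-odds plus the sum of per-feature WOE scores, passed through a sigmoid. A minimal sketch of the idea (illustrative values; not FastWoe's exact implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Prior log-odds from the overall event rate, plus one WOE score per feature
prior_log_odds = np.log(0.3 / 0.7)          # e.g., a 30% overall event rate
woe_scores = np.array([0.54, -0.20, 0.10])  # hypothetical per-feature WOE values

posterior = sigmoid(prior_log_odds + woe_scores.sum())
print(f"Implied event probability: {posterior:.3f}")
```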

### Probability Predictions

```python
# Get predictions with Naive Bayes classification
preds = woe_encoder.predict_proba(X_preprocessed)[:, 1]
print(preds.mean())
```

### Confidence Intervals

> [!NOTE]
> Statistical confidence intervals help assess the reliability of WOE estimates, especially for categories with small sample sizes.

```python
# Get predictions with confidence intervals
ci_results = woe_encoder.predict_ci(X_preprocessed, alpha=0.05)
print(ci_results[['prediction', 'lower_ci', 'upper_ci']].head())
```

### Feature Statistics
```python
# Get comprehensive feature statistics
feature_stats = woe_encoder.get_feature_stats()
print(feature_stats)
```

### Standardized WOE
```python
# Get Wald scores (standardized log-odds) or use "woe" for raw WOE values
X_standardized = woe_encoder.transform_standardized(X_preprocessed, output='wald')
```
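
For intuition, a Wald score is the WOE estimate divided by its standard error. A minimal sketch of that relationship using the `woe` and `woe_se` columns from `get_mapping` shown in the Quick Start (illustrative; not FastWoe internals):

```python
# Wald score per category: WOE divided by its standard error
mapping = woe_encoder.get_mapping('category')
print((mapping['woe'] / mapping['woe_se']).round(3))
```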

### Numerical Feature Binning

FastWoe supports three methods for binning numerical features:

#### 1. Histogram-Based Binning
```python
# Use KBinsDiscretizer with quantile strategy
woe_encoder = FastWoe(
    binning_method="kbins",
    binner_kwargs={
        "n_bins": 5,
        "strategy": "quantile",  # or "uniform", "kbins"
        "encode": "ordinal"
    }
)
```

#### 2. Decision Tree-Based Binning
```python
# Use single decision tree to find optimal splits
woe_encoder = FastWoe(
    binning_method="tree",
    tree_kwargs={
        "max_depth": 3,
        "min_samples_split": 20,
        "min_samples_leaf": 10
    }
)

# Or use a custom tree estimator
from sklearn.tree import ExtraTreeClassifier
woe_encoder = FastWoe(
    binning_method="tree",
    tree_estimator=ExtraTreeClassifier,
    tree_kwargs={"max_depth": 2, "random_state": 42}
)
```
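
To make the tree-derived splits concrete, here is a hedged sketch using plain scikit-learn (not FastWoe's internal binning code) that fits a shallow tree on a single synthetic feature and reads the learned thresholds as candidate bin edges:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic feature with a monotone relationship to a binary target
rng = np.random.default_rng(42)
x = rng.normal(size=500).reshape(-1, 1)
y = (x.ravel() + rng.normal(scale=0.5, size=500) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(x, y)

# Internal nodes carry split thresholds; leaf nodes are marked with feature = -2
thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]
print(np.sort(thresholds))  # candidate bin edges, optimized for the target
```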

#### 3. FAISS KMeans Binning
```python
# Use FAISS KMeans clustering for efficient binning
# First install FAISS: pip install fastwoe[faiss]
woe_encoder = FastWoe(
    binning_method="faiss_kmeans",
    faiss_kwargs={
        "k": 5,              # Number of clusters
        "niter": 20,         # Number of iterations
        "verbose": False,    # Show progress
        "gpu": False         # Use GPU acceleration (requires faiss-gpu)
    }
)

# Example with GPU acceleration
woe_encoder = FastWoe(
    binning_method="faiss_kmeans",
    faiss_kwargs={
        "k": 8,
        "niter": 50,
        "verbose": True,
        "gpu": True          # pip install faiss-gpu-cu12 for CUDA 12
    }
)
```

**Benefits of FAISS KMeans Binning:**
- **Efficient Clustering**: Uses Facebook's FAISS library for fast KMeans clustering
- **Data-Driven Bins**: Creates bins based on feature value clusters, not quantiles
- **GPU Acceleration**: Optional GPU support for large datasets
- **Scalable**: Optimized for high-dimensional and large-scale data
- **Meaningful Labels**: Generates interpretable bin labels based on cluster centroids
- **Missing Value Handling**: Properly handles missing values in clustering

**Benefits of Tree-Based Binning:**
- **Target-Aware**: Splits are optimized for the target variable
- **Non-Linear Relationships**: Captures complex patterns better than uniform/quantile binning
- **Automatic Bin Count**: Number of bins determined by tree structure
- **Flexible Configuration**: Use any tree estimator with custom hyperparameters

### Pipeline Integration
```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Create a complete pipeline
pipeline = Pipeline([
    ('preprocessor', WoePreprocessor(top_p=0.95, min_count=10)),
    ('woe_encoder', FastWoe()),
    ('classifier', LogisticRegression())
])

# Fit the entire pipeline
pipeline.fit(data[['category', 'high_card_cat']], data['target'])
```

## ๐Ÿ“‹ API Reference

### FastWoe Class

#### Parameters
- `encoder_kwargs` (dict): Additional parameters for sklearn's TargetEncoder
- `random_state` (int): Random state for reproducibility
- `binning_method` (str): Method for numerical binning - "kbins" (default), "tree", or "faiss_kmeans"
- `binner_kwargs` (dict): Parameters for KBinsDiscretizer (when binning_method="kbins")
- `tree_estimator` (estimator): Custom tree estimator for binning (when binning_method="tree")
- `tree_kwargs` (dict): Parameters for tree estimator
- `faiss_kwargs` (dict): Parameters for FAISS KMeans (when binning_method="faiss_kmeans")

#### Key Methods
- `fit(X, y)`: Fit the WOE encoder
- `transform(X)`: Transform features to WOE values
- `fit_transform(X, y)`: Fit and transform in one step
- `get_mapping(column)`: Get WOE mapping for specific column
- `predict_proba(X)`: Get probability predictions
- `predict_ci(X, alpha)`: Get predictions with confidence intervals

### WoePreprocessor Class

The `WoePreprocessor` is a preprocessing step that reduces the cardinality of categorical features, grouping rare categories so that high-cardinality features can be encoded reliably.

> [!WARNING]
> High-cardinality features (>50 categories) can lead to overfitting and unreliable WOE estimates. Always use WoePreprocessor for such features if you plan to use them in downstream tasks.

#### Parameters
- `max_categories` (int): Maximum categories to keep per feature
- `top_p` (float): Keep the most frequent categories whose cumulative frequency reaches `top_p` (e.g., `0.95`)
- `min_count` (int): Minimum count required for category
- `other_token` (str): Token for grouping rare categories

> [!TIP]
> The `top_p` parameter uses **cumulative frequency** to select categories. For example, `top_p=0.95` keeps categories that together represent 95% of all observations, automatically grouping the long tail of rare categories into `"__other__"`. This is more adaptive than fixed `max_categories` since it preserves the most important categories regardless of their absolute count.

#### Key Methods
- `fit(X, cat_features)`: Fit preprocessor
- `transform(X)`: Apply preprocessing
- `get_reduction_summary(X)`: Get cardinality reduction statistics

**Example: Using `top_p` parameter**
```python
# Dataset with 100 categories:
# "A" (40%), "B" (30%), "C" (15%), "D" (10%), remaining 96 categories (5% total)

preprocessor = WoePreprocessor(top_p=0.95, min_count=5)
# Result: Keeps ["A", "B", "C", "D"] (95% coverage), groups rest as "__other__"
# Reduces 100 → 5 categories while preserving 95% of observations
```
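
One way such a cumulative-frequency rule can be implemented is sketched below (illustrative pandas code, not the `WoePreprocessor` internals):

```python
import pandas as pd

# Toy column matching the example above: A 40%, B 30%, C 15%, D 10%, 5 rare cats
s = pd.Series(['A'] * 40 + ['B'] * 30 + ['C'] * 15 + ['D'] * 10
              + [f'rare_{i}' for i in range(5)])

freq = s.value_counts(normalize=True)
# Keep categories until cumulative frequency reaches top_p (crossing one included)
keep = freq.index[freq.cumsum().shift(fill_value=0.0) < 0.95]
reduced = s.where(s.isin(keep), '__other__')
print(reduced.value_counts())  # A, B, C, D kept; long tail grouped as __other__
```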

### WeightOfEvidence Class

The `WeightOfEvidence` class provides interpretability for FastWoe classifiers with automatic parameter inference and uncertainty quantification through confidence intervals.

#### Parameters
- `classifier` (FastWoe, optional): FastWoe classifier to explain (auto-created if None)
- `X_train` (array-like, optional): Training features (auto-inferred if possible)
- `y_train` (array-like, optional): Training labels (auto-inferred if possible)
- `feature_names` (list, optional): Feature names (auto-inferred if possible)
- `class_names` (list, optional): Class names (auto-inferred if possible)
- `auto_infer` (bool): Enable automatic parameter inference (default=True)

#### Key Methods
- `explain(x, sample_idx=None, class_to_explain=None, true_label=None, return_dict=True)`: Explain single sample or sample from dataset
- `explain_ci(x, sample_idx=None, alpha=0.05, return_dict=True)`: Explain with confidence intervals for uncertainty quantification
- `predict_ci(X, alpha=0.05)`: Batch predictions with confidence bounds
- `summary()`: Get explainer overview and statistics

#### Key Features
- **Auto-Inference**: Automatically detects parameters from FastWoe classifiers
- **Dual Usage**: Support both `explain(sample)` and `explain(dataset, index)` patterns
- **Uncertainty Quantification**: Confidence intervals for WOE scores and probabilities
- **Rich Output**: Human-readable interpretations with evidence strength levels
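
A hedged usage sketch based on the parameters and methods listed above; the import path and `fit` usage are assumptions to verify against your installed version:

```python
# Assumed import path; verify against your installed fastwoe version
from fastwoe import FastWoe, WeightOfEvidence

woe_clf = FastWoe()
woe_clf.fit(X_preprocessed, data['target'])

explainer = WeightOfEvidence(classifier=woe_clf)  # remaining params auto-inferred
print(explainer.summary())                        # explainer overview

# Dual usage: pass a dataset plus an index, per the methods above
explanation = explainer.explain(X_preprocessed, sample_idx=0)
explanation_ci = explainer.explain_ci(X_preprocessed, sample_idx=0, alpha=0.05)
```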

## ๐Ÿ“Š Theoretical Background

![A.M. Turing example](https://github.com/xRiskLab/fastwoe/raw/main/ims/turing_paper.png)

This implementation is based on rigorous statistical theory:

1. **WOE Standard Error**: `SE(WOE) = sqrt(1/good_count + 1/bad_count)`
2. **Confidence Intervals**: Using normal approximation with calculated standard errors
3. **Information Value**: Measures predictive power of each feature
4. **Gini Score**: Derived from AUC to measure discriminatory power

For rare counts, we rely on the rule of three to calculate the standard error.
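
A small worked example of points 1 and 2 with toy counts (the rule-of-three adjustment for zero counts is omitted here):

```python
import numpy as np
from scipy.stats import norm

good_count, bad_count = 70, 30  # toy category counts
woe = 0.539                     # WOE estimate for the category

se = np.sqrt(1.0 / good_count + 1.0 / bad_count)  # point 1: SE(WOE)
z = norm.ppf(1 - 0.05 / 2)                        # ≈ 1.96 for a 95% interval
print(f"WOE = {woe:.3f} ± {z * se:.3f}")          # point 2: normal-approx CI
```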

For technical details, see [Weight of Evidence (WOE), Log Odds, and Standard Errors](docs/woe_standard_errors.md).

![Credit scoring example](https://github.com/xRiskLab/fastwoe/raw/main/ims/credit_example_woe.png)
![I.J. Good](https://github.com/xRiskLab/fastwoe/raw/main/ims/good_bayes_odds.png)

## ๐Ÿงช Testing

Run the test suite:
```bash
# Install test dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=fastwoe --cov-report=html
```

## ๐Ÿ› ๏ธ Development

### Development Setup

Clone the repository and install dependencies:

```bash
git clone https://github.com/xRiskLab/fastwoe.git
cd fastwoe
uv sync --dev
```

### Running Tests

Run the main test suite:
```bash
uv run pytest
```

Run tests without slow compatibility tests:
```bash
uv run pytest -m "not slow"
```

Run compatibility tests across Python/scikit-learn versions (requires `uv`):
```bash
uv run pytest -m compatibility
```

Run specific test categories:
```bash
# Only fast compatibility checks
uv run pytest -m "compatibility and not slow"

# Only slow cross-version tests
uv run pytest -m "compatibility and slow"
```

### Building the Package

Build wheel and source distribution:
```bash
uv build
```

Install from local build:
```bash
uv pip install dist/fastwoe-*.whl
```

Test installation in clean environment:
```bash
# Create temporary environment
uv venv .test-env --python 3.9
uv pip install --python .test-env/bin/python dist/fastwoe-*.whl
.test-env/bin/python -c "import fastwoe; print(f'FastWoe {fastwoe.__version__} installed successfully!')"
```

### Code Quality

Format code:
```bash
uv run black fastwoe/ tests/
```

Lint code:
```bash
uv run ruff check fastwoe/ tests/
```

## ๐Ÿ“ˆ Performance Characteristics

- **Memory Efficient**: Uses pandas and numpy for vectorized operations
- **Scalable**: Handles datasets with millions of rows
- **Fast**: Leverages sklearn's optimized TargetEncoder implementation
- **Robust**: Handles edge cases like single categories and missing values

## ๐Ÿ“ Changelog

For release history, see the [CHANGELOG](CHANGELOG.md).

> [!NOTE]
> This package is in beta; the API is not yet considered stable for production use.

## ๐Ÿ”ง Troubleshooting

### FAISS Import Issues

If you encounter FAISS-related import errors, here are common solutions:

**Error: `No module named 'numpy._core'`**
- This occurs when FAISS was compiled against an older NumPy version
- Solution: Upgrade to `faiss-cpu>=1.12.0` which supports Python 3.7-3.12 and both NumPy 1.x and 2.x
- Run: `pip install --upgrade faiss-cpu>=1.12.0`

**Error: `AttributeError: module 'faiss' has no attribute 'KMeans'`**
- This occurs with older FAISS versions or an incorrect attribute name (the class is `faiss.Kmeans`, with a lowercase "m")
- Solution: The latest `fastwoe[faiss]` installation handles this automatically
- If using FAISS directly, import as: `from faiss.extra_wrappers import Kmeans`

**Error: `A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x`**
- This occurs when FAISS was compiled against NumPy 1.x but you're using NumPy 2.x
- Solution: Use `faiss-cpu>=1.12.0` which supports Python 3.7-3.12 and both NumPy versions
- Or downgrade NumPy: `pip install "numpy<2.0"`
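
To check which NumPy and FAISS versions are actually installed:

```python
import numpy
import faiss

# Quick sanity check of the installed versions
print("numpy:", numpy.__version__)
print("faiss:", faiss.__version__)
```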

### Verification

To verify FAISS is working correctly:
```python
from fastwoe import FastWoe
import pandas as pd
import numpy as np

# Test FAISS functionality
X = pd.DataFrame({'feature': np.random.randn(100)})
y = np.random.randint(0, 2, 100)

woe = FastWoe(binning_method='faiss_kmeans', faiss_kwargs={'k': 3})
woe.fit(X, y)  # Should work without errors
```

## ๐Ÿ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## ๐Ÿ“š References

1. Alan M. Turing (1942). The Applications of Probability to Cryptography.
2. I. J. Good (1950). Probability and the Weighing of Evidence.
3. Daniele Micci-Barreca (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
4. Naeem Siddiqi (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring.

## ๐Ÿ”— Other Projects

- [scikit-learn](https://scikit-learn.org/): Python machine learning library providing the `TargetEncoder` implementation
- [category_encoders](https://contrib.scikit-learn.org/category_encoders/): Additional categorical encoding methods
- [WoeBoost](https://github.com/xRiskLab/woeboost): Weight of Evidence (WOE) Gradient Boosting in Python

## โ„น๏ธ Additional Information

- **Documentation**: [README](README.md) and [Theoretical Background](docs/woe_standard_errors.md)
- **Examples**: See [examples/](examples/) directory

            
