## Stable CART: Lower Cross-Bootstrap Prediction Variance

[![Python application](https://github.com/finite-sample/stable-cart/actions/workflows/ci.yml/badge.svg)](https://github.com/finite-sample/stable-cart/actions/workflows/ci.yml)
[![PyPI version](https://img.shields.io/pypi/v/stable-cart.svg)](https://pypi.org/project/stable-cart/)
[![Downloads](https://pepy.tech/badge/stable-cart)](https://pepy.tech/project/stable-cart)
[![Documentation](https://github.com/finite-sample/stable-cart/actions/workflows/docs.yml/badge.svg)](https://finite-sample.github.io/stable-cart/)
[![License](https://img.shields.io/github/license/finite-sample/stable-cart)](https://github.com/finite-sample/stable-cart/blob/main/LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

A scikit-learn compatible implementation of **Stable CART** (Classification and Regression Trees) with advanced stability metrics and techniques to reduce prediction variance.

## Features

- 🌳 **Unified Tree Architecture**: All trees support both regression and classification with a simple `task` parameter
- 🎯 **LessGreedyHybridTree**: Advanced tree with honest data partitioning, lookahead, and optional oblique splits
- 📊 **BootstrapVariancePenalizedTree**: Explicitly penalizes bootstrap prediction variance during split selection
- 🛡️ **RobustPrefixHonestTree**: Robust consensus-based prefix splits with honest leaf estimation
- 📈 **Prediction Stability Metrics**: Measure model consistency across different training runs
- 🔧 **Full sklearn Compatibility**: Works with pipelines, cross-validation, and grid search

## Installation

### From PyPI (Recommended)

```bash
pip install stable-cart
```

### From Source

```bash
git clone https://github.com/finite-sample/stable-cart.git
cd stable-cart
pip install -e .
```

### With Development Dependencies

```bash
pip install -e ".[dev]"
```

## Quick Start

```python
from stable_cart import (
    # Unified trees - all support both regression and classification
    LessGreedyHybridTree, 
    BootstrapVariancePenalizedTree,
    RobustPrefixHonestTree,
    # Evaluation utilities
    prediction_stability, 
    evaluate_models
)
from sklearn.datasets import make_regression, make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

# === UNIFIED ARCHITECTURE ===

# Regression Example
X_reg, y_reg = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# All trees support both tasks with the 'task' parameter
less_greedy = LessGreedyHybridTree(task='regression', max_depth=5, random_state=42)
bootstrap_tree = BootstrapVariancePenalizedTree(
    task='regression', max_depth=5, variance_penalty=2.0, n_bootstrap=10, random_state=42
)
robust_tree = RobustPrefixHonestTree(task='regression', top_levels=2, max_depth=5, random_state=42)
greedy_model = DecisionTreeRegressor(max_depth=5, random_state=42)

# Fit models
for model in [less_greedy, bootstrap_tree, robust_tree, greedy_model]:
    model.fit(X_train, y_train)

# Classification Example with Same Tree Classes
X_clf, y_clf = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.3, random_state=42
)

# Same tree classes, just change the task parameter
less_greedy_clf = LessGreedyHybridTree(task='classification', max_depth=5, random_state=42)
bootstrap_clf = BootstrapVariancePenalizedTree(
    task='classification', max_depth=5, variance_penalty=1.0, n_bootstrap=5, random_state=42
)
robust_clf = RobustPrefixHonestTree(task='classification', top_levels=2, max_depth=5, random_state=42)
standard_clf = DecisionTreeClassifier(max_depth=5, random_state=42)

# Fit classification models
for model in [less_greedy_clf, bootstrap_clf, robust_clf, standard_clf]:
    model.fit(X_train_clf, y_train_clf)

# Evaluate both regression and classification
reg_models = {
    "less_greedy": less_greedy,
    "bootstrap_penalized": bootstrap_tree,
    "robust_prefix": robust_tree,
    "greedy": greedy_model
}

clf_models = {
    "less_greedy": less_greedy_clf,
    "bootstrap_penalized": bootstrap_clf,
    "robust_prefix": robust_clf,
    "standard": standard_clf
}

# Get predictions and probabilities
reg_predictions = {name: model.predict(X_test) for name, model in reg_models.items()}
clf_predictions = {name: model.predict(X_test_clf) for name, model in clf_models.items()}
clf_probabilities = {name: model.predict_proba(X_test_clf) for name, model in clf_models.items() 
                     if hasattr(model, 'predict_proba')}

print("Regression R² scores:")
for name, model in reg_models.items():
    score = model.score(X_test, y_test)
    print(f"  {name}: {score:.3f}")

print("\nClassification accuracy scores:")
for name, model in clf_models.items():
    score = model.score(X_test_clf, y_test_clf)
    print(f"  {name}: {score:.3f}")

```

## Advanced Configuration Examples

### Unified Parameter Interface

All stable-cart trees share a unified parameter interface that exposes the package's stability primitives:

```python
from stable_cart import LessGreedyHybridTree

# Regression with all stability features enabled
advanced_reg_tree = LessGreedyHybridTree(
    # === CORE CONFIGURATION ===
    task='regression',               # 'regression' or 'classification'
    max_depth=6,                    # Maximum tree depth
    min_samples_split=50,           # Minimum samples to split node
    min_samples_leaf=25,            # Minimum samples per leaf
    
    # === HONEST DATA PARTITIONING ===
    split_frac=0.6,                 # Fraction for structure learning
    val_frac=0.2,                   # Fraction for validation
    est_frac=0.2,                   # Fraction for leaf estimation
    enable_stratified_sampling=True, # Balanced honest partitioning
    
    # === OBLIQUE SPLITS (SIGNATURE FEATURE) ===
    enable_oblique_splits=True,     # Enable oblique splits
    oblique_strategy='root_only',   # 'root_only', 'all_levels', 'adaptive'
    oblique_regularization='lasso', # 'lasso', 'ridge', 'elastic_net'
    enable_correlation_gating=True, # Use correlation gating
    min_correlation_threshold=0.3,  # Minimum correlation to trigger oblique
    
    # === LOOKAHEAD SEARCH (SIGNATURE FEATURE) ===
    enable_lookahead=True,          # Enable lookahead search
    lookahead_depth=2,              # Lookahead depth
    beam_width=15,                  # Number of candidates to track
    enable_ambiguity_gating=True,   # Use ambiguity gating
    ambiguity_threshold=0.05,       # Trigger lookahead when splits are close
    min_samples_for_lookahead=800,  # Minimum samples to enable lookahead
    
    # === CROSS-METHOD LEARNING FEATURES ===
    enable_robust_consensus_for_ambiguous=True, # Consensus for ambiguous splits
    consensus_samples=12,                       # Bootstrap samples for consensus
    consensus_threshold=0.7,                    # Agreement threshold
    enable_winsorization=True,                  # Outlier clipping (from RobustPrefix)
    winsor_quantiles=(0.02, 0.98),            # Outlier clipping bounds
    enable_bootstrap_variance_tracking=True,   # Variance tracking (from Bootstrap)
    variance_tracking_samples=10,              # Bootstrap samples for variance
    
    # === LEAF STABILIZATION ===  
    leaf_smoothing=0.1,             # Shrinkage parameter (0=none, higher=more)
    leaf_smoothing_strategy='m_estimate',  # 'm_estimate', 'shrink_to_parent', 'beta_smoothing'
    
    random_state=42
)

# Classification with conservative stability settings
conservative_clf_tree = LessGreedyHybridTree(
    task='classification',
    max_depth=4,                                    # Shallower for more stability
    min_samples_split=60,                           # Higher split threshold
    min_samples_leaf=30,                            # Larger leaves for stability
    leaf_smoothing=0.5,                             # Heavy smoothing
    leaf_smoothing_strategy='m_estimate',           # Bayesian smoothing
    enable_bootstrap_variance_tracking=True,       # Track prediction variance
    enable_robust_consensus_for_ambiguous=True,    # Use consensus for ambiguous splits
    consensus_threshold=0.8,                       # High agreement requirement
    consensus_samples=15,                          # More bootstrap samples
    enable_winsorization=True,                     # Enable outlier protection
    classification_criterion='gini',              # Gini impurity criterion
    random_state=42
)

# Fit and evaluate
advanced_reg_tree.fit(X_train, y_train)
conservative_clf_tree.fit(X_train_clf, y_train_clf)

print(f"Regression R²: {advanced_reg_tree.score(X_test, y_test):.3f}")
print(f"Classification accuracy: {conservative_clf_tree.score(X_test_clf, y_test_clf):.3f}")
```
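
For intuition, the `m_estimate` smoothing strategy follows the classic m-estimate formula, shrinking each leaf's empirical rate toward a prior. A minimal sketch of the standard formula (the library's exact implementation may differ):

```python
import numpy as np

def m_estimate(y_leaf, prior, m=0.5):
    """Classic m-estimate: shrink a leaf's empirical positive rate toward
    a prior. m=0 gives the raw leaf mean; larger m pulls small leaves
    harder toward the prior, stabilizing probability estimates."""
    y_leaf = np.asarray(y_leaf, dtype=float)
    return (y_leaf.sum() + m * prior) / (len(y_leaf) + m)

# A tiny leaf: 2 positives out of 3, global positive rate 0.4
print(m_estimate([1, 1, 0], prior=0.4, m=0.5))  # ~0.63 instead of 0.67
```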

### Stability Measurement

```python
from stable_cart import prediction_stability

# Measure prediction stability across bootstrap samples
stability_results = prediction_stability(
    [advanced_reg_tree, conservative_clf_tree], 
    [X_test, X_test_clf], 
    n_bootstrap=20
)

print("Prediction variance (lower = more stable):")
for model_name, variance in stability_results.items():
    print(f"  {model_name}: {variance:.4f}")
```

## Algorithms

All trees in stable-cart use a **unified architecture** that supports both regression and classification through a simple `task` parameter. This means you can use the same algorithm for both types of problems!

### LessGreedyHybridTree

**🎯 When to use**: When you need stable predictions but can't afford the complexity of ensembles (works for both regression and classification)

**💡 Core intuition**: Like a careful decision-maker who considers multiple options before choosing, rather than going with the first good option. Standard CART makes greedy choices at each split - this algorithm looks ahead and thinks more carefully.

**⚖️ Trade-offs**: 
- ✅ **Gain**: 30-50% more stable predictions across different training runs
- ✅ **Gain**: Better generalization with honest estimation
- ✅ **Gain**: Works for both regression and classification with same API
- ❌ **Cost**: ~5% accuracy reduction, slightly higher training time

**🔧 How it works**:
- **Honest data partitioning**: Separates the data used for structure learning from the data used for prediction estimation (sketched after this list)
- **Lookahead with beam search**: Considers multiple future splits before deciding (not just immediate gain)
- **Optional oblique root**: Can use linear combinations at the top (Lasso for regression, LogisticRegression for classification)
- **Task-adaptive leaf estimation**: Shrinkage for regression, m-estimate smoothing for classification
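
The honest partitioning can be pictured as three disjoint index sets, mirroring the `split_frac`/`val_frac`/`est_frac` parameters shown earlier. A minimal sketch, not the library's internal code:

```python
import numpy as np

def honest_partition(n, split_frac=0.6, val_frac=0.2, seed=None):
    """Disjoint row indices for structure / validation / estimation.

    Structure rows choose the splits, validation rows score lookahead
    candidates, and estimation rows set the leaf values, so no row both
    shapes the tree and estimates its leaves.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_split, n_val = int(split_frac * n), int(val_frac * n)
    return idx[:n_split], idx[n_split:n_split + n_val], idx[n_split + n_val:]

structure_idx, val_idx, est_idx = honest_partition(1000, seed=42)
print(len(structure_idx), len(val_idx), len(est_idx))  # 600 200 200
```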

### BootstrapVariancePenalizedTree

**🎯 When to use**: When prediction consistency is more important than squeezing out every bit of accuracy (both regression and classification)

**💡 Core intuition**: Like choosing a reliable car over a faster but unpredictable one. This algorithm explicitly optimizes for models that give similar predictions even when trained on slightly different data samples.

**⚖️ Trade-offs**:
- ✅ **Gain**: Most consistent predictions across bootstrap samples
- ✅ **Gain**: Excellent for scenarios where you retrain models frequently
- ✅ **Gain**: Unified interface for regression and classification
- ❌ **Cost**: Moderate training time increase due to bootstrap evaluation
- ❌ **Cost**: May sacrifice some accuracy for consistency

**🔧 How it works**:
- **Variance penalty**: During training, penalizes splits that lead to high prediction variance across bootstrap samples (see the sketch after this list)
- **Honest estimation**: Builds tree structure on one data subset, estimates leaf values on another
- **Bootstrap evaluation**: Tests each potential split on multiple bootstrap samples to measure stability
- **Task-adaptive loss**: Uses SSE for regression, Gini/entropy for classification
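
Schematically, the split objective becomes loss plus a penalty on cross-bootstrap disagreement. A toy sketch for a single regression split, with `variance_penalty` and `n_bootstrap` mirroring the constructor parameters above; this is illustrative, not the package's internal code:

```python
import numpy as np

def penalized_split_score(x, y, threshold, variance_penalty=2.0,
                          n_bootstrap=10, seed=None):
    """Toy regression split score: SSE after the split, plus a penalty on
    how much the two leaf means vary across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    left = x <= threshold
    sse = ((y[left] - y[left].mean()) ** 2).sum() + \
          ((y[~left] - y[~left].mean()) ** 2).sum()

    # Re-estimate both leaf means on bootstrap resamples of the node
    means = []
    for _ in range(n_bootstrap):
        b = rng.integers(0, len(y), len(y))
        xb, yb = x[b], y[b]
        lb = xb <= threshold
        if lb.any() and (~lb).any():
            means.append((yb[lb].mean(), yb[~lb].mean()))
    variance = np.var(means, axis=0).sum() if means else 0.0
    return sse + variance_penalty * variance

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x > 0) + rng.normal(0, 0.2, size=200)
print(penalized_split_score(x, y, threshold=0.0, seed=1))
```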

### RobustPrefixHonestTree

**🎯 When to use**: When you need reliable probability estimates and stable decision boundaries (supports both binary classification and regression)

**💡 Core intuition**: Like making the big strategic decisions first with a committee consensus, then fine-tuning details with fresh information. This tree locks in the most important splits using agreement across multiple bootstrap samples, then uses separate data for final estimates.

**⚖️ Trade-offs**:
- ✅ **Gain**: Very stable decision boundaries across different training runs
- ✅ **Gain**: Reliable probability estimates (classification) or predictions (regression)
- ✅ **Gain**: Robust to outliers and data noise
- ✅ **Gain**: Unified API for both regression and classification
- ❌ **Cost**: Limited to binary classification (multi-class support coming soon)
- ❌ **Cost**: May be conservative in capturing complex patterns

**🔧 How it works**:
- **Robust prefix**: Uses multiple bootstrap samples to find splits that consistently matter, then locks those in
- **Honest leaves**: After structure is fixed, estimates values on completely separate data
- **Task-adaptive smoothing**: Shrinkage for regression, m-estimate for classification
- **Winsorization**: Caps extreme feature values to reduce outlier influence
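
Winsorization itself takes only a few lines of numpy. A minimal sketch using the same quantile bounds as the `winsor_quantiles` parameter shown earlier:

```python
import numpy as np

def winsorize(X, quantiles=(0.02, 0.98)):
    """Clip each feature column to its empirical quantile bounds so that
    extreme values cannot dominate split thresholds."""
    lo, hi = np.quantile(X, quantiles, axis=0)
    return np.clip(X, lo, hi)

X = np.random.default_rng(0).normal(size=(500, 3))
X[0, 0] = 100.0                  # inject an outlier
print(winsorize(X)[:, 0].max())  # capped near the 98th percentile
```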

## Choosing the Right Algorithm

### 🤔 Decision Guide

**Start here**: What's your primary concern?

```
🌟 UNIFIED ARCHITECTURE:
├── Need maximum stability? → BootstrapVariancePenalizedTree(task='regression'|'classification')
├── Want balanced stability + flexibility? → LessGreedyHybridTree(task='regression'|'classification')
├── Need robust prefix + reliable estimates? → RobustPrefixHonestTree(task='regression'|'classification')
└── Just need sklearn baseline? → DecisionTreeRegressor/DecisionTreeClassifier
```

**💡 Pro Tip**: All stable-cart trees use the same unified interface with the `task` parameter - switch between regression and classification effortlessly!

### 📋 Use Case Comparison

| Scenario | Best Choice | Why |
|----------|-------------|-----|
| **Financial risk models** | RobustPrefixHonestTree(task='classification') | Stable probability estimates crucial |
| **A/B testing analysis** | BootstrapVariancePenalizedTree(task='regression') | Consistency across samples matters most |
| **Medical diagnosis support** | RobustPrefixHonestTree(task='classification') | Reliable probabilities + robust to outliers |
| **Demand forecasting** | LessGreedyHybridTree(task='regression') | Balance of accuracy + stability |
| **Customer churn prediction** | LessGreedyHybridTree(task='classification') | Stable classification with probability estimates |
| **Real-time recommendations** | Standard CART | Speed over stability |
| **Research/prototyping** | LessGreedyHybridTree(task='regression'/'classification') | Good general-purpose stable option |

### ⚡ Quick Selection Rules

**Choose BootstrapVariancePenalizedTree when**:
- You retrain models frequently with new data
- Prediction consistency is more important than peak accuracy
- You have sufficient training time
- **Works for both**: `task='regression'` or `task='classification'`

**Choose LessGreedyHybridTree when**:
- You want stability without major accuracy loss
- You need a general-purpose stable tree
- Training time is somewhat constrained
- **Works for both**: `task='regression'` or `task='classification'`

**Choose RobustPrefixHonestTree when**:
- You need trustworthy probability estimates (classification) or predictions (regression)
- Your data may have outliers
- You want very stable decision boundaries
- **Works for both**: `task='regression'` or `task='classification'` (binary only for now)

**Stick with Standard CART when**:
- You need maximum speed
- You have very large datasets (>100k samples)
- Stability is not a concern

## Performance Comparison

Here's how stable-cart models typically perform compared to standard trees:

| Metric | Standard Tree | Stable CART | Improvement |
|--------|---------------|-------------|-------------|
| **Prediction Variance** | High | Low | 30-50% reduction |
| **Out-of-sample Stability** | Variable | Consistent | 20-40% more stable |
| **Accuracy** | High | Slightly lower | 2-5% trade-off |
| **Interpretability** | Good | Good | Maintained |
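
The percentages above are indicative rather than guaranteed. The underlying metric, cross-bootstrap prediction variance, can be reproduced with sklearn alone; a hedged sketch (the helper below is illustrative, not the package's `prediction_stability`):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def cross_bootstrap_variance(make_model, X, y, X_eval, n_bootstrap=20, seed=0):
    """Mean per-point variance of predictions across models trained on
    bootstrap resamples; lower values mean more stable predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_bootstrap):
        b = rng.integers(0, len(y), len(y))
        preds.append(make_model().fit(X[b], y[b]).predict(X_eval))
    return np.var(preds, axis=0).mean()

X, y = make_regression(n_samples=800, n_features=10, noise=10, random_state=0)
v = cross_bootstrap_variance(
    lambda: DecisionTreeRegressor(max_depth=5, random_state=0), X, y, X[:100]
)
print(f"CART cross-bootstrap prediction variance: {v:.2f}")
```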

## Development and Testing

### Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=stable_cart

# Run specific test categories
pytest -m "not slow"        # Skip slow tests
pytest -m "benchmark"       # Benchmark tests only
pytest tests/               # All tests
```

### Local CI Testing

Test the CI pipeline locally using Docker:

```bash
# Run the full CI pipeline in a clean Docker container
make ci-docker

# Or run individual steps
make lint        # Check code formatting and style
make test        # Run the test suite
make coverage    # Run tests with coverage report
```

### Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes and add tests
4. Run the test suite (`make test`)
5. Run linting (`make lint`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

### Benchmarking

Run comprehensive benchmarks comparing CART vs stable-CART methods:

```bash
# Quick benchmark (4 key datasets, fast execution)
make quick-benchmark

# Comprehensive benchmark (all datasets)
make benchmark

# Stability-focused benchmark (datasets highlighting variance differences)
make stability-benchmark

# Custom benchmark
python scripts/comprehensive_benchmark.py --datasets friedman1,breast_cancer --models CART,LessGreedyHybrid --quick

# View results
ls benchmark_results/
cat benchmark_results/comprehensive_benchmark_report.md
```

## Citation

If you use stable-cart in your research, please cite:

```bibtex
@software{stable_cart_2025,
  title={Stable CART: Enhanced Decision Trees with Prediction Stability},
  author={Sood, Gaurav and Bhosle, Arav},
  year={2025},
  url={https://github.com/finite-sample/stable-cart},
  version={0.3.0}
}
```

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for a detailed history of changes.

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Related Work

- **CART**: Breiman, L., et al. (1984). Classification and regression trees.
- **Honest Trees**: Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests.
- **Bootstrap Aggregating**: Breiman, L. (1996). Bagging predictors.

            
