| Field | Value |
|---|---|
| Name | step-criterion |
| Version | 0.2.2 |
| Summary | Educational stepwise model selection for statsmodels: AIC, BIC, Adjusted R², and p-values with OLS/GLM support for exploratory analysis. |
| Upload time | 2025-08-21 03:30:23 |
| Requires Python | >=3.9 |
| License | MIT |
| Keywords | statistics, regression, model-selection, stepwise, aic, bic, statsmodels, machine-learning |

# step-criterion

[![Python](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Educational and diagnostic stepwise model selection for statsmodels with multiple criteria: AIC, BIC, Adjusted R², and p-values.**

This package provides a unified, flexible interface for **exploratory stepwise regression** with various selection criteria, supporting both OLS and GLM models with advanced features like interaction terms, transformations, and different statistical tests. **Designed for educational purposes and model exploration, not production model selection.**

## ✨ Key Features

- **🎯 Main Function**: `step_criterion()` - unified interface for all selection methods
- **📊 Multiple Criteria**: AIC, BIC, Adjusted R², and p-value based selection
- **🔧 Convenience Wrappers**: Specialized functions for each criterion
- **📈 Model Support**: OLS and GLM (including logistic, Poisson, etc.)
- **🧮 Advanced Formulas**: Interaction terms, transformations, categorical variables
- **⚡ GLM Flexibility**: Multiple test types (likelihood ratio, Wald)
- **🔇 Clean Output**: Automatic suppression of technical warnings
- **📋 R-like Results**: Familiar ANOVA-style step tables

## ⚠️ Important Statistical Considerations

**This package is designed for educational and exploratory purposes.** Stepwise selection has well-documented statistical limitations that users should understand:

### 🚨 Key Limitations

- **P-value Inflation**: Multiple testing inflates Type I error rates. P-values from stepwise procedures are biased and **should not be used for inference**
- **Overfitting**: Selected models are optimistic and may not generalize well to new data
- **Selection Bias**: Standard confidence intervals and hypothesis tests are invalid after model selection
- **Multiple Comparisons**: The more variables considered, the higher the chance of spurious associations
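
As a rough illustration of the multiple-comparisons point above: if `k` candidate predictors are all pure noise and each is tested independently at level `alpha`, the chance that at least one looks significant is `1 - (1 - alpha)**k`. The helper name `familywise_error_rate` is hypothetical, and independence is an assumption. Real stepwise searches test correlated terms, so treat this as a sketch rather than an exact error rate.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# How quickly spurious "hits" become likely as the candidate pool grows
for k in (1, 5, 10, 20):
    print(f"{k:>2} candidates: {familywise_error_rate(0.05, k):.3f}")
```

With 10 noise candidates the chance of at least one nominally significant term is already about 40%.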

### 🎯 Recommended Uses

- **✅ Educational**: Learning about model selection and variable importance
- **✅ Exploratory Data Analysis**: Initial investigation of relationships
- **✅ Diagnostic**: Understanding which variables might be relevant
- **✅ Hypothesis Generation**: Developing ideas for future confirmatory studies

### ❌ Not Recommended For

- **❌ Confirmatory Analysis**: Final statistical inference or hypothesis testing
- **❌ Production Models**: Automated model selection in production systems
- **❌ P-value Reporting**: Publishing p-values from stepwise-selected models
- **❌ Causal Inference**: Establishing causal relationships

### 📚 Better Alternatives for Production

For reliable inference and model selection, consider:
- **Cross-validation** with penalized regression (LASSO, Ridge, Elastic Net)
- **Information criteria** with proper model averaging
- **Bootstrap procedures** for selection uncertainty
- **Post-selection inference** methods when stepwise is unavoidable
- **Domain knowledge** guided model specification

## 🚀 Installation

```bash
pip install step-criterion
```

## 📖 Quick Start

### Basic Usage

```python
import pandas as pd
import statsmodels.api as sm
from step_criterion import step_criterion

# Load your data
df = pd.read_csv("your_data.csv")

# Perform stepwise selection with BIC
result = step_criterion(
    data=df,
    initial="y ~ 1",  # Start with intercept only
    scope={"upper": "y ~ x1 + x2 + x3 + x1:x2 + I(x1**2)"},
    direction="both",  # Forward and backward steps
    criterion="bic",   # Selection criterion
    trace=1            # Show step-by-step progress
)

# View results
print(result.model.summary())
print("\nStep-by-step path:")
print(result.anova)
```

## 📚 Comprehensive Documentation

### Main Function: `step_criterion()`

**This is the recommended entry point** - a unified interface supporting all selection criteria and model types.

```python
step_criterion(
    data,                    # pandas DataFrame
    initial,                 # Initial formula string
    scope=None,             # Upper/lower bounds for model terms
    direction="both",       # "both", "forward", or "backward"
    criterion="aic",        # "aic", "bic", "adjr2", or "p-value"
    trace=1,                # Verbosity level (0=silent, 1=progress)
    family=None,            # statsmodels family (None=OLS, or sm.families.*)
    glm_test="lr",          # For GLM p-value: "lr" or "wald" ("score"/"gradient" currently fall back to LR with a warning)
    alpha_enter=0.05,       # p-value threshold for entering (p-value criterion)
    alpha_exit=0.10,        # p-value threshold for removal (p-value criterion)
    steps=1000,             # Maximum number of steps
    keep=None,              # Optional function to track custom metrics
    fit_kwargs=None         # Additional arguments passed to model.fit()
)
```

### Selection Criteria

#### 1. **AIC (Akaike Information Criterion)**
```python
# Using main function
result = step_criterion(data=df, initial="y ~ 1", 
                       scope={"upper": "y ~ x1 + x2 + x3"}, 
                       criterion="aic")

# Using convenience wrapper (allows custom k penalty)
from step_criterion import step_aic
result = step_aic(data=df, initial="y ~ 1", 
                  scope={"upper": "y ~ x1 + x2 + x3"}, 
                  k=2.0)  # Standard AIC penalty
```

#### 2. **BIC (Bayesian Information Criterion)**
```python
# BIC automatically uses log(n) penalty
result = step_criterion(data=df, initial="y ~ 1", 
                       scope={"upper": "y ~ x1 + x2 + x3"}, 
                       criterion="bic")

# Convenience wrapper
from step_criterion import step_bic
result = step_bic(data=df, initial="y ~ 1", 
                  scope={"upper": "y ~ x1 + x2 + x3"})
```
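
Both criteria share the form `-2*logL + k*p` and differ only in the per-parameter penalty `k`. The sketch below assumes the usual textbook definitions. `penalized_criterion` and the numbers are hypothetical and not part of this package's API.

```python
import math


def penalized_criterion(loglik: float, n_params: int, k: float) -> float:
    """Generic information criterion: -2*logL + k * (number of parameters)."""
    return -2.0 * loglik + k * n_params


loglik, n_params, n_obs = -120.0, 4, 50  # made-up values for illustration
aic = penalized_criterion(loglik, n_params, k=2.0)              # AIC: k = 2
bic = penalized_criterion(loglik, n_params, k=math.log(n_obs))  # BIC: k = log(n)
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Because `log(n)` exceeds 2 once `n >= 8`, BIC charges more for each extra term than AIC and tends to select smaller models.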

#### 3. **Adjusted R² (OLS only)**
```python
# Maximizes adjusted R-squared
result = step_criterion(data=df, initial="y ~ 1", 
                       scope={"upper": "y ~ x1 + x2 + x3"}, 
                       criterion="adjr2")

# Convenience wrapper
from step_criterion import step_adjr2
result = step_adjr2(data=df, initial="y ~ 1", 
                    scope={"upper": "y ~ x1 + x2 + x3"})
```
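
For reference, a small sketch of the standard adjusted R² formula for a model with an intercept. The function name `adjusted_r2` and the example values are hypothetical. The point is that a predictor must raise R² by enough to offset the lost degree of freedom.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept and n_predictors slopes."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# A weak fourth predictor nudges R-squared up but lowers the adjusted value.
three_terms = adjusted_r2(0.800, n_obs=30, n_predictors=3)
four_terms = adjusted_r2(0.802, n_obs=30, n_predictors=4)
print(f"{three_terms:.4f} vs {four_terms:.4f}")
```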

#### 4. **P-value Based Selection**
```python
# OLS with F-tests
result = step_criterion(data=df, initial="y ~ 1", 
                       scope={"upper": "y ~ x1 + x2 + x3"}, 
                       criterion="p-value",
                       alpha_enter=0.05, alpha_exit=0.10)

# GLM with likelihood ratio tests
result = step_criterion(data=df, initial="y ~ 1", 
                       scope={"upper": "y ~ x1 + x2 + x3"}, 
                       criterion="p-value",
                       family=sm.families.Binomial(),
                       glm_test="lr")

# Convenience wrapper with GLM Wald tests
from step_criterion import step_pvalue
result = step_pvalue(data=df, initial="y ~ 1", 
                     scope={"upper": "y ~ x1 + x2 + x3"},
                     family=sm.families.Binomial(),
                     glm_test="wald")
```
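
The interplay of `alpha_enter` and `alpha_exit` can be sketched as a single decision step. This is one common convention (try a removal first, then an addition) and is an assumption, not a description of this package's internals. `next_move` and its inputs are hypothetical.

```python
from typing import Dict, Optional


def next_move(p_in_model: Dict[str, float], p_candidates: Dict[str, float],
              alpha_enter: float = 0.05, alpha_exit: float = 0.10) -> Optional[str]:
    """Return '- term', '+ term', or None when no move qualifies."""
    if p_in_model:
        worst = max(p_in_model, key=p_in_model.get)  # least significant term in the model
        if p_in_model[worst] > alpha_exit:
            return f"- {worst}"
    if p_candidates:
        best = min(p_candidates, key=p_candidates.get)  # most significant candidate
        if p_candidates[best] < alpha_enter:
            return f"+ {best}"
    return None


print(next_move({"x1": 0.001}, {"x2": 0.03, "x3": 0.40}))
print(next_move({"x1": 0.001, "x2": 0.25}, {"x3": 0.40}))
```

Keeping `alpha_exit` above `alpha_enter` leaves a buffer between the two thresholds, which helps avoid a term being added and then immediately removed.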

### Model Types

#### Ordinary Least Squares (OLS)
```python
# family=None (default) uses OLS
result = step_criterion(
    data=df,
    initial="y ~ 1",
    scope={"upper": "y ~ x1 + x2 + x3"},
    criterion="bic"
)
```

#### Generalized Linear Models (GLM)
```python
import statsmodels.api as sm

# Logistic regression
result = step_criterion(
    data=df,
    initial="binary_outcome ~ 1",
    scope={"upper": "binary_outcome ~ x1 + x2 + x3"},
    criterion="aic",
    family=sm.families.Binomial()
)

# Poisson regression
result = step_criterion(
    data=df,
    initial="count_outcome ~ 1",
    scope={"upper": "count_outcome ~ x1 + x2 + x3"},
    criterion="bic",
    family=sm.families.Poisson()
)

# Gamma regression
result = step_criterion(
    data=df,
    initial="positive_outcome ~ 1",
    scope={"upper": "positive_outcome ~ x1 + x2 + x3"},
    criterion="aic",
    family=sm.families.Gamma()
)
```

### Advanced Formula Syntax

Using [Patsy](https://patsy.readthedocs.io/) formula syntax for complex model specifications:

```python
# Interaction terms
scope = {"upper": "y ~ x1 + x2 + x1:x2"}           # Specific interaction
scope = {"upper": "y ~ x1 * x2"}                   # Main effects + interaction
scope = {"upper": "y ~ (x1 + x2 + x3)**2"}         # All pairwise interactions

# Transformations
scope = {"upper": "y ~ x1 + I(x1**2) + I(x1**3)"}  # Polynomial terms
scope = {"upper": "y ~ x1 + np.log(x2) + np.sqrt(x3)"}  # Math functions

# Categorical variables
scope = {"upper": "y ~ x1 + C(category)"}          # Categorical encoding
scope = {"upper": "y ~ x1 + C(category, Treatment(reference='A'))"}  # Custom reference

# Mixed interactions
scope = {"upper": "y ~ x1 + x2 + C(group) + x1:C(group) + I(x2**2)"}
```
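
To see how quickly a scope grows, the sketch below spells out what `(x1 + x2 + x3)**2` expands to: the main effects plus every two-way interaction. `expand_pairwise` is a hypothetical helper written for illustration. It does not call Patsy.

```python
from itertools import combinations


def expand_pairwise(variables):
    """Main effects plus all two-way interactions, as in '(a + b + c)**2'."""
    main_effects = list(variables)
    interactions = [f"{a}:{b}" for a, b in combinations(variables, 2)]
    return main_effects + interactions


terms = expand_pairwise(["x1", "x2", "x3"])
print("y ~ " + " + ".join(terms))
```

With `p` variables this yields `p + p*(p-1)/2` candidate terms, so ten variables already give 55 terms for the search to consider.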

### GLM Test Options

For GLM models with p-value criterion, choose the appropriate test:

```python
# Likelihood Ratio Test (recommended for most cases)
result = step_criterion(data=df, initial="y ~ 1", criterion="p-value",
                       family=sm.families.Binomial(), glm_test="lr")

# Wald Test (faster, asymptotically equivalent)
result = step_criterion(data=df, initial="y ~ 1", criterion="p-value",
                       family=sm.families.Binomial(), glm_test="wald")

# Score and Gradient tests (currently mapped to LR with warning)
result = step_criterion(data=df, initial="y ~ 1", criterion="p-value",
                       family=sm.families.Binomial(), glm_test="score")
```
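
For intuition, a likelihood ratio test for adding a single parameter compares twice the gain in log-likelihood against a chi-square distribution with 1 degree of freedom. The sketch below uses only the standard library. `lr_test_1df` and the log-likelihood values are hypothetical, and it handles only the 1-df case.

```python
import math


def lr_test_1df(loglik_reduced: float, loglik_full: float):
    """Likelihood-ratio statistic and chi-square p-value for one added parameter."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    # Chi-square survival function with 1 df: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


stat, p = lr_test_1df(loglik_reduced=-250.0, loglik_full=-247.0)
print(f"LR statistic = {stat:.2f}, p = {p:.4f}")
```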

### Model Averaging

Model averaging provides AIC/BIC weights for each model in the stepwise path, allowing you to assess relative model support and account for model uncertainty:

```python
# Enable model averaging with any criterion
result = step_criterion(data=df, initial="y ~ 1", 
                       scope={"upper": "y ~ x1 + x2 + x3"},
                       criterion="aic", model_averaging=True)

# Or use convenience functions
result = step_aic(data=df, initial="y ~ 1", 
                  scope={"upper": "y ~ x1 + x2 + x3"},
                  model_averaging=True)

result = step_bic(data=df, initial="y ~ 1", 
                  scope={"upper": "y ~ x1 + x2 + x3"},
                  model_averaging=True)

# Access the model weights
print(result.model_weights)
#     Model  Score (AIC)     Delta    Weight
# 0  y ~ x1         156.2      0.0     0.524
# 1  y ~ x2         157.8      1.6     0.235
# 2  y ~ x3         159.1      2.9     0.123
# 3  y ~ 1         161.4      5.2     0.039

# Interpret the weights
substantial_support = result.model_weights[result.model_weights['Weight'] > 0.1]
print(f"Models with substantial support: {len(substantial_support)}")
print(f"Top model weight: {result.model_weights['Weight'].iloc[0]:.3f}")
```

**Model weights are calculated as:**
- Δᵢ = criterionᵢ - min(criterion)
- wᵢ = exp(-0.5 × Δᵢ) / Σⱼ exp(-0.5 × Δⱼ)

**Guidelines for interpretation:**
- **Weight > 0.1**: Substantial support
- **Weight > 0.05**: Some support  
- **Weight < 0.05**: Little support

⚠️ **Important**: Weights reflect relative support among models in the stepwise path, not all possible models. Results depend on starting model and search strategy.
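
The weight formula above can be reproduced in a few lines. This is a minimal sketch with made-up scores. `criterion_weights` is a hypothetical helper and not the package's implementation.

```python
import math


def criterion_weights(scores):
    """Convert AIC/BIC scores to weights that sum to 1 (lower score = more weight)."""
    best = min(scores)
    raw = [math.exp(-0.5 * (s - best)) for s in scores]  # exp(-0.5 * delta_i)
    total = sum(raw)
    return [r / total for r in raw]


scores = [100.0, 102.0, 110.0]  # made-up criterion values
weights = criterion_weights(scores)
for s, w in zip(scores, weights):
    print(f"score={s:6.1f}  delta={s - min(scores):4.1f}  weight={w:.3f}")
```

Subtracting the minimum before exponentiating keeps the arithmetic stable and does not change the resulting weights.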

### Direction Options

```python
# Both directions (recommended) - can add and remove terms
result = step_criterion(data=df, initial="y ~ x1", direction="both",
                       scope={"upper": "y ~ x1 + x2 + x3"})

# Forward only - only adds terms
result = step_criterion(data=df, initial="y ~ 1", direction="forward",
                       scope={"upper": "y ~ x1 + x2 + x3"})

# Backward only - only removes terms  
result = step_criterion(data=df, initial="y ~ x1 + x2 + x3", direction="backward",
                       scope={"lower": "y ~ 1"})
```
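
To make the three directions concrete, here is a generic greedy search over sets of terms. It is a simplified sketch: it ignores a `lower` scope, uses a toy score where lower is better, and is not this package's implementation. `greedy_stepwise`, `toy_score`, and `gain` are hypothetical names.

```python
def greedy_stepwise(score, start, candidates, direction="both", max_steps=1000):
    """Greedy search: apply the single add/drop that most lowers `score`."""
    current = frozenset(start)
    best = score(current)
    path = [("", best)]
    for _ in range(max_steps):
        moves = []
        if direction in ("both", "forward"):
            moves += [(f"+ {t}", current | {t}) for t in candidates if t not in current]
        if direction in ("both", "backward"):
            moves += [(f"- {t}", current - {t}) for t in current]
        if not moves:
            break
        label, trial = min(moves, key=lambda m: score(m[1]))
        if score(trial) >= best:
            break  # no single move improves the criterion
        current, best = trial, score(trial)
        path.append((label, best))
    return sorted(current), path


# Toy criterion: each term lowers the score by its gain but costs a penalty of 2.
gain = {"x1": 10.0, "x2": 5.0, "x3": 1.0}


def toy_score(terms):
    return 100.0 - sum(gain[t] for t in terms) + 2.0 * len(terms)


selected, path = greedy_stepwise(toy_score, start=[], candidates=gain)
print(selected)
print(path)
```

Starting from the full set with `direction="backward"` reaches the same two terms here, but in general different starting points and directions can end at different models.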

## 🎯 Convenience Functions

While `step_criterion()` is the main interface, specialized convenience functions are available:

```python
from step_criterion import step_aic, step_bic, step_adjr2, step_pvalue

# AIC with custom penalty
result = step_aic(data=df, initial="y ~ 1", scope={"upper": "y ~ x1 + x2"}, k=2.5)

# BIC (automatic log(n) penalty)
result = step_bic(data=df, initial="y ~ 1", scope={"upper": "y ~ x1 + x2"})

# Adjusted R² (OLS only)
result = step_adjr2(data=df, initial="y ~ 1", scope={"upper": "y ~ x1 + x2"})

# P-value with custom thresholds
result = step_pvalue(data=df, initial="y ~ 1", scope={"upper": "y ~ x1 + x2"},
                     alpha_enter=0.01, alpha_exit=0.05)
```

## 📊 Results and Output

### StepwiseResult Object

All functions return a `StepwiseResult` object with:

```python
result.model     # Final statsmodels Results object
result.anova     # Step-by-step path DataFrame  
result.keep      # Optional custom metrics (if keep function provided)

# Access final model
print(result.model.summary())
print(f"Final AIC: {result.model.aic:.3f}")
print(f"R-squared: {result.model.rsquared:.3f}")

# View selection path
print(result.anova)
```

### Step Path Table (result.anova)

```
     Step     Df   Deviance  Resid. Df  Resid. Dev      AIC
0              NaN       NaN        15   305.619    308.392
1     + GNP    1.0    54.762        14   250.857    256.402
2   + UNEMP    1.0     8.363        13   242.494    250.812
3   + ARMED    1.0     4.177        12   238.317    249.408
4    + YEAR    1.0    18.662        11   219.655    233.518
```
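
The columns of the sample table are linked by simple arithmetic, which the sketch below checks with the printed numbers. Two points are inferences from those numbers rather than documented behavior: the intercept-only row with 15 residual degrees of freedom implies 16 observations, and the last column exceeds `Resid. Dev` by roughly `log(16)` per parameter, which would correspond to a BIC-style penalty shown under the `AIC` heading.

```python
import math

resid_dev = [305.619, 250.857, 242.494, 238.317, 219.655]  # "Resid. Dev" column
criterion = [308.392, 256.402, 250.812, 249.408, 233.518]  # last column
n_obs = 16  # inferred: Resid. Df of 15 for the intercept-only row

# "Deviance" is the drop in residual deviance from the previous row.
drops = [round(a - b, 3) for a, b in zip(resid_dev, resid_dev[1:])]
print(drops)

# Penalty per parameter implied by the table: (criterion - Resid. Dev) / n_params
per_param = [(c - d) / p
             for p, (c, d) in enumerate(zip(criterion, resid_dev), start=1)]
print([f"{x:.4f}" for x in per_param], f"log(n) = {math.log(n_obs):.4f}")
```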

## 🔍 Examples

**Note**: The following examples demonstrate the package's capabilities for **exploratory analysis**. Remember that p-values and model selection results should not be used for confirmatory inference.

### Example 1: Economic Data with Interactions

```python
import pandas as pd
import statsmodels.api as sm
from step_criterion import step_criterion

# Load Longley economic dataset
longley = sm.datasets.longley.load_pandas().data
longley.rename(columns={'TOTEMP': 'employment'}, inplace=True)

# Stepwise with BIC including interactions and polynomials
result = step_criterion(
    data=longley,
    initial="employment ~ 1",
    scope={"upper": "employment ~ GNP + UNEMP + ARMED + POP + YEAR + GNPDEFL + GNP:YEAR + I(GNP**2)"},
    direction="both",
    criterion="bic",
    trace=1
)

print("Final model:")
print(result.model.summary())
```

### Example 2: Logistic Regression for Binary Classification

```python
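import numpy as np
import pandas as pd
import statsmodels.api as sm
from step_criterion import step_criterion

# Simulated medical data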
np.random.seed(42)
n = 1000
data = pd.DataFrame({
    'age': np.random.normal(50, 15, n),
    'bmi': np.random.normal(25, 5, n),
    'cholesterol': np.random.normal(200, 40, n),
    'smoking': np.random.choice([0, 1], n, p=[0.7, 0.3]),
    'exercise': np.random.normal(3, 2, n)  # hours per week
})

# Create outcome with realistic relationships
logit = (-5 + 0.05*data['age'] + 0.1*data['bmi'] + 
         0.01*data['cholesterol'] + 2*data['smoking'] - 0.2*data['exercise'])
data['disease'] = (np.random.random(n) < 1/(1+np.exp(-logit))).astype(int)

# Stepwise logistic regression
result = step_criterion(
    data=data,
    initial="disease ~ 1",
    scope={"upper": "disease ~ age + bmi + cholesterol + smoking + exercise + age:smoking + I(bmi**2)"},
    direction="both",
    criterion="p-value",
    family=sm.families.Binomial(),
    glm_test="lr",
    alpha_enter=0.05,
    alpha_exit=0.10,
    trace=1
)

print("Logistic regression results:")
print(result.model.summary())
```

### Example 3: Comparing Multiple Criteria

```python
import pandas as pd
from step_criterion import step_criterion

# Assumes `df` is already loaded, as in the Quick Start

# Compare different selection criteria
criteria_results = {}

for criterion in ['aic', 'bic', 'adjr2']:
    result = step_criterion(
        data=df,
        initial="y ~ 1",
        scope={"upper": "y ~ x1 + x2 + x3 + x1:x2 + I(x1**2)"},
        criterion=criterion,
        trace=0  # Silent for comparison
    )
    criteria_results[criterion] = {
        'formula': result.model.model.formula,
        'aic': result.model.aic,
        'bic': result.model.bic,
        'rsquared_adj': getattr(result.model, 'rsquared_adj', None),
        'n_params': len(result.model.params)
    }

# Display comparison
comparison_df = pd.DataFrame(criteria_results).T
print("Comparison of selection criteria:")
print(comparison_df)
```

## ⚙️ Advanced Usage

### Custom Metrics Tracking

```python
import numpy as np

def track_metrics(model, score):
    """Custom function to track additional metrics during selection"""
    return {
        'aic': model.aic,
        'bic': model.bic,
        'rsquared': getattr(model, 'rsquared', None),
        'condition_number': np.linalg.cond(model.model.exog)
    }

result = step_criterion(
    data=df,
    initial="y ~ 1",
    scope={"upper": "y ~ x1 + x2 + x3"},
    criterion="bic",
    keep=track_metrics  # Track custom metrics at each step
)

# View tracked metrics
print(result.keep)
```

### Handling Missing Data

```python
# The package works with statsmodels' missing data handling
result = step_criterion(
    data=df_with_missing,
    initial="y ~ 1",
    scope={"upper": "y ~ x1 + x2 + x3"},
    criterion="aic",
    fit_kwargs={'missing': 'drop'}  # statsmodels also accepts 'none' and 'raise'
)
```

## 🛠️ API Reference

### ⚠️ Interpretation Warning

**Results from this package should be interpreted carefully:**
- Use selected models for **exploration and hypothesis generation only**
- **Do not report p-values** from stepwise-selected models as if they were from pre-specified models
- **Confidence intervals and standard errors** are not valid after selection
- **Effect sizes may be inflated** due to selection bias
- Always validate findings with **independent data** or **proper post-selection methods**

### Main Function

- **`step_criterion()`**: Unified stepwise selection interface

### Convenience Functions

- **`step_aic()`**: AIC-based selection with custom penalty parameter
- **`step_bic()`**: BIC-based selection  
- **`step_adjr2()`**: Adjusted R²-based selection (OLS only)
- **`step_pvalue()`**: P-value based selection with test options

### Return Object

- **`StepwiseResult`**: Container with `model`, `anova`, and optional `keep` attributes, plus `model_weights` when `model_averaging=True` is used

## 🔧 Dependencies

- Python ≥ 3.9
- pandas ≥ 1.5
- numpy ≥ 1.23
- statsmodels ≥ 0.13

## 📄 License

MIT License - see LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/MrRolie/step-criterion/issues)
- **Documentation**: This README and inline docstrings
- **Examples**: See `examples_usage.ipynb` in the repository

## 🔄 Version History

- **0.1.0**: Initial release with comprehensive stepwise selection support

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "step-criterion",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "statistics, regression, model-selection, stepwise, aic, bic, statsmodels, machine-learning",
    "author": null,
    "author_email": "Mikael Hillman-Pepin <mikmik1512@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/92/0e/8c5d7428a79d22362876d85c9f6f5c438b4f416fafdf72824980a91bd3b2/step_criterion-0.2.2.tar.gz",
    "platform": null,
    "description": "# step-criterion\n\n[![Python](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n**Educational and diagnostic stepwise model selection for statsmodels with multiple criteria: AIC, BIC, Adjusted R\u00b2, and p-values.**\n\nThis package provides a unified, flexible interface for **exploratory stepwise regression** with various selection criteria, supporting both OLS and GLM models with advanced features like interaction terms, transformations, and different statistical tests. **Designed for educational purposes and model exploration, not production model selection.**\n\n## \u2728 Key Features\n\n- **\ud83c\udfaf Main Function**: `step_criterion()` - unified interface for all selection methods\n- **\ud83d\udcca Multiple Criteria**: AIC, BIC, Adjusted R\u00b2, and p-value based selection\n- **\ud83d\udd27 Convenience Wrappers**: Specialized functions for each criterion\n- **\ud83d\udcc8 Model Support**: OLS and GLM (including logistic, Poisson, etc.)\n- **\ud83e\uddee Advanced Formulas**: Interaction terms, transformations, categorical variables\n- **\u26a1 GLM Flexibility**: Multiple test types (likelihood ratio, Wald)\n- **\ud83d\udd07 Clean Output**: Automatic suppression of technical warnings\n- **\ud83d\udccb R-like Results**: Familiar ANOVA-style step tables\n\n## \u26a0\ufe0f Important Statistical Considerations\n\n**This package is designed for educational and exploratory purposes.** Stepwise selection has well-documented statistical limitations that users should understand:\n\n### \ud83d\udea8 Key Limitations\n\n- **P-value Inflation**: Multiple testing inflates Type I error rates. 
P-values from stepwise procedures are biased and **should not be used for inference**\n- **Overfitting**: Selected models are optimistic and may not generalize well to new data\n- **Selection Bias**: Standard confidence intervals and hypothesis tests are invalid after model selection\n- **Multiple Comparisons**: The more variables considered, the higher the chance of spurious associations\n\n### \ud83c\udfaf Recommended Uses\n\n- **\u2705 Educational**: Learning about model selection and variable importance\n- **\u2705 Exploratory Data Analysis**: Initial investigation of relationships  \n- **\u2705 Diagnostic**: Understanding which variables might be relevant\n- **\u2705 Hypothesis Generation**: Developing ideas for future confirmatory studies\n\n### \u274c Not Recommended For\n\n- **\u274c Confirmatory Analysis**: Final statistical inference or hypothesis testing\n- **\u274c Production Models**: Automated model selection in production systems\n- **\u274c P-value Reporting**: Publishing p-values from stepwise-selected models\n- **\u274c Causal Inference**: Establishing causal relationships\n\n### \ud83d\udcda Better Alternatives for Production\n\nFor reliable inference and model selection, consider:\n- **Cross-validation** with penalized regression (LASSO, Ridge, Elastic Net)\n- **Information criteria** with proper model averaging\n- **Bootstrap procedures** for selection uncertainty\n- **Post-selection inference** methods when stepwise is unavoidable\n- **Domain knowledge** guided model specification\n\n## \ud83d\ude80 Installation\n\n```bash\npip install step-criterion\n```\n\n## \ud83d\udcd6 Quick Start\n\n### Basic Usage\n\n```python\nimport pandas as pd\nimport statsmodels.api as sm\nfrom step_criterion import step_criterion\n\n# Load your data\ndf = pd.read_csv(\"your_data.csv\")\n\n# Perform stepwise selection with BIC\nresult = step_criterion(\n    data=df,\n    initial=\"y ~ 1\",  # Start with intercept only\n    scope={\"upper\": \"y ~ x1 + x2 + x3 + 
x1:x2 + I(x1**2)\"},\n    direction=\"both\",  # Forward and backward steps\n    criterion=\"bic\",   # Selection criterion\n    trace=1            # Show step-by-step progress\n)\n\n# View results\nprint(result.model.summary())\nprint(\"\\nStep-by-step path:\")\nprint(result.anova)\n```\n\n## \ud83d\udcda Comprehensive Documentation\n\n### Main Function: `step_criterion()`\n\n**This is the recommended entry point** - a unified interface supporting all selection criteria and model types.\n\n```python\nstep_criterion(\n    data,                    # pandas DataFrame\n    initial,                 # Initial formula string\n    scope=None,             # Upper/lower bounds for model terms\n    direction=\"both\",       # \"both\", \"forward\", or \"backward\"\n    criterion=\"aic\",        # \"aic\", \"bic\", \"adjr2\", or \"p-value\"\n    trace=1,                # Verbosity level (0=silent, 1=progress)\n    family=None,            # statsmodels family (None=OLS, or sm.families.*)\n    glm_test=\"lr\",          # For GLM p-value: \"lr\", \"wald\", \"score\", \"gradient\"\n    alpha_enter=0.05,       # p-value threshold for entering (p-value criterion)\n    alpha_exit=0.10,        # p-value threshold for removal (p-value criterion)\n    steps=1000,             # Maximum number of steps\n    keep=None,              # Optional function to track custom metrics\n    fit_kwargs=None         # Additional arguments passed to model.fit()\n)\n```\n\n### Selection Criteria\n\n#### 1. **AIC (Akaike Information Criterion)**\n```python\n# Using main function\nresult = step_criterion(data=df, initial=\"y ~ 1\", \n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"}, \n                       criterion=\"aic\")\n\n# Using convenience wrapper (allows custom k penalty)\nfrom step_criterion import step_aic\nresult = step_aic(data=df, initial=\"y ~ 1\", \n                  scope={\"upper\": \"y ~ x1 + x2 + x3\"}, \n                  k=2.0)  # Standard AIC penalty\n```\n\n#### 2. 
**BIC (Bayesian Information Criterion)**\n```python\n# BIC automatically uses log(n) penalty\nresult = step_criterion(data=df, initial=\"y ~ 1\", \n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"}, \n                       criterion=\"bic\")\n\n# Convenience wrapper\nfrom step_criterion import step_bic\nresult = step_bic(data=df, initial=\"y ~ 1\", \n                  scope={\"upper\": \"y ~ x1 + x2 + x3\"})\n```\n\n#### 3. **Adjusted R\u00b2 (OLS only)**\n```python\n# Maximizes adjusted R-squared\nresult = step_criterion(data=df, initial=\"y ~ 1\", \n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"}, \n                       criterion=\"adjr2\")\n\n# Convenience wrapper\nfrom step_criterion import step_adjr2\nresult = step_adjr2(data=df, initial=\"y ~ 1\", \n                    scope={\"upper\": \"y ~ x1 + x2 + x3\"})\n```\n\n#### 4. **P-value Based Selection**\n```python\n# OLS with F-tests\nresult = step_criterion(data=df, initial=\"y ~ 1\", \n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"}, \n                       criterion=\"p-value\",\n                       alpha_enter=0.05, alpha_exit=0.10)\n\n# GLM with likelihood ratio tests\nresult = step_criterion(data=df, initial=\"y ~ 1\", \n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"}, \n                       criterion=\"p-value\",\n                       family=sm.families.Binomial(),\n                       glm_test=\"lr\")\n\n# Convenience wrapper with GLM Wald tests\nfrom step_criterion import step_pvalue\nresult = step_pvalue(data=df, initial=\"y ~ 1\", \n                     scope={\"upper\": \"y ~ x1 + x2 + x3\"},\n                     family=sm.families.Binomial(),\n                     glm_test=\"wald\")\n```\n\n### Model Types\n\n#### Ordinary Least Squares (OLS)\n```python\n# family=None (default) uses OLS\nresult = step_criterion(\n    data=df,\n    initial=\"y ~ 1\",\n    scope={\"upper\": \"y ~ x1 + x2 + x3\"},\n    
criterion=\"bic\"\n)\n```\n\n#### Generalized Linear Models (GLM)\n```python\nimport statsmodels.api as sm\n\n# Logistic regression\nresult = step_criterion(\n    data=df,\n    initial=\"binary_outcome ~ 1\",\n    scope={\"upper\": \"binary_outcome ~ x1 + x2 + x3\"},\n    criterion=\"aic\",\n    family=sm.families.Binomial()\n)\n\n# Poisson regression\nresult = step_criterion(\n    data=df,\n    initial=\"count_outcome ~ 1\",\n    scope={\"upper\": \"count_outcome ~ x1 + x2 + x3\"},\n    criterion=\"bic\",\n    family=sm.families.Poisson()\n)\n\n# Gamma regression\nresult = step_criterion(\n    data=df,\n    initial=\"positive_outcome ~ 1\",\n    scope={\"upper\": \"positive_outcome ~ x1 + x2 + x3\"},\n    criterion=\"aic\",\n    family=sm.families.Gamma()\n)\n```\n\n### Advanced Formula Syntax\n\nUsing [Patsy](https://patsy.readthedocs.io/) formula syntax for complex model specifications:\n\n```python\n# Interaction terms\nscope = {\"upper\": \"y ~ x1 + x2 + x1:x2\"}           # Specific interaction\nscope = {\"upper\": \"y ~ x1 * x2\"}                   # Main effects + interaction\nscope = {\"upper\": \"y ~ (x1 + x2 + x3)**2\"}         # All pairwise interactions\n\n# Transformations\nscope = {\"upper\": \"y ~ x1 + I(x1**2) + I(x1**3)\"}  # Polynomial terms\nscope = {\"upper\": \"y ~ x1 + np.log(x2) + np.sqrt(x3)\"}  # Math functions\n\n# Categorical variables\nscope = {\"upper\": \"y ~ x1 + C(category)\"}          # Categorical encoding\nscope = {\"upper\": \"y ~ x1 + C(category, Treatment(reference='A'))\"}  # Custom reference\n\n# Mixed interactions\nscope = {\"upper\": \"y ~ x1 + x2 + C(group) + x1:C(group) + I(x2**2)\"}\n```\n\n### GLM Test Options\n\nFor GLM models with p-value criterion, choose the appropriate test:\n\n```python\n# Likelihood Ratio Test (recommended for most cases)\nresult = step_criterion(data=df, initial=\"y ~ 1\", criterion=\"p-value\",\n                       family=sm.families.Binomial(), glm_test=\"lr\")\n\n# Wald Test (faster, 
asymptotically equivalent)\nresult = step_criterion(data=df, initial=\"y ~ 1\", criterion=\"p-value\",\n                       family=sm.families.Binomial(), glm_test=\"wald\")\n\n# Score and Gradient tests (currently mapped to LR with warning)\nresult = step_criterion(data=df, initial=\"y ~ 1\", criterion=\"p-value\",\n                       family=sm.families.Binomial(), glm_test=\"score\")\n```\n\n### Model Averaging\n\nModel averaging provides AIC/BIC weights for each model in the stepwise path, allowing you to assess relative model support and account for model uncertainty:\n\n```python\n# Enable model averaging with any criterion\nresult = step_criterion(data=df, initial=\"y ~ 1\", \n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"},\n                       criterion=\"aic\", model_averaging=True)\n\n# Or use convenience functions\nresult = step_aic(data=df, initial=\"y ~ 1\", \n                  scope={\"upper\": \"y ~ x1 + x2 + x3\"},\n                  model_averaging=True)\n\nresult = step_bic(data=df, initial=\"y ~ 1\", \n                  scope={\"upper\": \"y ~ x1 + x2 + x3\"},\n                  model_averaging=True)\n\n# Access the model weights\nprint(result.model_weights)\n#     Model  Score (AIC)     Delta    Weight\n# 0  y ~ x1         156.2      0.0     0.524\n# 1  y ~ x2         157.8      1.6     0.235\n# 2  y ~ x3         159.1      2.9     0.123\n# 3      y~1        161.4      5.2     0.039\n\n# Interpret the weights\nsubstantial_support = result.model_weights[result.model_weights['Weight'] > 0.1]\nprint(f\"Models with substantial support: {len(substantial_support)}\")\nprint(f\"Top model weight: {result.model_weights['Weight'].iloc[0]:.3f}\")\n```\n\n**Model weights are calculated as:**\n- \u0394\u1d62 = criterion\u1d62 - min(criterion)  \n- w\u1d62 = exp(-0.5 \u00d7 \u0394\u1d62) / \u03a3 exp(-0.5 \u00d7 \u0394\u2c7c)\n\n**Guidelines for interpretation:**\n- **Weight > 0.1**: Substantial support\n- **Weight > 0.05**: Some 
support  \n- **Weight < 0.05**: Little support\n\n\u26a0\ufe0f **Important**: Weights reflect relative support among models in the stepwise path, not all possible models. Results depend on starting model and search strategy.\n\n### Direction Options\n\n```python\n# Both directions (recommended) - can add and remove terms\nresult = step_criterion(data=df, initial=\"y ~ x1\", direction=\"both\",\n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"})\n\n# Forward only - only adds terms\nresult = step_criterion(data=df, initial=\"y ~ 1\", direction=\"forward\",\n                       scope={\"upper\": \"y ~ x1 + x2 + x3\"})\n\n# Backward only - only removes terms  \nresult = step_criterion(data=df, initial=\"y ~ x1 + x2 + x3\", direction=\"backward\",\n                       scope={\"lower\": \"y ~ 1\"})\n```\n\n## \ud83c\udfaf Convenience Functions\n\nWhile `step_criterion()` is the main interface, specialized convenience functions are available:\n\n```python\nfrom step_criterion import step_aic, step_bic, step_adjr2, step_pvalue\n\n# AIC with custom penalty\nresult = step_aic(data=df, initial=\"y ~ 1\", scope={\"upper\": \"y ~ x1 + x2\"}, k=2.5)\n\n# BIC (automatic log(n) penalty)\nresult = step_bic(data=df, initial=\"y ~ 1\", scope={\"upper\": \"y ~ x1 + x2\"})\n\n# Adjusted R\u00b2 (OLS only)\nresult = step_adjr2(data=df, initial=\"y ~ 1\", scope={\"upper\": \"y ~ x1 + x2\"})\n\n# P-value with custom thresholds\nresult = step_pvalue(data=df, initial=\"y ~ 1\", scope={\"upper\": \"y ~ x1 + x2\"},\n                     alpha_enter=0.01, alpha_exit=0.05)\n```\n\n## \ud83d\udcca Results and Output\n\n### StepwiseResult Object\n\nAll functions return a `StepwiseResult` object with:\n\n```python\nresult.model     # Final statsmodels Results object\nresult.anova     # Step-by-step path DataFrame  \nresult.keep      # Optional custom metrics (if keep function provided)\n\n# Access final model\nprint(result.model.summary())\nprint(f\"Final AIC: 
{result.model.aic:.3f}\")\nprint(f\"R-squared: {result.model.rsquared:.3f}\")\n\n# View selection path\nprint(result.anova)\n```\n\n### Step Path Table (result.anova)\n\n```\n     Step     Df   Deviance  Resid. Df  Resid. Dev      AIC\n0              NaN       NaN        15   305.619    308.392\n1     + GNP    1.0    54.762        14   250.857    256.402\n2   + UNEMP    1.0     8.363        13   242.494    250.812\n3   + ARMED    1.0     4.177        12   238.317    249.408\n4    + YEAR    1.0    18.662        11   219.655    233.518\n```\n\n## \ud83d\udd0d Examples\n\n**Note**: The following examples demonstrate the package's capabilities for **exploratory analysis**. Remember that p-values and model selection results should not be used for confirmatory inference.\n\n### Example 1: Economic Data with Interactions\n\n```python\nimport pandas as pd\nimport statsmodels.api as sm\nfrom step_criterion import step_criterion\n\n# Load Longley economic dataset\nlongley = sm.datasets.longley.load_pandas().data\nlongley.rename(columns={'TOTEMP': 'employment'}, inplace=True)\n\n# Stepwise with BIC including interactions and polynomials\nresult = step_criterion(\n    data=longley,\n    initial=\"employment ~ 1\",\n    scope={\"upper\": \"employment ~ GNP + UNEMP + ARMED + POP + YEAR + GNPDEFL + GNP:YEAR + I(GNP**2)\"},\n    direction=\"both\",\n    criterion=\"bic\",\n    trace=1\n)\n\nprint(\"Final model:\")\nprint(result.model.summary())\n```\n\n### Example 2: Logistic Regression for Binary Classification\n\n```python\nimport numpy as np\nimport pandas as pd\nimport statsmodels.api as sm\nfrom step_criterion import step_criterion\n\n# Simulated medical data\nnp.random.seed(42)\nn = 1000\ndata = pd.DataFrame({\n    'age': np.random.normal(50, 15, n),\n    'bmi': np.random.normal(25, 5, n),\n    'cholesterol': np.random.normal(200, 40, n),\n    'smoking': np.random.choice([0, 1], n, p=[0.7, 0.3]),\n    'exercise': np.random.normal(3, 2, n)  # hours per week\n})\n\n# Create outcome with realistic relationships\nlogit = (-5 + 0.05*data['age'] + 0.1*data['bmi'] + \n         
0.01*data['cholesterol'] + 2*data['smoking'] - 0.2*data['exercise'])\ndata['disease'] = (np.random.random(n) < 1/(1+np.exp(-logit))).astype(int)\n\n# Stepwise logistic regression\nresult = step_criterion(\n    data=data,\n    initial=\"disease ~ 1\",\n    scope={\"upper\": \"disease ~ age + bmi + cholesterol + smoking + exercise + age:smoking + I(bmi**2)\"},\n    direction=\"both\",\n    criterion=\"p-value\",\n    family=sm.families.Binomial(),\n    glm_test=\"lr\",\n    alpha_enter=0.05,\n    alpha_exit=0.10,\n    trace=1\n)\n\nprint(\"Logistic regression results:\")\nprint(result.model.summary())\n```\n\n### Example 3: Comparing Multiple Criteria\n\n```python\nimport pandas as pd\nfrom step_criterion import step_criterion\n\n# Compare different selection criteria\n# (df is a DataFrame with columns y, x1, x2, x3)\ncriteria_results = {}\n\nfor criterion in ['aic', 'bic', 'adjr2']:\n    result = step_criterion(\n        data=df,\n        initial=\"y ~ 1\",\n        scope={\"upper\": \"y ~ x1 + x2 + x3 + x1:x2 + I(x1**2)\"},\n        criterion=criterion,\n        trace=0  # Silent for comparison\n    )\n    criteria_results[criterion] = {\n        'formula': result.model.model.formula,\n        'aic': result.model.aic,\n        'bic': result.model.bic,\n        'rsquared_adj': getattr(result.model, 'rsquared_adj', None),\n        'n_params': len(result.model.params)\n    }\n\n# Display comparison\ncomparison_df = pd.DataFrame(criteria_results).T\nprint(\"Comparison of selection criteria:\")\nprint(comparison_df)\n```\n\n## \u2699\ufe0f Advanced Usage\n\n### Custom Metrics Tracking\n\n```python\nimport numpy as np\n\ndef track_metrics(model, score):\n    \"\"\"Custom function to track additional metrics during selection\"\"\"\n    return {\n        'aic': model.aic,\n        'bic': model.bic,\n        'rsquared': getattr(model, 'rsquared', None),\n        'condition_number': np.linalg.cond(model.model.exog)\n    }\n\nresult = step_criterion(\n    data=df,\n    initial=\"y ~ 1\",\n    scope={\"upper\": \"y ~ x1 + x2 + x3\"},\n  
  criterion=\"bic\",\n    keep=track_metrics  # Track custom metrics at each step\n)\n\n# View tracked metrics\nprint(result.keep)\n```\n\n### Handling Missing Data\n\n```python\n# The package works with statsmodels' missing data handling\nresult = step_criterion(\n    data=df_with_missing,\n    initial=\"y ~ 1\",\n    scope={\"upper\": \"y ~ x1 + x2 + x3\"},\n    criterion=\"aic\",\n    fit_kwargs={'missing': 'drop'}  # statsmodels also accepts 'raise' and 'none'\n)\n```\n\n## \ud83d\udee0\ufe0f API Reference\n\n### \u26a0\ufe0f Interpretation Warning\n\n**Results from this package should be interpreted carefully:**\n- Use selected models for **exploration and hypothesis generation only**\n- **Do not report p-values** from stepwise-selected models as if they were from pre-specified models\n- **Confidence intervals and standard errors** are not valid after selection\n- **Effect sizes may be inflated** due to selection bias\n- Always validate findings with **independent data** or **proper post-selection methods**\n\n### Main Function\n\n- **`step_criterion()`**: Unified stepwise selection interface\n\n### Convenience Functions\n\n- **`step_aic()`**: AIC-based selection with custom penalty parameter\n- **`step_bic()`**: BIC-based selection  \n- **`step_adjr2()`**: Adjusted R\u00b2-based selection (OLS only)\n- **`step_pvalue()`**: P-value based selection with test options\n\n### Return Object\n\n- **`StepwiseResult`**: Container with `model`, `anova`, and optional `keep` attributes\n\n## \ud83d\udd27 Dependencies\n\n- Python \u2265 3.9\n- pandas \u2265 1.5\n- numpy \u2265 1.23  \n- statsmodels \u2265 0.13\n\n## \ud83d\udcc4 License\n\nMIT License - see LICENSE file for details.\n\n## \ud83e\udd1d Contributing\n\nContributions are welcome! 
Please feel free to submit issues, feature requests, or pull requests.\n\n## \ud83d\udcde Support\n\n- **Issues**: [GitHub Issues](https://github.com/MrRolie/step-criterion/issues)\n- **Documentation**: This README and inline docstrings\n- **Examples**: See `examples_usage.ipynb` in the repository\n\n## \ud83d\udd04 Version History\n\n- **0.1.0**: Initial release with comprehensive stepwise selection support\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Educational stepwise model selection for statsmodels: AIC, BIC, Adjusted R\u00b2, and p-values with OLS/GLM support for exploratory analysis.",
    "version": "0.2.2",
    "project_urls": {
        "Documentation": "https://github.com/MrRolie/step-criterion#readme",
        "Homepage": "https://github.com/MrRolie/step-criterion",
        "Issues": "https://github.com/MrRolie/step-criterion/issues",
        "Repository": "https://github.com/MrRolie/step-criterion"
    },
    "split_keywords": [
        "statistics",
        " regression",
        " model-selection",
        " stepwise",
        " aic",
        " bic",
        " statsmodels",
        " machine-learning"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "11c05cf6926f9ef6b8b094a4f9192252efb241a4aab3aa5e9bd746f33d76e3da",
                "md5": "e420d40886e93e41bb9cc5fcf3bd3f0f",
                "sha256": "1daa73fa218b15d20c756fd7c6d27ae65971b7dcc3ffa135b37ee6bcd2e8fbbe"
            },
            "downloads": -1,
            "filename": "step_criterion-0.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e420d40886e93e41bb9cc5fcf3bd3f0f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 15222,
            "upload_time": "2025-08-21T03:30:22",
            "upload_time_iso_8601": "2025-08-21T03:30:22.001356Z",
            "url": "https://files.pythonhosted.org/packages/11/c0/5cf6926f9ef6b8b094a4f9192252efb241a4aab3aa5e9bd746f33d76e3da/step_criterion-0.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "920e8c5d7428a79d22362876d85c9f6f5c438b4f416fafdf72824980a91bd3b2",
                "md5": "e78c0a95d16efe95a6b173933230e6f2",
                "sha256": "a0ff9f47487e1c82916c5fb0d77a00f31843405314d480000082515607045855"
            },
            "downloads": -1,
            "filename": "step_criterion-0.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "e78c0a95d16efe95a6b173933230e6f2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 22746,
            "upload_time": "2025-08-21T03:30:23",
            "upload_time_iso_8601": "2025-08-21T03:30:23.401695Z",
            "url": "https://files.pythonhosted.org/packages/92/0e/8c5d7428a79d22362876d85c9f6f5c438b4f416fafdf72824980a91bd3b2/step_criterion-0.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-21 03:30:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "MrRolie",
    "github_project": "step-criterion#readme",
    "github_not_found": true,
    "lcname": "step-criterion"
}
        