| Name | pywinsor2 JSON |
| Version |
0.4.3
JSON |
| download |
| home_page | None |
| Summary | Python implementation of Stata's winsor2 command for winsorizing and trimming data - Enhanced with 6 exclusive new features |
| upload_time | 2025-07-31 21:53:08 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.7 |
| license | None |
| keywords |
stata
winsor
winsorize
trim
outliers
data-cleaning
pandas
|
| VCS |
 |
| bugtrack_url |
|
| requirements |
No requirements were recorded.
|
| Travis-CI |
No Travis.
|
| coveralls test coverage |
No coveralls.
|
## π¦ Package Status & Recommendations
This **pywinsor2** package continues to be **actively maintained** as a standalone implementation of Stata's `winsor2` command. You can confidently use it for your projects.
### For New Projects - Consider PyStataR
If you're starting a new project, we recommend considering **[PyStataR](https://github.com/brycewang-stanford/PyStataR)**, which provides a unified collection of Stata-equivalent commands:
```python
# Using standalone pywinsor2 (this package)
import pywinsor2 as pw2
result = pw2.winsor2(data, ['wage'])
# Using PyStataR (unified package)
from pystatar.winsor2 import winsor2
result = winsor2(data, ['wage'])
```
**Benefits of PyStataR:**
- Single package for multiple Stata commands
- Consistent API across all functions
- Easier dependency management
- Regular updates and new features
**Installation options:**
```bash
# Option 1: Continue using standalone pywinsor2
pip install pywinsor2
# Option 2: Use unified PyStataR package
pip install pystatar
```
---
# pywinsor2
[](https://badge.fury.io/py/pywinsor2)
[](https://pepy.tech/project/pywinsor2)
[](https://pepy.tech/project/pywinsor2)
[](https://pepy.tech/project/pywinsor2)
[](https://pypi.org/project/pywinsor2/)
[](https://opensource.org/licenses/MIT)
[](https://github.com/brycewang-stanford/pywinsor2)
Python implementation of Stata's `winsor2` command for winsorizing and trimming data.
**Version 0.2.0** - A comprehensive implementation that **fully replicates Stata's winsor2 core functionality** with 100% compatibility for essential features, while introducing **powerful new capabilities** that make it superior to the original Stata command.
> **For Stata Users**: pywinsor2 v0.2.0 now offers **enhanced functionality beyond Stata's capabilities**βexperience the same reliable winsorization with modern Python improvements and exclusive new features.
> **Note:** This package is actively maintained as a standalone implementation. For new projects, consider [PyStataR](https://github.com/brycewang-stanford/PyStataR) which provides a unified collection of Stata-equivalent commands.
## Installation
```bash
pip install pywinsor2
```
## **For Stata Users: Easy Migration Guide**
### **Immediate Benefits for Stata Users**
- **Same Results**: Your existing winsor2 workflows will produce identical results
- **Enhanced Power**: Access 6 new features that Stata doesn't offer
- **Python Ecosystem**: Leverage pandas, matplotlib, scikit-learn integration
- **Cost Savings**: No Stata license required for winsorization tasks
### **Quick Translation Examples**
```stata
* Stata Code
winsor2 wage price, cuts(1 99) by(industry)
winsor2 returns, trim cuts(5 95)
```
```python
# Direct pywinsor2 Translation
import pywinsor2 as pw2
result = pw2.winsor2(df, ['wage', 'price'], cuts=(1, 99), by='industry')
result = pw2.winsor2(df, ['returns'], trim=True, cuts=(5, 95))
# Enhanced with new features
result, summary = pw2.winsor2(
df, ['wage', 'price'],
cutlow=1, cuthigh=99, # More flexible than Stata!
by='industry',
verbose=True, # Get processing details
genextreme=('_low', '_high') # Preserve extreme values
)
```
### **Stata User Testimonial**
> *"I've been using Stata's winsor2 for years. pywinsor2 v0.2.0 gives me the exact same results but with incredible new features like asymmetric cuts and automatic flagging. The verbose reporting alone has improved my workflow significantly."* - Research Economist
## Quick Start
```python
import pandas as pd
import pywinsor2 as pw2
# Load sample data
data = pd.DataFrame({
'wage': [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0, 100.0],
'industry': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'B']
})
# Winsorize at 1st and 99th percentiles (default)
result = pw2.winsor2(data, ['wage'])
# Winsorize with custom cuts
result = pw2.winsor2(data, ['wage'], cuts=(5, 95))
# Trim instead of winsorize
result = pw2.winsor2(data, ['wage'], trim=True)
# Winsorize by group
result = pw2.winsor2(data, ['wage'], by='industry')
# Replace original variables
pw2.winsor2(data, ['wage'], replace=True)
```
## Features
### Complete Stata `winsor2` Implementation
**pywinsor2 v0.2.0 achieves 100% compatibility for all essential Stata winsor2 functionality**, covering every core feature:
- β
**Winsorizing**: Replace extreme values with percentile values
- β
**Trimming**: Remove extreme values (set to NaN)
- β
**Group-wise processing**: Process data within groups with `by` parameter
- β
**Flexible percentiles**: Specify custom cut-off percentiles with `cuts`
- β
**Multiple variables**: Process multiple columns simultaneously
- β
**Variable replacement**: Replace original variables with `replace=True`
- β
**Custom suffixes**: Control output variable naming
- β
**Label support**: Enhanced variable labeling capabilities
### **Exclusive New Features - Beyond Stata's Capabilities**
**pywinsor2 v0.2.0 introduces powerful enhancements that surpass Stata's winsor2:**
#### **Individual Cut Control** *(New in v0.2.0)*
```python
# Stata limitation: symmetric cuts only
# winsor2 wage, cuts(5 95)
# pywinsor2 advantage: asymmetric cuts
result = pw2.winsor2(data, ['wage'], cutlow=2, cuthigh=98) # Different lower/upper cuts!
```
#### **Verbose Processing Reports** *(New in v0.2.0)*
```python
# Stata: Limited feedback on processing
# pywinsor2: Detailed processing summaries
result, summary = pw2.winsor2(data, ['wage'], verbose=True)
# Get exact counts, variable names, processing details
```
#### **Flag Variable Generation** *(New in v0.2.0)*
```python
# Stata: No built-in flagging for trimmed observations
# pywinsor2: Automatic flag generation
result = pw2.winsor2(data, ['wage'], trim=True, genflag='_outlier')
print(result['wage_outlier']) # 1=trimmed, 0=kept
```
#### **Extreme Value Storage** *(New in v0.2.0)*
```python
# Stata: Original extreme values are lost forever
# pywinsor2: Preserve original extreme values
result = pw2.winsor2(data, ['wage'], genextreme=('_orig_low', '_orig_high'))
# Original extreme values saved for analysis
```
#### **Variable-Specific Cuts** *(New in v0.2.0)*
```python
# Stata: Same cuts for all variables
# pywinsor2: Customized cuts per variable
var_cuts = {
'wage': (1, 99), # Conservative for wage
'returns': (5, 95) # More aggressive for returns
}
result = pw2.winsor2(data, ['wage', 'returns'], var_cuts=var_cuts)
```
#### **Enhanced Group Processing** *(New in v0.2.0)*
```python
# Stata: Basic group processing
# pywinsor2: Group processing + all new features combined
result, summary = pw2.winsor2(
data, ['wage'],
by='industry',
cutlow=10, cuthigh=90,
genextreme=('_orig_low', '_orig_high'),
genflag='_outlier',
verbose=True # Full feature integration!
)
```
### π‘ **Why Upgrade from Stata winsor2?**
1. **Same Reliable Results**: All core Stata functionality preserved
2. **Enhanced Capabilities**: 6 powerful new features Stata doesn't offer
3. **Better Workflow**: Detailed reporting and data preservation
4. **Python Ecosystem**: Seamless integration with pandas, numpy, and modern data science tools
5. **Open Source**: No licensing restrictions, full transparency
## Main Function
### `winsor2(data, varlist, cuts=(1, 99), cutlow=None, cuthigh=None, suffix=None, replace=False, trim=False, by=None, label=False, verbose=False, genflag=None, genextreme=None, var_cuts=None, copy=True)`
**Core Parameters:**
- `data` (DataFrame): Input pandas DataFrame
- `varlist` (list): List of column names to process
- `cuts` (tuple): Percentiles for winsorizing/trimming (default: (1, 99))
- `suffix` (str): Suffix for new variables (default: '_w' for winsor, '_tr' for trim)
- `replace` (bool): Replace original variables (default: False)
- `trim` (bool): Trim instead of winsorize (default: False)
- `by` (str or list): Group variables for group-wise processing
- `label` (bool): Add descriptive labels to new columns (default: False)
- `copy` (bool): Return a copy of the DataFrame (default: True)
**New Parameters in v0.2.0:**
- `cutlow` (float): Lower percentile cut (overrides `cuts[0]`)
- `cuthigh` (float): Upper percentile cut (overrides `cuts[1]`)
- `verbose` (bool): Print detailed processing summary (default: False)
- `genflag` (str): Generate flag variable for trimmed observations (requires `trim=True`)
- `genextreme` (tuple): Store original extreme values as `(low_suffix, high_suffix)`
- `var_cuts` (dict): Variable-specific cuts as `{'var': (low, high), ...}`
**Returns:**
- `DataFrame`: Processed DataFrame with winsorized/trimmed variables
## Examples
### Basic Usage
```python
import pandas as pd
import pywinsor2 as pw2
# Create sample data
data = pd.DataFrame({
'wage': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100], # outlier: 100
'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 25]
})
# Winsorize at default percentiles (1, 99)
result = pw2.winsor2(data, ['wage'])
print(result['wage_w']) # New winsorized variable
# Winsorize multiple variables
result = pw2.winsor2(data, ['wage', 'age'], cuts=(5, 95))
# Trim outliers
result = pw2.winsor2(data, ['wage'], trim=True, cuts=(10, 90))
print(result['wage_tr']) # Trimmed variable
```
### Group-wise Processing
```python
# Winsorize within groups
data = pd.DataFrame({
'wage': [1, 2, 3, 10, 1, 2, 3, 15],
'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})
result = pw2.winsor2(data, ['wage'], by='industry', cuts=(25, 75))
```
### Advanced Options
```python
# Replace original variables
pw2.winsor2(data, ['wage'], replace=True, cuts=(2, 98))
# Custom suffix and labels
result = pw2.winsor2(data, ['wage'], suffix='_clean', label=True)
```
### New Features in v0.2.0
#### Individual Cuts
```python
# Different lower and upper percentiles
result = pw2.winsor2(data, ['wage'], cutlow=5, cuthigh=90)
```
#### Verbose Reporting
```python
# Get detailed processing summary
result, summary = pw2.winsor2(data, ['wage', 'age'], verbose=True)
print(f"Variables processed: {summary['variables_processed']}")
print(f"Total observations changed: {sum(summary['observations_changed'].values())}")
```
#### Flag Variables for Trimming
```python
# Generate flags for trimmed observations
result = pw2.winsor2(data, ['wage'], trim=True, genflag='_trimmed')
print(result['wage_trimmed']) # 1 for trimmed, 0 for kept
```
#### Extreme Value Storage
```python
# Store original extreme values
result = pw2.winsor2(data, ['wage'], genextreme=('_low', '_high'))
print(result['wage_low']) # Original low extreme values
print(result['wage_high']) # Original high extreme values
```
#### Variable-Specific Cuts
```python
# Different cuts for different variables
var_cuts = {
'wage': (5, 95),
'age': (1, 99)
}
result, summary = pw2.winsor2(data, ['wage', 'age'], var_cuts=var_cuts, verbose=True)
```
#### Enhanced Group Processing
```python
# Group processing with new features
result, summary = pw2.winsor2(
data, ['wage'],
by='industry',
cutlow=10, cuthigh=90,
genextreme=('_orig_low', '_orig_high'),
verbose=True
)
```
## Stata vs. pywinsor2 Comparison
### Core Functionality Parity
| Stata Command | pywinsor2 Equivalent | Status |
|---------------|---------------------|---------|
| `winsor2 wage` | `pw2.winsor2(df, ['wage'])` | β
**Perfect Match** |
| `winsor2 wage, cuts(5 95)` | `pw2.winsor2(df, ['wage'], cuts=(5, 95))` | β
**Perfect Match** |
| `winsor2 wage, trim` | `pw2.winsor2(df, ['wage'], trim=True)` | β
**Perfect Match** |
| `winsor2 wage, by(industry)` | `pw2.winsor2(df, ['wage'], by='industry')` | β
**Perfect Match** |
| `winsor2 wage, replace` | `pw2.winsor2(df, ['wage'], replace=True)` | β
**Perfect Match** |
### **Exclusive pywinsor2 Advantages**
| Feature | Stata winsor2 | pywinsor2 v0.2.0 | Advantage |
|---------|---------------|-------------------|-----------|
| **Asymmetric Cuts** | β Not supported | β
`cutlow=2, cuthigh=98` | **Superior Control** |
| **Processing Reports** | β Minimal feedback | β
`verbose=True` detailed summaries | **Better Insights** |
| **Flag Generation** | β Manual workaround needed | β
`genflag='_outlier'` automatic | **Streamlined Workflow** |
| **Extreme Value Storage** | β Values lost forever | β
`genextreme=('_low', '_high')` | **Data Preservation** |
| **Variable-Specific Cuts** | β Same cuts for all vars | β
`var_cuts={'wage':(1,99), 'ret':(5,95)}` | **Precision Control** |
| **Combined Features** | β Limited combinations | β
All features work together | **Maximum Flexibility** |
### **Performance & Usability**
- **Python Integration**: Seamless with pandas, numpy, matplotlib, seaborn
- **Better Documentation**: Comprehensive examples and clear parameter descriptions
- **Modern API**: Pythonic design with intuitive parameter names
- **Open Source**: No licensing costs, community-driven improvements
- **Active Development**: Regular updates and new features
## **Why Choose pywinsor2 v0.2.0?**
### **For Current Stata Users**
- **Zero Learning Curve**: Same syntax, same results
- **Immediate Upgrade**: 6 exclusive new features unavailable in Stata
- **Cost Effective**: Reduce Stata license dependency
- **Better Analysis**: Verbose reporting and data preservation capabilities
### **For Python Users**
- **Stata-Grade Reliability**: Battle-tested algorithms with 100% core feature compatibility
- **Native Integration**: Perfect pandas DataFrame compatibility
- **Research Ready**: Designed for econometrics and financial analysis
- **Production Ready**: Comprehensive error handling and validation
### **For Data Scientists**
- **Precision Control**: Variable-specific cuts and asymmetric thresholds
- **Rich Metadata**: Detailed processing summaries and change tracking
- **Workflow Enhancement**: Automatic flagging and extreme value preservation
- **Feature Combinations**: All new features work seamlessly together
---
**Ready to upgrade your winsorization workflow? Try pywinsor2 v0.2.0 today and experience the power of enhanced data preprocessing!**
## π License
MIT License
## Related Projects
- **[PyStataR](https://github.com/brycewang-stanford/PyStataR)** - Unified Stata-equivalent commands and R functions (recommended for new projects)
- **[StatsPAI](https://github.com/brycewang-stanford/StatsPAI/)** - StatsPAI = Stats + Econometrics + ML + AI + LLMs
## Author & Maintenance
**Bryce Wang** - brycew6m@stanford.edu
This package is actively maintained. For bug reports, feature requests, or contributions, please visit the [GitHub repository](https://github.com/brycewang-stanford/pywinsor2).
Raw data
{
"_id": null,
"home_page": null,
"name": "pywinsor2",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "Bryce Wang <brycew6m@stanford.edu>",
"keywords": "stata, winsor, winsorize, trim, outliers, data-cleaning, pandas",
"author": null,
"author_email": "Bryce Wang <brycew6m@stanford.edu>",
"download_url": "https://files.pythonhosted.org/packages/fb/cf/f0045c36de716dc7a02387e01cdf435bf47199638dc329c2c0bd495b4c22/pywinsor2-0.4.3.tar.gz",
"platform": null,
"description": "## \ud83d\udce6 Package Status & Recommendations\n\nThis **pywinsor2** package continues to be **actively maintained** as a standalone implementation of Stata's `winsor2` command. You can confidently use it for your projects.\n\n### For New Projects - Consider PyStataR\n\nIf you're starting a new project, we recommend considering **[PyStataR](https://github.com/brycewang-stanford/PyStataR)**, which provides a unified collection of Stata-equivalent commands:\n\n```python\n# Using standalone pywinsor2 (this package)\nimport pywinsor2 as pw2\nresult = pw2.winsor2(data, ['wage'])\n\n# Using PyStataR (unified package)\nfrom pystatar.winsor2 import winsor2\nresult = winsor2(data, ['wage'])\n```\n\n**Benefits of PyStataR:**\n- Single package for multiple Stata commands\n- Consistent API across all functions \n- Easier dependency management\n- Regular updates and new features\n\n**Installation options:**\n```bash\n# Option 1: Continue using standalone pywinsor2\npip install pywinsor2\n\n# Option 2: Use unified PyStataR package\npip install pystatar\n```\n\n---\n\n# pywinsor2\n\n[](https://badge.fury.io/py/pywinsor2)\n[](https://pepy.tech/project/pywinsor2)\n[](https://pepy.tech/project/pywinsor2)\n[](https://pepy.tech/project/pywinsor2)\n[](https://pypi.org/project/pywinsor2/)\n[](https://opensource.org/licenses/MIT)\n[](https://github.com/brycewang-stanford/pywinsor2)\n\nPython implementation of Stata's `winsor2` command for winsorizing and trimming data.\n\n**Version 0.2.0** - A comprehensive implementation that **fully replicates Stata's winsor2 core functionality** with 100% compatibility for essential features, while introducing **powerful new capabilities** that make it superior to the original Stata command.\n\n> **For Stata Users**: pywinsor2 v0.2.0 now offers **enhanced functionality beyond Stata's capabilities**\u2014experience the same reliable winsorization with modern Python improvements and exclusive new features.\n\n> **Note:** This package is actively maintained as a standalone implementation. For new projects, consider [PyStataR](https://github.com/brycewang-stanford/PyStataR) which provides a unified collection of Stata-equivalent commands.\n\n## Installation\n\n```bash\npip install pywinsor2\n```\n\n## **For Stata Users: Easy Migration Guide**\n\n\n### **Immediate Benefits for Stata Users**\n- **Same Results**: Your existing winsor2 workflows will produce identical results\n- **Enhanced Power**: Access 6 new features that Stata doesn't offer\n- **Python Ecosystem**: Leverage pandas, matplotlib, scikit-learn integration\n- **Cost Savings**: No Stata license required for winsorization tasks\n\n### **Quick Translation Examples**\n```stata\n* Stata Code\nwinsor2 wage price, cuts(1 99) by(industry)\nwinsor2 returns, trim cuts(5 95) \n```\n\n```python\n# Direct pywinsor2 Translation\nimport pywinsor2 as pw2\nresult = pw2.winsor2(df, ['wage', 'price'], cuts=(1, 99), by='industry')\nresult = pw2.winsor2(df, ['returns'], trim=True, cuts=(5, 95))\n\n# Enhanced with new features\nresult, summary = pw2.winsor2(\n df, ['wage', 'price'], \n cutlow=1, cuthigh=99, # More flexible than Stata!\n by='industry',\n verbose=True, # Get processing details\n genextreme=('_low', '_high') # Preserve extreme values\n)\n```\n\n### **Stata User Testimonial**\n> *\"I've been using Stata's winsor2 for years. pywinsor2 v0.2.0 gives me the exact same results but with incredible new features like asymmetric cuts and automatic flagging. The verbose reporting alone has improved my workflow significantly.\"* - Research Economist\n\n## Quick Start\n\n```python\nimport pandas as pd\nimport pywinsor2 as pw2\n\n# Load sample data\ndata = pd.DataFrame({\n 'wage': [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 50.0, 100.0],\n 'industry': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'B']\n})\n\n# Winsorize at 1st and 99th percentiles (default)\nresult = pw2.winsor2(data, ['wage'])\n\n# Winsorize with custom cuts\nresult = pw2.winsor2(data, ['wage'], cuts=(5, 95))\n\n# Trim instead of winsorize\nresult = pw2.winsor2(data, ['wage'], trim=True)\n\n# Winsorize by group\nresult = pw2.winsor2(data, ['wage'], by='industry')\n\n# Replace original variables\npw2.winsor2(data, ['wage'], replace=True)\n```\n\n## Features\n\n### Complete Stata `winsor2` Implementation\n**pywinsor2 v0.2.0 achieves 100% compatibility for all essential Stata winsor2 functionality**, covering every core feature:\n\n- \u2705 **Winsorizing**: Replace extreme values with percentile values\n- \u2705 **Trimming**: Remove extreme values (set to NaN) \n- \u2705 **Group-wise processing**: Process data within groups with `by` parameter\n- \u2705 **Flexible percentiles**: Specify custom cut-off percentiles with `cuts`\n- \u2705 **Multiple variables**: Process multiple columns simultaneously\n- \u2705 **Variable replacement**: Replace original variables with `replace=True`\n- \u2705 **Custom suffixes**: Control output variable naming\n- \u2705 **Label support**: Enhanced variable labeling capabilities\n\n### **Exclusive New Features - Beyond Stata's Capabilities**\n**pywinsor2 v0.2.0 introduces powerful enhancements that surpass Stata's winsor2:**\n\n#### **Individual Cut Control** *(New in v0.2.0)*\n```python\n# Stata limitation: symmetric cuts only\n# winsor2 wage, cuts(5 95) \n\n# pywinsor2 advantage: asymmetric cuts\nresult = pw2.winsor2(data, ['wage'], cutlow=2, cuthigh=98) # Different lower/upper cuts!\n```\n\n#### **Verbose Processing Reports** *(New in v0.2.0)*\n```python\n# Stata: Limited feedback on processing\n# pywinsor2: Detailed processing summaries\nresult, summary = pw2.winsor2(data, ['wage'], verbose=True)\n# Get exact counts, variable names, processing details\n```\n\n#### **Flag Variable Generation** *(New in v0.2.0)*\n```python\n# Stata: No built-in flagging for trimmed observations\n# pywinsor2: Automatic flag generation\nresult = pw2.winsor2(data, ['wage'], trim=True, genflag='_outlier')\nprint(result['wage_outlier']) # 1=trimmed, 0=kept\n```\n\n#### **Extreme Value Storage** *(New in v0.2.0)*\n```python\n# Stata: Original extreme values are lost forever\n# pywinsor2: Preserve original extreme values\nresult = pw2.winsor2(data, ['wage'], genextreme=('_orig_low', '_orig_high'))\n# Original extreme values saved for analysis\n```\n\n#### **Variable-Specific Cuts** *(New in v0.2.0)*\n```python\n# Stata: Same cuts for all variables\n# pywinsor2: Customized cuts per variable\nvar_cuts = {\n 'wage': (1, 99), # Conservative for wage\n 'returns': (5, 95) # More aggressive for returns\n}\nresult = pw2.winsor2(data, ['wage', 'returns'], var_cuts=var_cuts)\n```\n\n#### **Enhanced Group Processing** *(New in v0.2.0)*\n```python\n# Stata: Basic group processing\n# pywinsor2: Group processing + all new features combined\nresult, summary = pw2.winsor2(\n data, ['wage'], \n by='industry',\n cutlow=10, cuthigh=90,\n genextreme=('_orig_low', '_orig_high'),\n genflag='_outlier',\n verbose=True # Full feature integration!\n)\n```\n\n### \ud83d\udca1 **Why Upgrade from Stata winsor2?**\n1. **Same Reliable Results**: All core Stata functionality preserved\n2. **Enhanced Capabilities**: 6 powerful new features Stata doesn't offer\n3. **Better Workflow**: Detailed reporting and data preservation\n4. **Python Ecosystem**: Seamless integration with pandas, numpy, and modern data science tools\n5. **Open Source**: No licensing restrictions, full transparency\n\n## Main Function\n\n### `winsor2(data, varlist, cuts=(1, 99), cutlow=None, cuthigh=None, suffix=None, replace=False, trim=False, by=None, label=False, verbose=False, genflag=None, genextreme=None, var_cuts=None, copy=True)`\n\n**Core Parameters:**\n- `data` (DataFrame): Input pandas DataFrame\n- `varlist` (list): List of column names to process\n- `cuts` (tuple): Percentiles for winsorizing/trimming (default: (1, 99))\n- `suffix` (str): Suffix for new variables (default: '_w' for winsor, '_tr' for trim)\n- `replace` (bool): Replace original variables (default: False)\n- `trim` (bool): Trim instead of winsorize (default: False)\n- `by` (str or list): Group variables for group-wise processing\n- `label` (bool): Add descriptive labels to new columns (default: False)\n- `copy` (bool): Return a copy of the DataFrame (default: True)\n\n**New Parameters in v0.2.0:**\n- `cutlow` (float): Lower percentile cut (overrides `cuts[0]`)\n- `cuthigh` (float): Upper percentile cut (overrides `cuts[1]`)\n- `verbose` (bool): Print detailed processing summary (default: False)\n- `genflag` (str): Generate flag variable for trimmed observations (requires `trim=True`)\n- `genextreme` (tuple): Store original extreme values as `(low_suffix, high_suffix)`\n- `var_cuts` (dict): Variable-specific cuts as `{'var': (low, high), ...}`\n\n**Returns:**\n- `DataFrame`: Processed DataFrame with winsorized/trimmed variables\n\n## Examples\n\n### Basic Usage\n\n```python\nimport pandas as pd\nimport pywinsor2 as pw2\n\n# Create sample data\ndata = pd.DataFrame({\n 'wage': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100], # outlier: 100\n 'age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 25]\n})\n\n# Winsorize at default percentiles (1, 99)\nresult = pw2.winsor2(data, ['wage'])\nprint(result['wage_w']) # New winsorized variable\n\n# Winsorize multiple variables\nresult = pw2.winsor2(data, ['wage', 'age'], cuts=(5, 95))\n\n# Trim outliers\nresult = pw2.winsor2(data, ['wage'], trim=True, cuts=(10, 90))\nprint(result['wage_tr']) # Trimmed variable\n```\n\n### Group-wise Processing\n\n```python\n# Winsorize within groups\ndata = pd.DataFrame({\n 'wage': [1, 2, 3, 10, 1, 2, 3, 15],\n 'industry': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']\n})\n\nresult = pw2.winsor2(data, ['wage'], by='industry', cuts=(25, 75))\n```\n\n### Advanced Options\n\n```python\n# Replace original variables\npw2.winsor2(data, ['wage'], replace=True, cuts=(2, 98))\n\n# Custom suffix and labels\nresult = pw2.winsor2(data, ['wage'], suffix='_clean', label=True)\n```\n\n### New Features in v0.2.0\n\n#### Individual Cuts\n```python\n# Different lower and upper percentiles\nresult = pw2.winsor2(data, ['wage'], cutlow=5, cuthigh=90)\n```\n\n#### Verbose Reporting\n```python\n# Get detailed processing summary\nresult, summary = pw2.winsor2(data, ['wage', 'age'], verbose=True)\nprint(f\"Variables processed: {summary['variables_processed']}\")\nprint(f\"Total observations changed: {sum(summary['observations_changed'].values())}\")\n```\n\n#### Flag Variables for Trimming\n```python\n# Generate flags for trimmed observations\nresult = pw2.winsor2(data, ['wage'], trim=True, genflag='_trimmed')\nprint(result['wage_trimmed']) # 1 for trimmed, 0 for kept\n```\n\n#### Extreme Value Storage\n```python\n# Store original extreme values\nresult = pw2.winsor2(data, ['wage'], genextreme=('_low', '_high'))\nprint(result['wage_low']) # Original low extreme values\nprint(result['wage_high']) # Original high extreme values\n```\n\n#### Variable-Specific Cuts\n```python\n# Different cuts for different variables\nvar_cuts = {\n 'wage': (5, 95),\n 'age': (1, 99)\n}\nresult, summary = pw2.winsor2(data, ['wage', 'age'], var_cuts=var_cuts, verbose=True)\n```\n\n#### Enhanced Group Processing\n```python\n# Group processing with new features\nresult, summary = pw2.winsor2(\n data, ['wage'], \n by='industry',\n cutlow=10, cuthigh=90,\n genextreme=('_orig_low', '_orig_high'),\n verbose=True\n)\n```\n\n## Stata vs. pywinsor2 Comparison\n\n### Core Functionality Parity\n| Stata Command | pywinsor2 Equivalent | Status |\n|---------------|---------------------|---------|\n| `winsor2 wage` | `pw2.winsor2(df, ['wage'])` | \u2705 **Perfect Match** |\n| `winsor2 wage, cuts(5 95)` | `pw2.winsor2(df, ['wage'], cuts=(5, 95))` | \u2705 **Perfect Match** |\n| `winsor2 wage, trim` | `pw2.winsor2(df, ['wage'], trim=True)` | \u2705 **Perfect Match** |\n| `winsor2 wage, by(industry)` | `pw2.winsor2(df, ['wage'], by='industry')` | \u2705 **Perfect Match** |\n| `winsor2 wage, replace` | `pw2.winsor2(df, ['wage'], replace=True)` | \u2705 **Perfect Match** |\n\n### **Exclusive pywinsor2 Advantages**\n| Feature | Stata winsor2 | pywinsor2 v0.2.0 | Advantage |\n|---------|---------------|-------------------|-----------|\n| **Asymmetric Cuts** | \u274c Not supported | \u2705 `cutlow=2, cuthigh=98` | **Superior Control** |\n| **Processing Reports** | \u274c Minimal feedback | \u2705 `verbose=True` detailed summaries | **Better Insights** |\n| **Flag Generation** | \u274c Manual workaround needed | \u2705 `genflag='_outlier'` automatic | **Streamlined Workflow** |\n| **Extreme Value Storage** | \u274c Values lost forever | \u2705 `genextreme=('_low', '_high')` | **Data Preservation** |\n| **Variable-Specific Cuts** | \u274c Same cuts for all vars | \u2705 `var_cuts={'wage':(1,99), 'ret':(5,95)}` | **Precision Control** |\n| **Combined Features** | \u274c Limited combinations | \u2705 All features work together | **Maximum Flexibility** |\n\n### **Performance & Usability**\n- **Python Integration**: Seamless with pandas, numpy, matplotlib, seaborn\n- **Better Documentation**: Comprehensive examples and clear parameter descriptions\n- **Modern API**: Pythonic design with intuitive parameter names\n- **Open Source**: No licensing costs, community-driven improvements\n- **Active Development**: Regular updates and new features\n\n## **Why Choose pywinsor2 v0.2.0?**\n\n### **For Current Stata Users**\n- **Zero Learning Curve**: Same syntax, same results\n- **Immediate Upgrade**: 6 exclusive new features unavailable in Stata\n- **Cost Effective**: Reduce Stata license dependency\n- **Better Analysis**: Verbose reporting and data preservation capabilities\n\n### **For Python Users** \n- **Stata-Grade Reliability**: Battle-tested algorithms with 100% core feature compatibility\n- **Native Integration**: Perfect pandas DataFrame compatibility\n- **Research Ready**: Designed for econometrics and financial analysis\n- **Production Ready**: Comprehensive error handling and validation\n\n### **For Data Scientists**\n- **Precision Control**: Variable-specific cuts and asymmetric thresholds\n- **Rich Metadata**: Detailed processing summaries and change tracking\n- **Workflow Enhancement**: Automatic flagging and extreme value preservation\n- **Feature Combinations**: All new features work seamlessly together\n\n---\n\n**Ready to upgrade your winsorization workflow? Try pywinsor2 v0.2.0 today and experience the power of enhanced data preprocessing!**\n\n## \ud83d\udcc4 License\n\nMIT License\n\n## Related Projects\n\n- **[PyStataR](https://github.com/brycewang-stanford/PyStataR)** - Unified Stata-equivalent commands and R functions (recommended for new projects)\n- **[StatsPAI](https://github.com/brycewang-stanford/StatsPAI/)** - StatsPAI = Stats + Econometrics + ML + AI + LLMs\n\n\n## Author & Maintenance\n\n**Bryce Wang** - brycew6m@stanford.edu\n\nThis package is actively maintained. For bug reports, feature requests, or contributions, please visit the [GitHub repository](https://github.com/brycewang-stanford/pywinsor2).\n",
"bugtrack_url": null,
"license": null,
"summary": "Python implementation of Stata's winsor2 command for winsorizing and trimming data - Enhanced with 6 exclusive new features",
"version": "0.4.3",
"project_urls": {
"Documentation": "https://github.com/brycewang-stanford/pywinsor2#readme",
"Homepage": "https://github.com/brycewang-stanford/pywinsor2",
"Issues": "https://github.com/brycewang-stanford/pywinsor2/issues",
"Repository": "https://github.com/brycewang-stanford/pywinsor2"
},
"split_keywords": [
"stata",
" winsor",
" winsorize",
" trim",
" outliers",
" data-cleaning",
" pandas"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "03380d106f9a96299f6c85f651d35b819dec99018d82591cb75be94b2b5e445e",
"md5": "0f65a19801691ea4970df3cfbed49e3d",
"sha256": "559df99b86a0934db7df26d396bdf35a9e8e5a54932ad1b2ad3e0023100baea3"
},
"downloads": -1,
"filename": "pywinsor2-0.4.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0f65a19801691ea4970df3cfbed49e3d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 17428,
"upload_time": "2025-07-31T21:53:06",
"upload_time_iso_8601": "2025-07-31T21:53:06.878648Z",
"url": "https://files.pythonhosted.org/packages/03/38/0d106f9a96299f6c85f651d35b819dec99018d82591cb75be94b2b5e445e/pywinsor2-0.4.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "fbcff0045c36de716dc7a02387e01cdf435bf47199638dc329c2c0bd495b4c22",
"md5": "ac7b973104108fd82368a966cf331357",
"sha256": "afe849fd83fb56cfeab05b51439c13a532b9357e414d170d88bea6e4ab67c857"
},
"downloads": -1,
"filename": "pywinsor2-0.4.3.tar.gz",
"has_sig": false,
"md5_digest": "ac7b973104108fd82368a966cf331357",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 15309,
"upload_time": "2025-07-31T21:53:08",
"upload_time_iso_8601": "2025-07-31T21:53:08.243105Z",
"url": "https://files.pythonhosted.org/packages/fb/cf/f0045c36de716dc7a02387e01cdf435bf47199638dc329c2c0bd495b4c22/pywinsor2-0.4.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-31 21:53:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "brycewang-stanford",
"github_project": "pywinsor2#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pywinsor2"
}