# PyStataR
> **The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python**
## Project Vision & Goals
**PyStataR** aims to recreate and significantly enhance **the top 20 most frequently used Stata commands** in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to **expand and improve** upon it, leveraging Python's ecosystem to create superior research tools.
### Why This Project Matters
- **Bridge the Gap**: Seamless transition from Stata to Python for researchers
- **Enhanced Functionality**: Each command will be significantly expanded beyond Stata's original capabilities
- **Modern Research Tools**: Built for today's data science and research needs
- **Community-Driven**: Open source development with academic researchers in mind
### Target Commands (20 Most Used in Academic Research)
- ✅ **tabulate** - Cross-tabulation and frequency analysis
- ✅ **egen** - Extended data generation and manipulation
- ✅ **reghdfe** - High-dimensional fixed effects regression
- ✅ **winsor2** - Data winsorizing and trimming
- 🔄 **Coming Soon**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep/drop`, `generate`, `replace`, `sort`, `by`, `if/in`, `reg`, `logit`, `probit`, `ivregress`, `xtreg`

**Want to see a specific command implemented?**
- [Create an issue](https://github.com/brycewang-stanford/PyStataR/issues) to request a command
- [Contribute](CONTRIBUTING.md) to help us complete this project faster
- ⭐ Star this repo to show your support!
## Core Modules Overview
### **tabulate** - Advanced Cross-tabulation and Frequency Analysis
- **Beyond Stata**: Enhanced statistical tests, multi-dimensional tables, and publication-ready output
- **Key Features**: Chi-square tests, Fisher's exact test, Cramér's V, Kendall's tau, gamma coefficients
- **Use Cases**: Survey analysis, categorical data exploration, market research
### **egen** - Extended Data Generation and Manipulation
- **Beyond Stata**: Advanced ranking algorithms, robust statistical functions, and vectorized operations
- **Key Features**: Group operations, ranking with tie-breaking, row statistics, percentile calculations
- **Use Cases**: Data preprocessing, feature engineering, panel data construction
### **reghdfe** - High-Dimensional Fixed Effects Regression
- **Beyond Stata**: Memory-efficient algorithms, advanced clustering options, and diagnostic tools
- **Key Features**: Multiple fixed effects, clustered standard errors, instrumental variables, robust diagnostics
- **Use Cases**: Panel data analysis, causal inference, economic research
### **winsor2** - Advanced Outlier Detection and Treatment
- **Beyond Stata**: Multiple detection methods, group-specific treatment, and comprehensive diagnostics
- **Key Features**: IQR-based detection, percentile methods, group-wise operations, flexible trimming
- **Use Cases**: Data cleaning, outlier analysis, robust statistical modeling
## Advanced Features & Performance
### Performance Optimizations
- **Vectorized Operations**: All functions leverage NumPy and pandas for maximum speed
- **Memory Efficiency**: Optimized for large datasets common in academic research
- **Parallel Processing**: Multi-core support for computationally intensive operations
- **Lazy Evaluation**: Smart caching and delayed computation when beneficial
### Research-Grade Features
- **Publication Ready**: LaTeX and HTML output for academic papers
- **Reproducible Research**: Comprehensive logging and version tracking
- **Missing Data Handling**: Multiple imputation and robust missing value treatment
- **Bootstrapping**: Built-in bootstrap methods for confidence intervals
- **Cross-Validation**: Integrated CV methods for model validation
## Quick Installation
```bash
pip install pystatar
```
## Comprehensive Usage Examples
### `tabulate` - Advanced Cross-tabulation
The `tabulate` module provides comprehensive frequency analysis and cross-tabulation capabilities, extending far beyond Stata's original functionality.
#### Basic One-way Tabulation
```python
import numpy as np
import pandas as pd
from pystatar import tabulate

# Create sample dataset
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,
    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,
    'income': np.random.normal(50000, 15000, 600),
    'age': np.random.randint(22, 65, 600),
    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)
})
# Simple frequency table
result = tabulate.tabulate(df, 'education')
print(result)
```
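For readers who want to see what a one-way table contains, the following is a minimal sketch in plain pandas. It is not PyStataR's implementation; the helper name `one_way_table` and the column labels are hypothetical, chosen to resemble Stata's `tabulate` output (frequency, percent, cumulative percent).

```python
import pandas as pd


def one_way_table(series: pd.Series) -> pd.DataFrame:
    """Frequency, percent, and cumulative percent per category, sorted by label."""
    freq = series.value_counts(dropna=True).sort_index()
    percent = 100 * freq / freq.sum()
    return pd.DataFrame({
        "Freq.": freq,
        "Percent": percent.round(2),
        "Cum.": percent.cumsum().round(2),
    })


education = pd.Series(["High School", "College", "Graduate", "College"])
table = one_way_table(education)
print(table)
```

Missing values are dropped here; whether and how the package counts them is controlled by its own options.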
#### Advanced Two-way Cross-tabulation with Statistics
```python
# Two-way tabulation with comprehensive statistics
result = tabulate.tabulate(
    df,
    'gender', 'education',
    chi2=True,      # Chi-square test
    exact=True,     # Fisher's exact test
    gamma=True,     # Gamma coefficient
    taub=True,      # Kendall's tau-b
    V=True,         # Cramér's V
    missing=True,   # Include missing values
    row=True,       # Row percentages
    col=True,       # Column percentages
    cell=True       # Cell percentages
)
# Access different components
print("Frequency Table:")
print(result.table)
print(f"\nChi-square p-value: {result.chi2_pvalue:.4f}")
print(f"Cramér's V: {result.cramers_v:.4f}")
```
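To make the reported statistics concrete, here is a hedged sketch of how the Pearson chi-square statistic and Cramér's V can be computed by hand from a contingency table with pandas and NumPy. The function name `chi2_and_cramers_v` is an assumption for illustration and says nothing about how PyStataR computes these values internally.

```python
import numpy as np
import pandas as pd


def chi2_and_cramers_v(row_var: pd.Series, col_var: pd.Series):
    """Return (chi-square statistic, degrees of freedom, Cramér's V)."""
    observed = pd.crosstab(row_var, col_var).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    n_rows, n_cols = observed.shape
    dof = (n_rows - 1) * (n_cols - 1)
    # Cramér's V rescales chi-square to the [0, 1] range
    cramers_v = np.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))
    return chi2, dof, cramers_v


# Toy data in which the two variables are perfectly associated
demo = pd.DataFrame({
    "gender": ["Male", "Female"] * 50,
    "smoker": ["Yes", "No"] * 50,
})
chi2, dof, v = chi2_and_cramers_v(demo["gender"], demo["smoker"])
print(f"chi2={chi2:.1f}, dof={dof}, V={v:.2f}")
```

The sketch assumes both variables have at least two categories; with a single row or column the denominator of V is zero.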
#### Multi-way Tabulation
```python
# Three-way tabulation with layering
result = tabulate.tabulate(
    df,
    'gender', 'education',
    by='industry',  # Layer by industry
    chi2=True
)
# Access results by layer
for industry, table_result in result.by_results.items():
    print(f"\n=== {industry} ===")
    print(table_result.table)
```
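Conceptually, layering produces one two-way table per value of the `by` variable. A rough pandas-only equivalent, with a hypothetical toy dataset and not taken from the package's code, looks like this:

```python
import pandas as pd

survey = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "education": ["College", "College", "Graduate", "Graduate", "College", "Graduate"],
    "industry": ["Tech", "Tech", "Tech", "Finance", "Finance", "Finance"],
})

# One gender x education table for each industry layer
layered = {
    layer: pd.crosstab(group["gender"], group["education"])
    for layer, group in survey.groupby("industry")
}

for layer, table in layered.items():
    print(f"\n=== {layer} ===")
    print(table)
```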
### `egen` - Extended Data Generation
The `egen` module provides powerful data manipulation functions that extend Stata's egen capabilities.
#### Ranking and Percentile Functions
```python
from pystatar import egen
# Advanced ranking with tie-breaking options
df['income_rank'] = egen.rank(df['income'], method='average') # Handle ties
df['income_pctile'] = egen.xtile(df['income'], nquantiles=10) # Deciles
# Group-specific rankings
df['rank_within_industry'] = egen.rank(df['income'], by='industry')
# Percentile calculations
df['income_90th'] = egen.pctile(df['income'], 90)
df['income_iqr'] = egen.pctile(df['income'], 75) - egen.pctile(df['income'], 25)
```
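As a point of comparison, the same ideas can be expressed with built-in pandas methods. This is only a sketch on a hypothetical six-row dataset; pandas' default quantile interpolation may differ from Stata's percentile definition, so values near bin edges can disagree.

```python
import pandas as pd

people = pd.DataFrame({
    "industry": ["Tech", "Tech", "Tech", "Finance", "Finance", "Finance"],
    "income": [30_000, 45_000, 45_000, 60_000, 90_000, 120_000],
})

# Tied values share the average of the ranks they span
people["income_rank"] = people["income"].rank(method="average")

# Quantile groups numbered from 1 (terciles here)
people["income_tercile"] = pd.qcut(people["income"], q=3, labels=False) + 1

# Rank within each industry
people["rank_within_industry"] = (
    people.groupby("industry")["income"].rank(method="average")
)

# A single percentile of the whole column (linear interpolation)
p90 = people["income"].quantile(0.90)
print(people)
print(f"90th percentile: {p90:.0f}")
```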
#### Row Operations
```python
# Create test scores dataset
scores_df = pd.DataFrame({
    'student': range(1, 101),
    'math': np.random.normal(75, 10, 100),
    'english': np.random.normal(80, 12, 100),
    'science': np.random.normal(78, 11, 100),
    'history': np.random.normal(82, 9, 100)
})
# Row statistics
scores_df['total_score'] = egen.rowtotal(scores_df, ['math', 'english', 'science', 'history'])
scores_df['avg_score'] = egen.rowmean(scores_df, ['math', 'english', 'science', 'history'])
scores_df['min_score'] = egen.rowmin(scores_df, ['math', 'english', 'science', 'history'])
scores_df['max_score'] = egen.rowmax(scores_df, ['math', 'english', 'science', 'history'])
scores_df['score_sd'] = egen.rowsd(scores_df, ['math', 'english', 'science', 'history'])
# Count non-missing values per row
scores_df['subjects_taken'] = egen.rownonmiss(scores_df, ['math', 'english', 'science', 'history'])
```
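The row functions correspond closely to pandas reductions along `axis=1`. The sketch below uses a small hypothetical table with missing values to show how they are skipped; it illustrates the semantics and is not the package's source.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "math": [70.0, 85.0, np.nan],
    "english": [80.0, np.nan, np.nan],
    "science": [90.0, 75.0, 60.0],
})
subjects = ["math", "english", "science"]

# Missing values are skipped, so a row total treats them as zero
scores["total"] = scores[subjects].sum(axis=1)
scores["mean"] = scores[subjects].mean(axis=1)
scores["min"] = scores[subjects].min(axis=1)
scores["max"] = scores[subjects].max(axis=1)
scores["sd"] = scores[subjects].std(axis=1)          # sample SD (ddof=1)
scores["n_nonmissing"] = scores[subjects].count(axis=1)
print(scores)
```

A row with a single non-missing value has an undefined sample standard deviation, so `sd` is `NaN` there.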
#### Group Statistics and Operations
```python
# Group summary statistics
df['mean_income_by_education'] = egen.mean(df['income'], by='education')
df['median_income_by_industry'] = egen.median(df['income'], by='industry')
df['sd_income_by_gender'] = egen.sd(df['income'], by='gender')
# Group identification and counting
df['education_group_size'] = egen.count(df, by='education')
df['first_in_group'] = egen.tag(df, ['education', 'gender']) # First observation in group
df['group_sequence'] = egen.seq(df, by='education') # Sequence within group
# Advanced group operations
df['income_rank_in_education'] = egen.rank(df['income'], by='education')
df['above_group_median'] = (df['income'] > egen.median(df['income'], by='education')).astype(int)
```
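Group functions of this kind broadcast a per-group statistic back to every row. A minimal pandas sketch of that pattern follows, assuming a hypothetical `workers` table; the column names are illustrative only.

```python
import pandas as pd

workers = pd.DataFrame({
    "education": ["College", "College", "Graduate", "College", "Graduate"],
    "income": [40, 50, 70, 60, 90],
})
by_edu = workers.groupby("education")["income"]

# transform() returns one value per original row, aligned to the index
workers["mean_income"] = by_edu.transform("mean")
workers["median_income"] = by_edu.transform("median")
workers["group_size"] = by_edu.transform("count")    # non-missing observations

# 1 for the first row of each group, 0 otherwise
workers["first_in_group"] = (~workers.duplicated("education")).astype(int)

# Running sequence number within each group, starting at 1
workers["seq_in_group"] = workers.groupby("education").cumcount() + 1

workers["above_median"] = (workers["income"] > workers["median_income"]).astype(int)
print(workers)
```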
### `reghdfe` - Advanced Fixed Effects Regression
The `reghdfe` module provides state-of-the-art estimation for linear models with high-dimensional fixed effects.
#### Basic Fixed Effects Regression
```python
from pystatar import reghdfe
# Create panel dataset
np.random.seed(42)
n_firms, n_years = 100, 10
n_obs = n_firms * n_years
panel_df = pd.DataFrame({
    'firm_id': np.repeat(range(n_firms), n_years),
    'year': np.tile(range(2010, 2020), n_firms),
    'log_sales': np.random.normal(10, 1, n_obs),
    'log_employment': np.random.normal(4, 0.5, n_obs),
    'log_capital': np.random.normal(8, 0.8, n_obs),
    'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)
})
# Basic regression with firm and year fixed effects
result = reghdfe.reghdfe(
    data=panel_df,
    depvar='log_sales',
    regressors=['log_employment', 'log_capital'],
    absorb=['firm_id', 'year']
)
print(result.summary())
print(f"R-squared: {result.r2:.4f}")
print(f"Number of observations: {result.N}")
```
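Absorbing fixed effects amounts to removing group means from the outcome and the regressors before running OLS. The sketch below illustrates that idea with alternating projections in pandas and NumPy. It is a teaching example under stated assumptions (simulated balanced panel, hypothetical helper `demean_by_groups`), not the algorithm PyStataR or pyhdfe actually uses, and it ignores standard errors and degrees-of-freedom corrections.

```python
import numpy as np
import pandas as pd


def demean_by_groups(values: pd.DataFrame, group_keys, tol=1e-10, max_iter=1000):
    """Sweep out group means for each key in turn until the values stop changing."""
    resid = values.astype(float)
    for _ in range(max_iter):
        previous = resid
        for key in group_keys:
            resid = resid - resid.groupby(key).transform("mean")
        if np.abs((resid - previous).to_numpy()).max() < tol:
            break
    return resid


# Simulated panel with firm and year effects and a true slope of 2.0
rng = np.random.default_rng(0)
n_firms, n_years = 30, 6
firm = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(n_years), n_firms)
firm_effect = rng.normal(size=n_firms)[firm]
year_effect = rng.normal(size=n_years)[year]
x = rng.normal(size=firm.size) + 0.5 * firm_effect
y = 2.0 * x + firm_effect + year_effect + rng.normal(scale=0.1, size=firm.size)
panel = pd.DataFrame({"firm": firm, "year": year, "x": x, "y": y})

# Demean y and x by firm and year, then run OLS on the residuals
within = demean_by_groups(panel[["y", "x"]], [panel["firm"], panel["year"]])
beta, *_ = np.linalg.lstsq(
    within[["x"]].to_numpy(), within["y"].to_numpy(), rcond=None
)
print(f"Within estimate of the slope: {beta[0]:.3f}")
```

The slope matches what an OLS regression with a full set of firm and year dummies would give, without ever building the dummy matrix, which is the reason this approach scales to high-dimensional fixed effects.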
#### Advanced Regression with Clustering and Instruments
```python
# Add instrumental variables (used in the IV example in the next subsection)
panel_df['instrument1'] = np.random.normal(0, 1, n_obs)
panel_df['instrument2'] = np.random.normal(0, 1, n_obs)
# Recover employment in levels so it can serve as the weight column
panel_df['employment'] = np.exp(panel_df['log_employment'])
# Regression with clustering and multiple fixed effects
result = reghdfe.reghdfe(
    data=panel_df,
    depvar='log_sales',
    regressors=['log_employment', 'log_capital'],
    absorb=['firm_id', 'year', 'industry'],  # Multiple fixed effects
    cluster='firm_id',                       # Clustered standard errors
    weights='employment',                    # Weighted regression
    subset=(panel_df['year'] >= 2012)        # Conditional estimation
)
# Access detailed results
print("Coefficient Table:")
print(result.coef_table)
print(f"\nFixed Effects absorbed: {result.absorbed_fe}")
print(f"Clusters: {result.n_clusters}")
```
#### Instrumental Variables with High-Dimensional FE
```python
# IV regression with fixed effects
iv_result = reghdfe.ivreghdfe(
    data=panel_df,
    depvar='log_sales',
    endogenous=['log_employment'],               # Endogenous variable
    instruments=['instrument1', 'instrument2'],  # Instruments
    exogenous=['log_capital'],                   # Exogenous controls
    absorb=['firm_id', 'year'],
    cluster='firm_id'
)
print("First Stage Results:")
print(iv_result.first_stage)
print(f"\nWeak instruments test (F-stat): {iv_result.first_stage_fstat:.2f}")
print(f"Overidentification test (Hansen J): {iv_result.hansen_j:.4f}")
```
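For intuition about what an IV estimator does, here is a bare-bones two-stage least squares sketch in NumPy on simulated data. The function `two_stage_least_squares` and the data-generating process are assumptions made for this illustration; the sketch omits fixed effects, clustering, and the first-stage and overidentification diagnostics shown above.

```python
import numpy as np


def two_stage_least_squares(y, endogenous, exogenous, instruments):
    """Point estimates only: coefficients on [endogenous, exogenous]."""
    X = np.column_stack([endogenous, exogenous])
    Z = np.column_stack([instruments, exogenous])
    # First stage: fitted values of the regressors given the instrument set
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress y on the fitted values
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]


rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=(n, 2))                    # two valid instruments
confounder = rng.normal(size=n)                # unobserved, drives both x and y
x = z[:, 0] + 0.5 * z[:, 1] + confounder + rng.normal(size=n)
y = 1.5 * x - 2.0 * confounder + rng.normal(scale=0.5, size=n)
const = np.ones(n)

beta_iv = two_stage_least_squares(y, x, const, z)
beta_ols = np.linalg.lstsq(np.column_stack([x, const]), y, rcond=None)[0]
print(f"True slope 1.5 | OLS: {beta_ols[0]:.3f} | 2SLS: {beta_iv[0]:.3f}")
```

Standard errors taken naively from the second-stage regression are not valid for 2SLS, which is one reason to rely on a dedicated routine rather than this sketch.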
### `winsor2` - Advanced Outlier Treatment
The `winsor2` module provides comprehensive outlier detection and treatment methods.
#### Basic Winsorizing
```python
from pystatar import winsor2
# Create dataset with outliers
outlier_df = pd.DataFrame({
    'income': np.concatenate([
        np.random.normal(50000, 10000, 950),   # Normal observations
        np.random.uniform(200000, 500000, 50)  # Outliers
    ]),
    'age': np.random.randint(18, 70, 1000),
    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)
})
# Basic winsorizing at 1st and 99th percentiles
result = winsor2.winsor2(outlier_df, ['income'])
print("Original vs Winsorized:")
print(f"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}")
print(f"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}")
```
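The difference between winsorizing and trimming can be shown in a few lines of pandas. This is a sketch under the assumption that percentiles are computed with pandas' default linear interpolation, which may not match Stata's definition exactly; the helper names `winsorize` and `trim` are hypothetical.

```python
import numpy as np
import pandas as pd


def winsorize(s: pd.Series, cuts=(1, 99)) -> pd.Series:
    """Cap values below/above the given percentiles at those percentiles."""
    low, high = s.quantile([cuts[0] / 100, cuts[1] / 100])
    return s.clip(lower=low, upper=high)


def trim(s: pd.Series, cuts=(1, 99)) -> pd.Series:
    """Set values outside the given percentiles to missing."""
    low, high = s.quantile([cuts[0] / 100, cuts[1] / 100])
    return s.where(s.between(low, high))


rng = np.random.default_rng(2)
income = pd.Series(np.concatenate([
    rng.normal(50_000, 10_000, 950),     # typical observations
    rng.uniform(200_000, 500_000, 50),   # outliers
]))

income_w = winsorize(income)
income_tr = trim(income)
print(f"Original max:   {income.max():.0f}")
print(f"Winsorized max: {income_w.max():.0f}")
print(f"Non-missing after trimming: {income_tr.notna().sum()}")
```

Winsorizing keeps every observation and only changes extreme values, whereas trimming reduces the number of usable observations.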
#### Group-wise Winsorizing
```python
# Winsorize within groups
result = winsor2.winsor2(
    outlier_df,
    ['income'],
    by='industry',    # Winsorize within each industry
    cuts=(5, 95),     # Use 5th and 95th percentiles
    suffix='_clean'   # Custom suffix
)
# Compare distributions by group
for industry in outlier_df['industry'].unique():
    mask = outlier_df['industry'] == industry
    original = outlier_df.loc[mask, 'income']
    winsorized = result.loc[mask, 'income_clean']
    print(f"\n{industry}:")
    print(f" Original: {original.describe()}")
    print(f" Winsorized: {winsorized.describe()}")
```
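Group-wise winsorizing means the percentile cutoffs are computed separately inside each group. A hedged pandas sketch of that logic, using a hypothetical helper `winsorize_by_group` and toy data with two groups on very different scales:

```python
import numpy as np
import pandas as pd


def winsorize_by_group(df: pd.DataFrame, column: str, by: str, cuts=(5, 95)) -> pd.Series:
    """Cap `column` at percentiles computed within each level of `by`."""
    lower_q, upper_q = cuts[0] / 100, cuts[1] / 100
    grouped = df.groupby(by)[column]
    # Broadcast each group's cutoffs back to its rows
    low = grouped.transform(lambda g: g.quantile(lower_q))
    high = grouped.transform(lambda g: g.quantile(upper_q))
    return df[column].clip(lower=low, upper=high)


toy = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "value": np.concatenate([np.arange(1, 101), np.arange(1000, 1100)]).astype(float),
})
toy["value_clean"] = winsorize_by_group(toy, "value", by="group")

print(toy.groupby("group")["value_clean"].agg(["min", "max"]))
```

With pooled cutoffs, group A would only be capped from below and group B only from above; computing cutoffs per group caps both tails in each.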
#### Trimming vs Winsorizing Comparison
```python
# Compare different outlier treatment methods
trim_result = winsor2.winsor2(
    outlier_df,
    ['income'],
    trim=True,         # Trim (remove) instead of winsorize
    cuts=(2.5, 97.5)   # Trim 2.5% from each tail
)
winsor_result = winsor2.winsor2(
    outlier_df,
    ['income'],
    trim=False,        # Winsorize (cap) outliers
    cuts=(2.5, 97.5)
)
print("Treatment Comparison:")
print(f"Original N: {len(outlier_df)}")
print(f"After trimming N: {trim_result['income_tr'].notna().sum()}")
print(f"After winsorizing N: {len(winsor_result)}")
print(f"Trimmed mean: {trim_result['income_tr'].mean():.0f}")
print(f"Winsorized mean: {winsor_result['income_w'].mean():.0f}")
```
#### Advanced Outlier Detection
```python
# Multiple variable winsorizing with custom thresholds
multi_result = winsor2.winsor2(
    outlier_df,
    ['income', 'age'],
    cuts=(1, 99),      # Same cuts applied to every listed variable
    by='industry',     # Group-specific treatment
    replace=True,      # Replace original variables
    label=True         # Add descriptive labels
)
# Generate outlier indicators
outlier_df['income_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'],
    method='iqr',      # Use IQR method
    factor=1.5         # 1.5 * IQR threshold
)
outlier_df['extreme_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'],
    method='percentile',  # Use percentile method
    cuts=(1, 99)
)
print("Outlier Detection Results:")
print(f"IQR method detected {outlier_df['income_outlier'].sum()} outliers")
print(f"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers")
```
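The IQR rule flags values that lie more than `factor` times the interquartile range beyond the first or third quartile. A small standalone sketch of that rule in pandas is shown here; the function name `iqr_outlier_flag` is hypothetical and the quartiles use pandas' default interpolation.

```python
import pandas as pd


def iqr_outlier_flag(s: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return 1 where a value falls outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return ((s < lower_fence) | (s > upper_fence)).astype(int)


values = pd.Series([10, 12, 11, 13, 12, 11, 100])
flags = iqr_outlier_flag(values)
print(f"Flagged {flags.sum()} of {len(values)} values")
```

Unlike the percentile method, which always flags a fixed share of the data, the IQR rule flags nothing when the distribution has no extreme values.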
## Project Structure
```
pystatar/
├── __init__.py              # Main package initialization
├── tabulate/                # Cross-tabulation module
│   ├── __init__.py
│   ├── core.py
│   ├── results.py
│   └── stats.py
├── egen/                    # Extended generation module
│   ├── __init__.py
│   └── core.py
├── reghdfe/                 # High-dimensional FE regression
│   ├── __init__.py
│   ├── core.py
│   ├── estimation.py
│   └── utils.py
├── winsor2/                 # Winsorizing module
│   ├── __init__.py
│   ├── core.py
│   └── utils.py
├── utils/                   # Shared utilities
│   ├── __init__.py
│   └── common.py
└── tests/                   # Test suite
    ├── test_tabulate.py
    ├── test_egen.py
    ├── test_reghdfe.py
    └── test_winsor2.py
```
## Key Features
- **Familiar Syntax**: Stata-like command structure and parameters
- **Pandas Integration**: Seamless integration with pandas DataFrames
- **High Performance**: Optimized implementations using pandas and NumPy
- **Comprehensive Coverage**: Most commonly used Stata commands
- **Statistical Rigor**: Proper statistical tests and robust standard errors
- **Flexible Output**: Multiple output formats and customization options
- **Missing Value Handling**: Configurable treatment of missing data
## Documentation
Each module comes with comprehensive documentation and examples:
- [**tabulate Documentation**](docs/tabulate.md) - Cross-tabulation and frequency analysis
- [**egen Documentation**](docs/egen.md) - Extended data generation functions
- [**reghdfe Documentation**](docs/reghdfe.md) - High-dimensional fixed effects regression
- [**winsor2 Documentation**](docs/winsor2.md) - Data winsorizing and trimming
## Contributing to the Project
We're building the future of academic research tools in Python! Here's how you can help:
### Priority Commands Needed
Help us implement the remaining **16 high-priority commands**:
**Data Management**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep`, `drop`, `generate`, `replace`, `sort`
**Statistical Analysis**: `reg`, `logit`, `probit`, `ivregress`, `xtreg`, `anova`
### How to Contribute
1. **Request a Command**: [Open an issue](https://github.com/brycewang-stanford/PyStataR/issues/new) with the command you need
2. **Implement a Command**: Check our [contribution guidelines](CONTRIBUTING.md) and submit a PR
3. **Report Bugs**: Help us improve existing functionality
4. **Improve Documentation**: Add examples, tutorials, or clarifications
5. **Spread the Word**: Star the repo and share with fellow researchers
### Recognition
All contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.
### Academic Collaboration
We welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!
## Community & Support
- **Documentation**: [https://pystatar.readthedocs.io](docs/)
- **Discussions**: [GitHub Discussions](https://github.com/brycewang-stanford/PyStataR/discussions)
- **Issues**: [Bug Reports & Feature Requests](https://github.com/brycewang-stanford/PyStataR/issues)
- **Email**: brycew6m@stanford.edu for academic collaborations
## Comparison with Stata
| Feature | Stata | PyStataR | Advantage |
|---------|-------|-------------------|-----------|
| **Speed** | Base performance | 2-10x faster* | Vectorized operations |
| **Memory** | Limited by system | Efficient pandas backend | Better large dataset handling |
| **Extensibility** | Ado files | Python ecosystem | Unlimited customization |
| **Cost** | $$$$ | Free & Open Source | Accessible to all researchers |
| **Integration** | Standalone | Python data science stack | Seamless workflow |
| **Output** | Limited formats | Multiple (LaTeX, HTML, etc.) | Publication ready |
*Performance comparison based on typical academic datasets (1M+ observations)
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
This package builds upon the excellent work of:
- [pandas](https://pandas.pydata.org/) - The backbone of our data manipulation
- [numpy](https://numpy.org/) - Powering our numerical computations
- [scipy](https://scipy.org/) - Statistical functions and algorithms
- [statsmodels](https://www.statsmodels.org/) - Statistical modeling foundations
- [pyhdfe](https://github.com/jeffgortmaker/pyhdfe) - High-dimensional fixed effects algorithms
- The entire **Stata community** - For decades of statistical innovation that inspired this project
## Future Roadmap
### Version 1.0 Goals (Target: End of 2025)
- Core 4 commands implemented
- Additional 16 high-priority commands
- Comprehensive test suite (>95% coverage)
- Complete documentation with tutorials
- Performance benchmarks vs Stata
### Version 2.0 Vision (2026)
- Machine learning integration
- R integration for cross-platform compatibility
- Web interface for non-programmers
- Jupyter notebook extensions
## Contact & Collaboration
**Created by [Bryce Wang](https://github.com/brycewang-stanford)** - Stanford University
- **Email**: brycew6m@stanford.edu
- **GitHub**: [@brycewang-stanford](https://github.com/brycewang-stanford)
- **Academic**: Stanford Graduate School of Business
- **LinkedIn**: [Connect with me](https://linkedin.com/in/brycewang)
### Academic Partnerships Welcome!
- Course integration and teaching materials
- Research collaborations and citations
- Institutional licensing and support
- Student contributor programs
---
### ⭐ **Love this project? Give it a star and help us reach more researchers!** ⭐
**Together, we're building the future of academic research in Python**
functions\n- [**reghdfe Documentation**](docs/reghdfe.md) - High-dimensional fixed effects regression\n- [**winsor2 Documentation**](docs/winsor2.md) - Data winsorizing and trimming\n\n## Contributing to the Project\n\nWe're building the future of academic research tools in Python! Here's how you can help:\n\n### Priority Commands Needed\nHelp us implement the remaining **16 high-priority commands**:\n\n**Data Management**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep`, `drop`, `generate`, `replace`, `sort`\n\n**Statistical Analysis**: `reg`, `logit`, `probit`, `ivregress`, `xtreg`, `anova`\n\n### How to Contribute\n\n1. **Request a Command**: [Open an issue](https://github.com/brycewang-stanford/PyStataR/issues/new) with the command you need\n2. **Implement a Command**: Check our [contribution guidelines](CONTRIBUTING.md) and submit a PR\n3. **Report Bugs**: Help us improve existing functionality\n4. **Improve Documentation**: Add examples, tutorials, or clarifications\n5. **Spread the Word**: Star the repo and share with fellow researchers\n\n### Recognition\nAll contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.\n\n### Academic Collaboration\nWe welcome partnerships with universities and research institutions. 
If you're interested in using this project in your coursework or research, please reach out!\n\n## Community & Support\n\n- **Documentation**: [https://pystatar.readthedocs.io](docs/)\n- **Discussions**: [GitHub Discussions](https://github.com/brycewang-stanford/PyStataR/discussions)\n- **Issues**: [Bug Reports & Feature Requests](https://github.com/brycewang-stanford/PyStataR/issues)\n- **Email**: brycew6m@stanford.edu for academic collaborations\n\n## Comparison with Stata\n\n| Feature | Stata | PyStataR | Advantage |\n|---------|-------|-------------------|-----------|\n| **Speed** | Base performance | 2-10x faster* | Vectorized operations |\n| **Memory** | Limited by system | Efficient pandas backend | Better large dataset handling |\n| **Extensibility** | Ado files | Python ecosystem | Unlimited customization |\n| **Cost** | $$$$ | Free & Open Source | Accessible to all researchers |\n| **Integration** | Standalone | Python data science stack | Seamless workflow |\n| **Output** | Limited formats | Multiple (LaTeX, HTML, etc.) 
| Publication ready |\n\n*Performance comparison based on typical academic datasets (1M+ observations)\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\ude4f Acknowledgments\n\nThis package builds upon the excellent work of:\n- [pandas](https://pandas.pydata.org/) - The backbone of our data manipulation\n- [numpy](https://numpy.org/) - Powering our numerical computations\n- [scipy](https://scipy.org/) - Statistical functions and algorithms\n- [statsmodels](https://www.statsmodels.org/) - Statistical modeling foundations\n- [pyhdfe](https://github.com/jeffgortmaker/pyhdfe) - High-dimensional fixed effects algorithms\n- The entire **Stata community** - For decades of statistical innovation that inspired this project\n\n## Future Roadmap\n\n### Version 1.0 Goals (Target: End of 2025)\n- Core 4 commands implemented\n- Additional 16 high-priority commands\n- Comprehensive test suite (>95% coverage)\n- Complete documentation with tutorials\n- Performance benchmarks vs Stata\n\n### Version 2.0 Vision (2026)\n- Machine learning integration\n- R integration for cross-platform compatibility\n- Web interface for non-programmers\n- Jupyter notebook extensions\n\n## \ud83d\udcc8 Project Statistics\n\n[](https://github.com/brycewang-stanford/PyStataR/stargazers)\n[](https://github.com/brycewang-stanford/PyStataR/network)\n[](https://github.com/brycewang-stanford/PyStataR/issues)\n[](https://github.com/brycewang-stanford/PyStataR/pulls)\n\n## Contact & Collaboration\n\n**Created by [Bryce Wang](https://github.com/brycewang-stanford)** - Stanford University\n\n- **Email**: brycew6m@stanford.edu \n- **GitHub**: [@brycewang-stanford](https://github.com/brycewang-stanford)\n- **Academic**: Stanford Graduate School of Business\n- **LinkedIn**: [Connect with me](https://linkedin.com/in/brycewang)\n\n### Academic Partnerships Welcome!\n- Course integration and teaching materials\n- Research 
collaborations and citations\n- Institutional licensing and support\n- Student contributor programs\n\n---\n\n### \u2b50 **Love this project? Give it a star and help us reach more researchers!** \u2b50\n\n**Together, we're building the future of academic research in Python** \n",
"bugtrack_url": null,
"license": null,
"summary": "Comprehensive Python package providing Stata-equivalent commands for pandas DataFrames",
"version": "0.0.0",
"project_urls": {
"Bug Tracker": "https://github.com/brycewang-stanford/PyStataR/issues",
"Documentation": "https://github.com/brycewang-stanford/PyStataR/docs",
"Homepage": "https://github.com/brycewang-stanford/PyStataR",
"Repository": "https://github.com/brycewang-stanford/PyStataR"
},
"split_keywords": [
"stata",
" pandas",
" econometrics",
" statistics",
" data-analysis",
" tabulate",
" egen",
" reghdfe",
" winsor",
" cross-tabulation",
" fixed-effects",
" panel-data",
" regression"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "232cc0069fd4068c238eb99e4747b0da29de59413478e7c2bdafd47947af8feb",
"md5": "b686a99d9d5ed074f825d483fc6e26b1",
"sha256": "71346c0bdd36011c818fe4dc32334ac481d19a5258e1340193f7e20719f8b4a2"
},
"downloads": -1,
"filename": "pystatar-0.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b686a99d9d5ed074f825d483fc6e26b1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 42822,
"upload_time": "2025-07-27T08:19:16",
"upload_time_iso_8601": "2025-07-27T08:19:16.780991Z",
"url": "https://files.pythonhosted.org/packages/23/2c/c0069fd4068c238eb99e4747b0da29de59413478e7c2bdafd47947af8feb/pystatar-0.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "cdc04b4dfb7aa9282bbe97bd02b1d7b29600146188fad3fe56919b30f8283fc9",
"md5": "c3370fe63f97066098c972e2071852f3",
"sha256": "caf6f1df7780f92350b80f4dfddc6d1bab6dd386fc9d855a8ea9d340b72677ab"
},
"downloads": -1,
"filename": "pystatar-0.0.0.tar.gz",
"has_sig": false,
"md5_digest": "c3370fe63f97066098c972e2071852f3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 50842,
"upload_time": "2025-07-27T08:19:18",
"upload_time_iso_8601": "2025-07-27T08:19:18.272576Z",
"url": "https://files.pythonhosted.org/packages/cd/c0/4b4dfb7aa9282bbe97bd02b1d7b29600146188fad3fe56919b30f8283fc9/pystatar-0.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-27 08:19:18",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "brycewang-stanford",
"github_project": "PyStataR",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pystatar"
}