pystatar


Namepystatar JSON
Version 0.4.0 PyPI version JSON
download
home_pageNone
SummaryPyStataR aims to recreate and significantly enhance the top and most frequently used Stata commands in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to expand and improve upon it, leveraging Python's ecosystem to create superior research tools.
upload_time2025-08-02 00:19:39
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseNone
keywords stata pandas econometrics statistics data-analysis tabulate egen winsor cross-tabulation pyegen pywinsor2 pdtab data-manipulation winsorizing frequency-analysis outreg pyoutreg regression-tables outreg2 model-output research-tools
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # PyStataR## ๐Ÿ†• What's New in v0.4.0

โœจ **New Integration**: Added pyoutreg for professional regression output tables (Stata's `outreg2` equivalent)  
๐Ÿ“Š **Enhanced Functionality**: Comprehensive regression result export to Excel/Word with publication-quality formatting  
๐Ÿ”ง **Four-Package Integration**: Now includes pyegen, pywinsor2, pdtab, and pyoutreg under unified interface  
๐Ÿ“š **Extended Documentation**: Complete examples for regression output and model comparison  
๐Ÿš€ **Research-Ready**: End-to-end workflow from data processing to publication tablesython Version](https://img.shields.io/pypi/pyversions/pystatar)](https://pypi.org/project/pystatar/)
[![PyPI Version](https://img.shields.io/pypi/v/pystatar)](https://pypi.org/project/pystatar/)
[![License](https://img.shields.io/pypi/l/pystatar)](https://github.com/brycewang-stanford/PyStataR/blob/main/LICENSE)
[![Downloads](https://img.shields.io/pypi/dm/pystatar)](https://pypi.org/project/pystatar/)

> **The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python**  
> **้›†ๆˆ Stata ๅ’Œ R ่ฏญ่จ€็š„ๆœ€้ซ˜้ข‘ไฝฟ็”จๅทฅๅ…ท๏ผŒ่ฎฉ็คพ็ง‘ๅญฆๆœฏๅ’Œ็ปŸ่ฎก็ ”็ฉถ๏ผŒๅ…จ้ขๆ‹ฅๆŠฑ Python/AI/ๅผ€ๆบ็คพๅŒบ**

## What's New in v0.3.0

**Enhanced Architecture**: Improved unified interface with better error handling and documentation  
**Cleaner Codebase**: Removed duplicate code and streamlined module structure  
**Better Documentation**: Enhanced examples and clearer API documentation  
**Performance**: Optimized imports and reduced overhead for faster loading  

## Project Vision & Goals

**PyStataR** serves as a unified interface to the most powerful and frequently used Stata-equivalent packages in Python. Instead of reinventing the wheel, we provide seamless integration of four mature PyPI packages under one convenient interface.

- **Seamless Integration**: Four proven PyPI packages unified under one interface
- **Familiar Workflow**: Stata-like syntax and functionality for Python users  
- **Academic Focus**: Built specifically for research and statistical analysis needs
- **Open Source**: Free and accessible to all researchers worldwide
- **No Reinvention**: Leverages existing, mature packages rather than duplicating functionality


### Why This Project Matters
- **Bridge the Gap**: Seamless transition from Stata to Python for researchers
- **Unified Interface**: One package, multiple powerful tools - no need to learn different APIs
- **Mature Foundation**: Built on battle-tested PyPI packages with years of development
- **Community-Driven**: Open source development with academic researchers in mind
- **No Maintenance Overhead**: Leverages existing packages rather than maintaining duplicate code

### Target Stata Commands (The Most Used in Academic Research)
โœ… **pyegen** - Extended data generation and manipulation (Stata's `egen`)  
โœ… **pywinsor2** - Data winsorizing and trimming (Stata's `winsor2`)  
โœ… **pdtab** - Cross-tabulation and frequency analysis (Stata's `tabulate`)  
โœ… **pyoutreg** - Professional regression output tables (Stata's `outreg2`)

**Based on mature PyPI packages**:
- [pyegen](https://pypi.org/project/pyegen/) - version 0.2.4+
- [pywinsor2](https://pypi.org/project/pywinsor2/) - version 0.4.3+  
- [pdtab](https://pypi.org/project/pdtab/) - version 0.1.1+
- [pyoutreg](https://pypi.org/project/pyoutreg/) - version 0.1.1+

**Want to contribute or request features?** 
-  [Create an issue](https://github.com/brycewang-stanford/PyStataR/issues) to request functionality
-  [Contribute](CONTRIBUTING.md) to help us improve the package
- โญ Star this repo to show your support!
---- 
## Core Modules Overview
### **pyegen** - Extended Data Generation and Manipulation  
- **Built on**: [pyegen v0.2.4](https://pypi.org/project/pyegen/) PyPI package
- **Key Features**: Group operations, ranking with tie-breaking, row statistics, percentile calculations
- **Use Cases**: Data preprocessing, feature engineering, panel data construction

### **pdtab** - Advanced Cross-tabulation and Frequency Analysis
- **Built on**: [pdtab v0.1.1](https://pypi.org/project/pdtab/) PyPI package  
- **Key Features**: One-way and two-way tables, statistical tests, comprehensive output formatting
- **Use Cases**: Survey analysis, categorical data exploration, market research

### **pywinsor2** - Advanced Outlier Detection and Treatment
- **Built on**: [pywinsor2 v0.4.3](https://pypi.org/project/pywinsor2/) PyPI package
- **Key Features**: IQR-based detection, percentile methods, group-wise operations, flexible trimming
- **Use Cases**: Data cleaning, outlier analysis, robust statistical modeling

### **pyoutreg** - Professional Regression Output Tables  
- **Built on**: [pyoutreg v0.1.1](https://pypi.org/project/pyoutreg/) PyPI package
- **Key Features**: Stata `outreg2` equivalent, Excel/Word export, model comparison, publication-quality formatting
- **Use Cases**: Academic papers, research reports, model comparison tables, publication workflows

## Advanced Features & Performance

### Performance Optimizations
- **Vectorized Operations**: All functions leverage NumPy and pandas for maximum speed
- **Memory Efficiency**: Optimized for large datasets common in academic research
- **Proven Reliability**: Built on four mature PyPI packages with extensive testing
- **Modular Design**: Use individual modules independently or together

### Research-Grade Features
- **Publication Ready**: Clean output formatting suitable for academic papers
- **Reproducible Research**: Consistent results and comprehensive documentation
- **Missing Data Handling**: Robust missing value treatment across all modules
- **Academic Standards**: Follows statistical best practices and conventions

## Quick Installation

```bash
pip install pystatar
```

## Comprehensive Usage Examples

### Two Ways to Use PyStataR

#### Method 1: Module-based Import (Recommended)
```python
from pystatar import pyegen, pywinsor2, pdtab, pyoutreg

# Each module maintains its independence and full functionality
```

#### Method 2: Direct Function Import (Convenience)
```python
from pystatar import rank, rowmean, winsor2, tabulate, outreg

# Direct access to key functions
```

### `pdtab` - Advanced Cross-tabulation

The `pdtab` module provides comprehensive frequency analysis and cross-tabulation capabilities.

#### Basic Usage Examples
```python
import pandas as pd
import numpy as np
from pystatar import pdtab

# Create sample dataset
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,
    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,
    'income_level': np.random.choice(['Low', 'Medium', 'High'], 600),
    'age': np.random.randint(22, 65, 600),
    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)
})

# One-way frequency table
result = pdtab.tab1('education', df)
print(result)

# Two-way cross-tabulation
result = pdtab.tab2('gender', 'education', df)
print(result)

# Using convenience function
result = pdtab.tabulate('gender', 'education', df)
print(result)
```
### `pyegen` - Extended Data Generation

The `pyegen` module provides powerful data manipulation functions that extend Stata's egen capabilities.

#### Ranking and Statistical Functions
```python
from pystatar import pyegen

# Create test data
df = pd.DataFrame({
    'income': np.random.normal(50000, 15000, 1000),
    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare'], 1000),
    'experience': np.random.randint(0, 30, 1000)
})

# Basic ranking functions
df['income_rank'] = pyegen.rank(df['income'])
df['income_rank_by_industry'] = pyegen.rank(df['income'], by=df['industry'])

# Group statistics
df['mean_income_by_industry'] = pyegen.mean(df['income'], by=df['industry'])
df['industry_count'] = pyegen.count(df, by='industry')

# Row operations (for multiple variables)
scores_df = pd.DataFrame({
    'math': np.random.normal(75, 10, 100),
    'english': np.random.normal(80, 12, 100),
    'science': np.random.normal(78, 11, 100)
})

scores_df['total_score'] = pyegen.rowtotal(scores_df, ['math', 'english', 'science'])
scores_df['avg_score'] = pyegen.rowmean(scores_df, ['math', 'english', 'science'])
scores_df['max_score'] = pyegen.rowmax(scores_df, ['math', 'english', 'science'])
```
```python
# Create test scores dataset
scores_df = pd.DataFrame({
    'student': range(1, 101),
    'math': np.random.normal(75, 10, 100),
    'english': np.random.normal(80, 12, 100),
    'science': np.random.normal(78, 11, 100),
    'history': np.random.normal(82, 9, 100)
})

# Row statistics
scores_df['total_score'] = egen.rowtotal(scores_df, ['math', 'english', 'science', 'history'])
scores_df['avg_score'] = egen.rowmean(scores_df, ['math', 'english', 'science', 'history'])
scores_df['min_score'] = egen.rowmin(scores_df, ['math', 'english', 'science', 'history'])
### `pywinsor2` - Advanced Outlier Treatment

The `pywinsor2` module provides comprehensive outlier detection and treatment methods.

#### Basic Winsorizing
```python
from pystatar import pywinsor2

# Create dataset with outliers
outlier_df = pd.DataFrame({
    'income': np.concatenate([
        np.random.normal(50000, 10000, 950),  # Normal observations
        np.random.uniform(200000, 500000, 50)  # Outliers
    ]),
    'age': np.random.randint(18, 70, 1000),
    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)
})

# Basic winsorizing at 1st and 99th percentiles
result = pywinsor2.winsor2(outlier_df, ['income'])
print("Original vs Winsorized:")
print(f"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}")
print(f"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}")

# Group-wise winsorizing
result = pywinsor2.winsor2(
    outlier_df, 
    ['income'],
    by='industry',          # Winsorize within each industry
    cuts=(5, 95),          # Use 5th and 95th percentiles
    suffix='_clean'        # Custom suffix
)

# Trimming vs Winsorizing
trim_result = pywinsor2.winsor2(
    outlier_df, 
    ['income'],
    trim=True,              # Trim (remove) instead of winsorize
    cuts=(2.5, 97.5)       # Trim 2.5% from each tail
)

print(f"Original N: {len(outlier_df)}")
print(f"After trimming N: {trim_result['income_tr'].notna().sum()}")
```
    'log_employment': np.random.normal(4, 0.5, n_obs),
    'log_capital': np.random.normal(8, 0.8, n_obs),
    'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)
})

### `winsor2` - Advanced Outlier Treatment

The `winsor2` module provides comprehensive outlier detection and treatment methods.

#### Basic Winsorizing
```python
from pystatar import winsor2

# Create dataset with outliers
outlier_df = pd.DataFrame({
    'income': np.concatenate([
        np.random.normal(50000, 10000, 950),  # Normal observations
        np.random.uniform(200000, 500000, 50)  # Outliers
    ]),
    'age': np.random.randint(18, 70, 1000),
    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)
})

# Basic winsorizing at 1st and 99th percentiles
result = winsor2.winsor2(outlier_df, ['income'])
print("Original vs Winsorized:")
print(f"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}")
print(f"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}")
```

#### Group-wise Winsorizing
```python
# Winsorize within groups
result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    by='industry',          # Winsorize within each industry
    cuts=(5, 95),          # Use 5th and 95th percentiles
    suffix='_clean'        # Custom suffix
)

# Compare distributions by group
for industry in outlier_df['industry'].unique():
    mask = outlier_df['industry'] == industry
    original = outlier_df.loc[mask, 'income']
    winsorized = result.loc[mask, 'income_clean']
    print(f"\n{industry}:")
    print(f"  Original: {original.describe()}")
    print(f"  Winsorized: {winsorized.describe()}")
```

#### Trimming vs Winsorizing Comparison
```python
# Compare different outlier treatment methods
trim_result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    trim=True,              # Trim (remove) instead of winsorize
    cuts=(2.5, 97.5)       # Trim 2.5% from each tail
)

winsor_result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    trim=False,             # Winsorize (cap) outliers
    cuts=(2.5, 97.5)
)

print("Treatment Comparison:")
print(f"Original N: {len(outlier_df)}")
print(f"After trimming N: {trim_result['income_tr'].notna().sum()}")
print(f"After winsorizing N: {len(winsor_result)}")
print(f"Trimmed mean: {trim_result['income_tr'].mean():.0f}")
print(f"Winsorized mean: {winsor_result['income_w'].mean():.0f}")
```

#### Advanced Outlier Detection
```python
# Multiple variable winsorizing with custom thresholds
multi_result = winsor2.winsor2(
    outlier_df,
    ['income', 'age'],
    cuts=(1, 99),           # Different cuts for different variables
    by='industry',          # Group-specific treatment
    replace=True,           # Replace original variables
    label=True              # Add descriptive labels
)

# Generate outlier indicators
outlier_df['income_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'], 
    method='iqr',           # Use IQR method
    factor=1.5              # 1.5 * IQR threshold
)

outlier_df['extreme_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'],
    method='percentile',    # Use percentile method
    cuts=(1, 99)
)

print("Outlier Detection Results:")
print(f"IQR method detected {outlier_df['income_outlier'].sum()} outliers")
print(f"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers")
```

### `pyoutreg` - Professional Regression Output Tables

The `pyoutreg` module provides Stata's `outreg2` equivalent functionality for exporting regression results to publication-quality tables.

#### Basic Regression Output
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from pystatar import pyoutreg

# Create sample dataset
np.random.seed(42)
n = 1000
data = pd.DataFrame({
    'y': np.random.normal(50, 10, n),
    'x1': np.random.normal(0, 1, n),
    'x2': np.random.normal(0, 1, n),
    'x3': np.random.normal(0, 1, n),
    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare'], n)
})

# Add some realistic relationships
data['y'] = 50 + 3*data['x1'] + 2*data['x2'] + np.random.normal(0, 5, n)

# Run regression
X = sm.add_constant(data[['x1', 'x2', 'x3']])
model = sm.OLS(data['y'], X).fit()

# Export to Excel (Stata outreg2 equivalent)
pyoutreg.outreg(model, 'regression_results.xlsx', replace=True, 
                ctitle='Model 1', title='My Research Results')
print("Regression results exported to Excel!")
```

#### Multiple Model Comparison
```python
# Compare multiple models (like Stata's outreg2 append)
model1 = sm.OLS(data['y'], sm.add_constant(data[['x1']])).fit()
model2 = sm.OLS(data['y'], sm.add_constant(data[['x1', 'x2']])).fit()
model3 = sm.OLS(data['y'], sm.add_constant(data[['x1', 'x2', 'x3']])).fit()

# Export multiple models to same file
pyoutreg.outreg(model1, 'comparison.xlsx', replace=True, ctitle='Model 1')
pyoutreg.outreg(model2, 'comparison.xlsx', append=True, ctitle='Model 2')
pyoutreg.outreg(model3, 'comparison.xlsx', append=True, ctitle='Model 3')

# Or use the comparison function
pyoutreg.outreg_compare([model1, model2, model3], 
                       'model_comparison.xlsx',
                       model_names=['Basic', 'Extended', 'Full Model'])
```

#### Summary Statistics Export
```python
# Export summary statistics (Stata's outreg2 sum)
pyoutreg.outreg(data=data[['y', 'x1', 'x2', 'x3']], 
                filename='summary_stats.xlsx',
                sum_stats=True, 
                replace=True,
                title='Descriptive Statistics')

# Grouped summary statistics
pyoutreg.outreg(data=data, 
                filename='summary_by_industry.xlsx',
                sum_stats=True,
                by='industry',
                replace=True,
                title='Statistics by Industry')
```

#### Advanced Output Formatting
```python
# Customize output format
pyoutreg.outreg(model, 'formatted_results.xlsx',
                replace=True,
                dec=3,                    # 3 decimal places
                bdec=4,                   # 4 decimal places for coefficients
                keep=['x1', 'x2'],        # Only show x1 and x2
                title='Publication Table',
                addnote='Robust standard errors in parentheses',
                font_size=12,
                font_name='Arial')

# Export to Word document
pyoutreg.outreg(model, 'results.docx',
                replace=True,
                landscape=True,           # Landscape orientation
                title='Research Results')
```

## Project Structure

```
pystatar/
โ”œโ”€โ”€ __init__.py              # Main package with unified interface to:
โ”‚                           #   - pyegen (v0.2.4+)
โ”‚                           #   - pywinsor2 (v0.4.3+)
โ”‚                           #   - pdtab (v0.1.1+)
โ”‚                           #   - pyoutreg (v0.1.1+)
โ””โ”€โ”€ tests/                  # Integration tests
    โ”œโ”€โ”€ test_basic.py       # Basic integration tests
    โ”œโ”€โ”€ test_egen.py        # pyegen functionality tests
    โ”œโ”€โ”€ test_pdtab.py       # pdtab functionality tests
    โ”œโ”€โ”€ test_winsor2.py     # pywinsor2 functionality tests
    โ””โ”€โ”€ test_outreg.py      # pyoutreg functionality tests
```

### Why This Architecture?

- **No Code Duplication**: We don't reinvent the wheel - we use proven packages
- **Easier Maintenance**: Updates and bug fixes come from the original package maintainers
- **Better Reliability**: Built on packages with existing user bases and testing
- **Unified Interface**: One import gives you access to all functionality
- **Future-Proof**: Automatically benefits from improvements in underlying packages

## Key Features

- **Familiar Syntax**: Stata-like command structure and parameters
- **Unified Interface**: Access four powerful modules (pdtab, pyegen, pywinsor2, pyoutreg) through a single package
- **Namespace Design**: Maintains module independence while providing integrated functionality
- **Pandas Integration**: Seamless integration with pandas DataFrames
- **High Performance**: Optimized implementations using pandas and NumPy
- **Comprehensive Coverage**: Cross-tabulation, data generation, outlier treatment, and regression output functions
- **Statistical Rigor**: Proper statistical tests and robust calculations
- **Flexible Output**: Multiple output formats (Excel, Word, DataFrame) and customization options
- **Missing Value Handling**: Configurable treatment of missing data
- **Publication Ready**: Professional table formatting for academic papers and reports

## Documentation

Each module comes with comprehensive documentation and examples:

- [**pdtab Documentation**](docs/pdtab.md) - Cross-tabulation and contingency table analysis
- [**pyegen Documentation**](docs/pyegen.md) - Extended data generation functions
- [**pywinsor2 Documentation**](docs/pywinsor2.md) - Data winsorizing and outlier treatment
- [**pyoutreg Documentation**](docs/pyoutreg.md) - Professional regression output tables

## Contributing to the Project

We're building the future of academic research tools in Python! Here's how you can help:

### Priority Commands Needed
Help us implement the remaining **16 high-priority commands**:

**Data Management**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep`, `drop`, `generate`, `replace`, `sort`

**Statistical Analysis**: `reg`, `logit`, `probit`, `ivregress`, `xtreg`, `anova`

### How to Contribute

1. **Request a Command**: [Open an issue](https://github.com/brycewang-stanford/PyStataR/issues/new) with the command you need
2. **Implement a Command**: Check our [contribution guidelines](CONTRIBUTING.md) and submit a PR
3. **Report Bugs**: Help us improve existing functionality
4. **Improve Documentation**: Add examples, tutorials, or clarifications
5. **Spread the Word**: Star the repo and share with fellow researchers

###  Recognition
All contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.

###  Academic Collaboration
We welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!

## Community & Support

- **Documentation**: [https://pystatar.readthedocs.io](docs/)
- **Discussions**: [GitHub Discussions](https://github.com/brycewang-stanford/PyStataR/discussions)
- **Issues**: [Bug Reports & Feature Requests](https://github.com/brycewang-stanford/PyStataR/issues)
- **Email**: brycew6m@stanford.edu for academic collaborations

## Comparison with Stata

| Feature | Stata | PyStataR | Advantage |
|---------|-------|-------------------|-----------|
| **Speed** | Base performance | 2-10x faster* | Vectorized operations |
| **Memory** | Limited by system | Efficient pandas backend | Better large dataset handling |
| **Extensibility** | Ado files | Python ecosystem | Unlimited customization |
| **Cost** | $$$$ | Free & Open Source | Accessible to all researchers |
| **Integration** | Standalone | Python data science stack | Seamless workflow |
| **Output** | Limited formats | Multiple (LaTeX, HTML, etc.) | Publication ready |

*Performance comparison based on typical academic datasets (1M+ observations)

## ๐Ÿ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## ๐Ÿ™ Acknowledgments

This package builds upon the excellent work of:
- [pandas](https://pandas.pydata.org/) - The backbone of our data manipulation
- [numpy](https://numpy.org/) - Powering our numerical computations
- [scipy](https://scipy.org/) - Statistical functions and algorithms
- [statsmodels](https://www.statsmodels.org/) - Statistical modeling foundations
- [pyhdfe](https://github.com/jeffgortmaker/pyhdfe) - High-dimensional fixed effects algorithms
- The entire **Stata community** - For decades of statistical innovation that inspired this project

##  Future Roadmap

### Version 1.0 Goals (Target: End of 2025)
-  Core 4 commands implemented
-  Additional 16 high-priority commands
-  Comprehensive test suite (>95% coverage)
-  Complete documentation with tutorials
-  Performance benchmarks vs Stata

### Version 2.0 Vision (2026)
-  Machine learning integration
-  R integration for cross-platform compatibility
-  Web interface for non-programmers
-  Jupyter notebook extensions

## ๐Ÿ“ˆ Project Statistics

[![GitHub stars](https://img.shields.io/github/stars/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/network)
[![GitHub issues](https://img.shields.io/github/issues/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/issues)
[![GitHub pull requests](https://img.shields.io/github/issues-pr/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/pulls)

##  Contact & Collaboration

**Created by [Bryce Wang](https://github.com/brycewang-stanford)** - Stanford University

-  **Email**: brycew6m@stanford.edu  
-  **GitHub**: [@brycewang-stanford](https://github.com/brycewang-stanford)
-  **LinkedIn**: [Connect with me](https://linkedin.com/in/brycewang)

### Academic Partnerships Welcome!
-  Course integration and teaching materials
-  Research collaborations and citations
-  Institutional licensing and support
-  Student contributor programs

---

### โญ **Love this project? Give it a star and help us reach more researchers!** โญ

**Together, we're building the future of academic research in Python** 

### Disclaimer
The PyStataR tool is not affiliated with, endorsed by, or in any way associated with Stata or StataCorp LLC.
โ€œStataโ€ is a registered trademark of StataCorp LLC. Any mention of it in this project is solely for academic reference and comparative functionality purposes.
This tool is independently developed by the author and does not copy or reuse any part of the Stata source code. It is inspired by the design of Stata's analytical features to support similar workflows in Python.
For any trademark or copyright concerns, please contact the author for resolution.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pystatar",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Bryce Wang <brycew6m@stanford.edu>",
    "keywords": "stata, pandas, econometrics, statistics, data-analysis, tabulate, egen, winsor, cross-tabulation, pyegen, pywinsor2, pdtab, data-manipulation, winsorizing, frequency-analysis, outreg, pyoutreg, regression-tables, outreg2, model-output, research-tools",
    "author": null,
    "author_email": "Bryce Wang <brycew6m@stanford.edu>",
    "download_url": "https://files.pythonhosted.org/packages/3c/26/6964fc98d8091f50f9dd1cacd69e093a0ff77d3d7b817a39bc21941127c0/pystatar-0.4.0.tar.gz",
    "platform": null,
    "description": "# PyStataR## \ud83c\udd95 What's New in v0.4.0\n\n\u2728 **New Integration**: Added pyoutreg for professional regression output tables (Stata's `outreg2` equivalent)  \n\ud83d\udcca **Enhanced Functionality**: Comprehensive regression result export to Excel/Word with publication-quality formatting  \n\ud83d\udd27 **Four-Package Integration**: Now includes pyegen, pywinsor2, pdtab, and pyoutreg under unified interface  \n\ud83d\udcda **Extended Documentation**: Complete examples for regression output and model comparison  \n\ud83d\ude80 **Research-Ready**: End-to-end workflow from data processing to publication tablesython Version](https://img.shields.io/pypi/pyversions/pystatar)](https://pypi.org/project/pystatar/)\n[![PyPI Version](https://img.shields.io/pypi/v/pystatar)](https://pypi.org/project/pystatar/)\n[![License](https://img.shields.io/pypi/l/pystatar)](https://github.com/brycewang-stanford/PyStataR/blob/main/LICENSE)\n[![Downloads](https://img.shields.io/pypi/dm/pystatar)](https://pypi.org/project/pystatar/)\n\n> **The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python**  \n> **\u96c6\u6210 Stata \u548c R \u8bed\u8a00\u7684\u6700\u9ad8\u9891\u4f7f\u7528\u5de5\u5177\uff0c\u8ba9\u793e\u79d1\u5b66\u672f\u548c\u7edf\u8ba1\u7814\u7a76\uff0c\u5168\u9762\u62e5\u62b1 Python/AI/\u5f00\u6e90\u793e\u533a**\n\n## What's New in v0.3.0\n\n**Enhanced Architecture**: Improved unified interface with better error handling and documentation  \n**Cleaner Codebase**: Removed duplicate code and streamlined module structure  \n**Better Documentation**: Enhanced examples and clearer API documentation  \n**Performance**: Optimized imports and reduced overhead for faster loading  \n\n## Project Vision & Goals\n\n**PyStataR** serves as a unified interface to the most powerful and frequently used Stata-equivalent packages in Python. Instead of reinventing the wheel, we provide seamless integration of four mature PyPI packages under one convenient interface.\n\n- **Seamless Integration**: Four proven PyPI packages unified under one interface\n- **Familiar Workflow**: Stata-like syntax and functionality for Python users  \n- **Academic Focus**: Built specifically for research and statistical analysis needs\n- **Open Source**: Free and accessible to all researchers worldwide\n- **No Reinvention**: Leverages existing, mature packages rather than duplicating functionality\n\n\n### Why This Project Matters\n- **Bridge the Gap**: Seamless transition from Stata to Python for researchers\n- **Unified Interface**: One package, multiple powerful tools - no need to learn different APIs\n- **Mature Foundation**: Built on battle-tested PyPI packages with years of development\n- **Community-Driven**: Open source development with academic researchers in mind\n- **No Maintenance Overhead**: Leverages existing packages rather than maintaining duplicate code\n\n### Target Stata Commands (The Most Used in Academic Research)\n\u2705 **pyegen** - Extended data generation and manipulation (Stata's `egen`)  \n\u2705 **pywinsor2** - Data winsorizing and trimming (Stata's `winsor2`)  \n\u2705 **pdtab** - Cross-tabulation and frequency analysis (Stata's `tabulate`)  \n\u2705 **pyoutreg** - Professional regression output tables (Stata's `outreg2`)\n\n**Based on mature PyPI packages**:\n- [pyegen](https://pypi.org/project/pyegen/) - version 0.2.4+\n- [pywinsor2](https://pypi.org/project/pywinsor2/) - version 0.4.3+  \n- [pdtab](https://pypi.org/project/pdtab/) - version 0.1.1+\n- [pyoutreg](https://pypi.org/project/pyoutreg/) - version 0.1.1+\n\n**Want to contribute or request features?** \n-  [Create an issue](https://github.com/brycewang-stanford/PyStataR/issues) to request functionality\n-  [Contribute](CONTRIBUTING.md) to help us improve the package\n- \u2b50 Star this repo to show your support!\n---- \n## Core Modules Overview\n### **pyegen** - Extended Data Generation and Manipulation  \n- **Built on**: [pyegen v0.2.4](https://pypi.org/project/pyegen/) PyPI package\n- **Key Features**: Group operations, ranking with tie-breaking, row statistics, percentile calculations\n- **Use Cases**: Data preprocessing, feature engineering, panel data construction\n\n### **pdtab** - Advanced Cross-tabulation and Frequency Analysis\n- **Built on**: [pdtab v0.1.1](https://pypi.org/project/pdtab/) PyPI package  \n- **Key Features**: One-way and two-way tables, statistical tests, comprehensive output formatting\n- **Use Cases**: Survey analysis, categorical data exploration, market research\n\n### **pywinsor2** - Advanced Outlier Detection and Treatment\n- **Built on**: [pywinsor2 v0.4.3](https://pypi.org/project/pywinsor2/) PyPI package\n- **Key Features**: IQR-based detection, percentile methods, group-wise operations, flexible trimming\n- **Use Cases**: Data cleaning, outlier analysis, robust statistical modeling\n\n### **pyoutreg** - Professional Regression Output Tables  \n- **Built on**: [pyoutreg v0.1.1](https://pypi.org/project/pyoutreg/) PyPI package\n- **Key Features**: Stata `outreg2` equivalent, Excel/Word export, model comparison, publication-quality formatting\n- **Use Cases**: Academic papers, research reports, model comparison tables, publication workflows\n\n## Advanced Features & Performance\n\n### Performance Optimizations\n- **Vectorized Operations**: All functions leverage NumPy and pandas for maximum speed\n- **Memory Efficiency**: Optimized for large datasets common in academic research\n- **Proven Reliability**: Built on four mature PyPI packages with extensive testing\n- **Modular Design**: Use individual modules independently or together\n\n### Research-Grade Features\n- **Publication Ready**: Clean output formatting suitable for academic papers\n- **Reproducible Research**: Consistent results and comprehensive documentation\n- **Missing Data Handling**: Robust missing value treatment across all modules\n- **Academic Standards**: Follows statistical best practices and conventions\n\n## Quick Installation\n\n```bash\npip install pystatar\n```\n\n## Comprehensive Usage Examples\n\n### Two Ways to Use PyStataR\n\n#### Method 1: Module-based Import (Recommended)\n```python\nfrom pystatar import pyegen, pywinsor2, pdtab, pyoutreg\n\n# Each module maintains its independence and full functionality\n```\n\n#### Method 2: Direct Function Import (Convenience)\n```python\nfrom pystatar import rank, rowmean, winsor2, tabulate, outreg\n\n# Direct access to key functions\n```\n\n### `pdtab` - Advanced Cross-tabulation\n\nThe `pdtab` module provides comprehensive frequency analysis and cross-tabulation capabilities.\n\n#### Basic Usage Examples\n```python\nimport pandas as pd\nimport numpy as np\nfrom pystatar import pdtab\n\n# Create sample dataset\ndf = pd.DataFrame({\n    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,\n    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,\n    'income_level': np.random.choice(['Low', 'Medium', 'High'], 600),\n    'age': np.random.randint(22, 65, 600),\n    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)\n})\n\n# One-way frequency table\nresult = pdtab.tab1('education', df)\nprint(result)\n\n# Two-way cross-tabulation\nresult = pdtab.tab2('gender', 'education', df)\nprint(result)\n\n# Using convenience function\nresult = pdtab.tabulate('gender', 'education', df)\nprint(result)\n```\n### `pyegen` - Extended Data Generation\n\nThe `pyegen` module provides powerful data manipulation functions that extend Stata's egen capabilities.\n\n#### Ranking and Statistical Functions\n```python\nfrom pystatar import pyegen\n\n# Create test data\ndf = pd.DataFrame({\n    'income': np.random.normal(50000, 15000, 1000),\n    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare'], 1000),\n    'experience': np.random.randint(0, 30, 1000)\n})\n\n# Basic ranking functions\ndf['income_rank'] = pyegen.rank(df['income'])\ndf['income_rank_by_industry'] = pyegen.rank(df['income'], by=df['industry'])\n\n# Group statistics\ndf['mean_income_by_industry'] = pyegen.mean(df['income'], by=df['industry'])\ndf['industry_count'] = pyegen.count(df, by='industry')\n\n# Row operations (for multiple variables)\nscores_df = pd.DataFrame({\n    'math': np.random.normal(75, 10, 100),\n    'english': np.random.normal(80, 12, 100),\n    'science': np.random.normal(78, 11, 100)\n})\n\nscores_df['total_score'] = pyegen.rowtotal(scores_df, ['math', 'english', 'science'])\nscores_df['avg_score'] = pyegen.rowmean(scores_df, ['math', 'english', 'science'])\nscores_df['max_score'] = pyegen.rowmax(scores_df, ['math', 'english', 'science'])\n```\n```python\n# Create test scores dataset\nscores_df = pd.DataFrame({\n    'student': range(1, 101),\n    'math': np.random.normal(75, 10, 100),\n    'english': np.random.normal(80, 12, 100),\n    'science': np.random.normal(78, 11, 100),\n    'history': np.random.normal(82, 9, 100)\n})\n\n# Row statistics\nscores_df['total_score'] = egen.rowtotal(scores_df, ['math', 'english', 'science', 'history'])\nscores_df['avg_score'] = egen.rowmean(scores_df, ['math', 'english', 'science', 'history'])\nscores_df['min_score'] = egen.rowmin(scores_df, ['math', 'english', 'science', 'history'])\n### `pywinsor2` - Advanced Outlier Treatment\n\nThe `pywinsor2` module provides comprehensive outlier detection and treatment methods.\n\n#### Basic Winsorizing\n```python\nfrom pystatar import pywinsor2\n\n# Create dataset with outliers\noutlier_df = pd.DataFrame({\n    'income': np.concatenate([\n        np.random.normal(50000, 10000, 950),  # Normal observations\n        np.random.uniform(200000, 500000, 50)  # Outliers\n    ]),\n    'age': np.random.randint(18, 70, 1000),\n    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)\n})\n\n# Basic winsorizing at 1st and 99th percentiles\nresult = pywinsor2.winsor2(outlier_df, ['income'])\nprint(\"Original vs Winsorized:\")\nprint(f\"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}\")\nprint(f\"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}\")\n\n# Group-wise winsorizing\nresult = pywinsor2.winsor2(\n    outlier_df, \n    ['income'],\n    by='industry',          # Winsorize within each industry\n    cuts=(5, 95),          # Use 5th and 95th percentiles\n    suffix='_clean'        # Custom suffix\n)\n\n# Trimming vs Winsorizing\ntrim_result = pywinsor2.winsor2(\n    outlier_df, \n    ['income'],\n    trim=True,              # Trim (remove) instead of winsorize\n    cuts=(2.5, 97.5)       # Trim 2.5% from each tail\n)\n\nprint(f\"Original N: {len(outlier_df)}\")\nprint(f\"After trimming N: {trim_result['income_tr'].notna().sum()}\")\n```\n    'log_employment': np.random.normal(4, 0.5, n_obs),\n    'log_capital': np.random.normal(8, 0.8, n_obs),\n    'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)\n})\n\n### `winsor2` - Advanced Outlier Treatment\n\nThe `winsor2` module provides comprehensive outlier detection and treatment methods.\n\n#### Basic Winsorizing\n```python\nfrom pystatar import winsor2\n\n# Create dataset with outliers\noutlier_df = pd.DataFrame({\n    'income': np.concatenate([\n        np.random.normal(50000, 10000, 950),  # Normal observations\n        np.random.uniform(200000, 500000, 50)  # Outliers\n    ]),\n    'age': np.random.randint(18, 70, 1000),\n    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)\n})\n\n# Basic winsorizing at 1st and 99th percentiles\nresult = winsor2.winsor2(outlier_df, ['income'])\nprint(\"Original vs Winsorized:\")\nprint(f\"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}\")\nprint(f\"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}\")\n```\n\n#### Group-wise Winsorizing\n```python\n# Winsorize within groups\nresult = winsor2.winsor2(\n    outlier_df, \n    ['income'],\n    by='industry',          # Winsorize within each industry\n    cuts=(5, 95),          # Use 5th and 95th percentiles\n    suffix='_clean'        # Custom suffix\n)\n\n# Compare distributions by group\nfor industry in outlier_df['industry'].unique():\n    mask = outlier_df['industry'] == industry\n    original = outlier_df.loc[mask, 'income']\n    winsorized = result.loc[mask, 'income_clean']\n    print(f\"\\n{industry}:\")\n    print(f\"  Original: {original.describe()}\")\n    print(f\"  Winsorized: {winsorized.describe()}\")\n```\n\n#### Trimming vs Winsorizing Comparison\n```python\n# Compare different outlier treatment methods\ntrim_result = winsor2.winsor2(\n    outlier_df, \n    ['income'],\n    trim=True,              # Trim (remove) instead of winsorize\n    cuts=(2.5, 97.5)       # Trim 2.5% from each tail\n)\n\nwinsor_result = winsor2.winsor2(\n    outlier_df, \n    ['income'],\n    trim=False,             # Winsorize (cap) outliers\n    cuts=(2.5, 97.5)\n)\n\nprint(\"Treatment Comparison:\")\nprint(f\"Original N: {len(outlier_df)}\")\nprint(f\"After trimming N: {trim_result['income_tr'].notna().sum()}\")\nprint(f\"After winsorizing N: {len(winsor_result)}\")\nprint(f\"Trimmed mean: {trim_result['income_tr'].mean():.0f}\")\nprint(f\"Winsorized mean: {winsor_result['income_w'].mean():.0f}\")\n```\n\n#### Advanced Outlier Detection\n```python\n# Multiple variable winsorizing with custom thresholds\nmulti_result = winsor2.winsor2(\n    outlier_df,\n    ['income', 'age'],\n    cuts=(1, 99),           # Different cuts for different variables\n    by='industry',          # Group-specific treatment\n    replace=True,           # Replace original variables\n    label=True              # Add descriptive labels\n)\n\n# Generate outlier indicators\noutlier_df['income_outlier'] = winsor2.outlier_indicator(\n    outlier_df['income'], \n    method='iqr',           # Use IQR method\n    factor=1.5              # 1.5 * IQR threshold\n)\n\noutlier_df['extreme_outlier'] = winsor2.outlier_indicator(\n    outlier_df['income'],\n    method='percentile',    # Use percentile method\n    cuts=(1, 99)\n)\n\nprint(\"Outlier Detection Results:\")\nprint(f\"IQR method detected {outlier_df['income_outlier'].sum()} outliers\")\nprint(f\"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers\")\n```\n\n### `pyoutreg` - Professional Regression Output Tables\n\nThe `pyoutreg` module provides Stata's `outreg2` equivalent functionality for exporting regression results to publication-quality tables.\n\n#### Basic Regression Output\n```python\nimport pandas as pd\nimport numpy as np\nimport statsmodels.api as sm\nfrom pystatar import pyoutreg\n\n# Create sample dataset\nnp.random.seed(42)\nn = 1000\ndata = pd.DataFrame({\n    'y': np.random.normal(50, 10, n),\n    'x1': np.random.normal(0, 1, n),\n    'x2': np.random.normal(0, 1, n),\n    'x3': np.random.normal(0, 1, n),\n    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare'], n)\n})\n\n# Add some realistic relationships\ndata['y'] = 50 + 3*data['x1'] + 2*data['x2'] + np.random.normal(0, 5, n)\n\n# Run regression\nX = sm.add_constant(data[['x1', 'x2', 'x3']])\nmodel = sm.OLS(data['y'], X).fit()\n\n# Export to Excel (Stata outreg2 equivalent)\npyoutreg.outreg(model, 'regression_results.xlsx', replace=True, \n                ctitle='Model 1', title='My Research Results')\nprint(\"Regression results exported to Excel!\")\n```\n\n#### Multiple Model Comparison\n```python\n# Compare multiple models (like Stata's outreg2 append)\nmodel1 = sm.OLS(data['y'], sm.add_constant(data[['x1']])).fit()\nmodel2 = sm.OLS(data['y'], sm.add_constant(data[['x1', 'x2']])).fit()\nmodel3 = sm.OLS(data['y'], sm.add_constant(data[['x1', 'x2', 'x3']])).fit()\n\n# Export multiple models to same file\npyoutreg.outreg(model1, 'comparison.xlsx', replace=True, ctitle='Model 1')\npyoutreg.outreg(model2, 'comparison.xlsx', append=True, ctitle='Model 2')\npyoutreg.outreg(model3, 'comparison.xlsx', append=True, ctitle='Model 3')\n\n# Or use the comparison function\npyoutreg.outreg_compare([model1, model2, model3], \n                       'model_comparison.xlsx',\n                       model_names=['Basic', 'Extended', 'Full Model'])\n```\n\n#### Summary Statistics Export\n```python\n# Export summary statistics (Stata's outreg2 sum)\npyoutreg.outreg(data=data[['y', 'x1', 'x2', 'x3']], \n                filename='summary_stats.xlsx',\n                sum_stats=True, \n                replace=True,\n                title='Descriptive Statistics')\n\n# Grouped summary statistics\npyoutreg.outreg(data=data, \n                filename='summary_by_industry.xlsx',\n                sum_stats=True,\n                by='industry',\n                replace=True,\n                title='Statistics by Industry')\n```\n\n#### Advanced Output Formatting\n```python\n# Customize output format\npyoutreg.outreg(model, 'formatted_results.xlsx',\n                replace=True,\n                dec=3,                    # 3 decimal places\n                bdec=4,                   # 4 decimal places for coefficients\n                keep=['x1', 'x2'],        # Only show x1 and x2\n                title='Publication Table',\n                addnote='Robust standard errors in parentheses',\n                font_size=12,\n                font_name='Arial')\n\n# Export to Word document\npyoutreg.outreg(model, 'results.docx',\n                replace=True,\n                landscape=True,           # Landscape orientation\n                title='Research Results')\n```\n\n## Project Structure\n\n```\npystatar/\n\u251c\u2500\u2500 __init__.py              # Main package with unified interface to:\n\u2502                           #   - pyegen (v0.2.4+)\n\u2502                           #   - pywinsor2 (v0.4.3+)\n\u2502                           #   - pdtab (v0.1.1+)\n\u2502                           #   - pyoutreg (v0.1.1+)\n\u2514\u2500\u2500 tests/                  # Integration tests\n    \u251c\u2500\u2500 test_basic.py       # Basic integration tests\n    \u251c\u2500\u2500 test_egen.py        # pyegen functionality tests\n    \u251c\u2500\u2500 test_pdtab.py       # pdtab functionality tests\n    \u251c\u2500\u2500 test_winsor2.py     # pywinsor2 functionality tests\n    \u2514\u2500\u2500 test_outreg.py      # pyoutreg functionality tests\n```\n\n### Why This Architecture?\n\n- **No Code Duplication**: We don't reinvent the wheel - we use proven packages\n- **Easier Maintenance**: Updates and bug fixes come from the original package maintainers\n- **Better Reliability**: Built on packages with existing user bases and testing\n- **Unified Interface**: One import gives you access to all functionality\n- **Future-Proof**: Automatically benefits from improvements in underlying packages\n\n## Key Features\n\n- **Familiar Syntax**: Stata-like command structure and parameters\n- **Unified Interface**: Access four powerful modules (pdtab, pyegen, pywinsor2, pyoutreg) through a single package\n- **Namespace Design**: Maintains module independence while providing integrated functionality\n- **Pandas Integration**: Seamless integration with pandas DataFrames\n- **High Performance**: Optimized implementations using pandas and NumPy\n- **Comprehensive Coverage**: Cross-tabulation, data generation, outlier treatment, and regression output functions\n- **Statistical Rigor**: Proper statistical tests and robust calculations\n- **Flexible Output**: Multiple output formats (Excel, Word, DataFrame) and customization options\n- **Missing Value Handling**: Configurable treatment of missing data\n- **Publication Ready**: Professional table formatting for academic papers and reports\n\n## Documentation\n\nEach module comes with comprehensive documentation and examples:\n\n- [**pdtab Documentation**](docs/pdtab.md) - Cross-tabulation and contingency table analysis\n- [**pyegen Documentation**](docs/pyegen.md) - Extended data generation functions\n- [**pywinsor2 Documentation**](docs/pywinsor2.md) - Data winsorizing and outlier treatment\n- [**pyoutreg Documentation**](docs/pyoutreg.md) - Professional regression output tables\n\n## Contributing to the Project\n\nWe're building the future of academic research tools in Python! Here's how you can help:\n\n### Priority Commands Needed\nHelp us implement the remaining **16 high-priority commands**:\n\n**Data Management**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep`, `drop`, `generate`, `replace`, `sort`\n\n**Statistical Analysis**: `reg`, `logit`, `probit`, `ivregress`, `xtreg`, `anova`\n\n### How to Contribute\n\n1. **Request a Command**: [Open an issue](https://github.com/brycewang-stanford/PyStataR/issues/new) with the command you need\n2. **Implement a Command**: Check our [contribution guidelines](CONTRIBUTING.md) and submit a PR\n3. **Report Bugs**: Help us improve existing functionality\n4. **Improve Documentation**: Add examples, tutorials, or clarifications\n5. **Spread the Word**: Star the repo and share with fellow researchers\n\n###  Recognition\nAll contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.\n\n###  Academic Collaboration\nWe welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!\n\n## Community & Support\n\n- **Documentation**: [https://pystatar.readthedocs.io](docs/)\n- **Discussions**: [GitHub Discussions](https://github.com/brycewang-stanford/PyStataR/discussions)\n- **Issues**: [Bug Reports & Feature Requests](https://github.com/brycewang-stanford/PyStataR/issues)\n- **Email**: brycew6m@stanford.edu for academic collaborations\n\n## Comparison with Stata\n\n| Feature | Stata | PyStataR | Advantage |\n|---------|-------|-------------------|-----------|\n| **Speed** | Base performance | 2-10x faster* | Vectorized operations |\n| **Memory** | Limited by system | Efficient pandas backend | Better large dataset handling |\n| **Extensibility** | Ado files | Python ecosystem | Unlimited customization |\n| **Cost** | $$$$ | Free & Open Source | Accessible to all researchers |\n| **Integration** | Standalone | Python data science stack | Seamless workflow |\n| **Output** | Limited formats | Multiple (LaTeX, HTML, etc.) | Publication ready |\n\n*Performance comparison based on typical academic datasets (1M+ observations)\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\ude4f Acknowledgments\n\nThis package builds upon the excellent work of:\n- [pandas](https://pandas.pydata.org/) - The backbone of our data manipulation\n- [numpy](https://numpy.org/) - Powering our numerical computations\n- [scipy](https://scipy.org/) - Statistical functions and algorithms\n- [statsmodels](https://www.statsmodels.org/) - Statistical modeling foundations\n- [pyhdfe](https://github.com/jeffgortmaker/pyhdfe) - High-dimensional fixed effects algorithms\n- The entire **Stata community** - For decades of statistical innovation that inspired this project\n\n##  Future Roadmap\n\n### Version 1.0 Goals (Target: End of 2025)\n-  Core 4 commands implemented\n-  Additional 16 high-priority commands\n-  Comprehensive test suite (>95% coverage)\n-  Complete documentation with tutorials\n-  Performance benchmarks vs Stata\n\n### Version 2.0 Vision (2026)\n-  Machine learning integration\n-  R integration for cross-platform compatibility\n-  Web interface for non-programmers\n-  Jupyter notebook extensions\n\n## \ud83d\udcc8 Project Statistics\n\n[![GitHub stars](https://img.shields.io/github/stars/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/stargazers)\n[![GitHub forks](https://img.shields.io/github/forks/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/network)\n[![GitHub issues](https://img.shields.io/github/issues/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/issues)\n[![GitHub pull requests](https://img.shields.io/github/issues-pr/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/pulls)\n\n##  Contact & Collaboration\n\n**Created by [Bryce Wang](https://github.com/brycewang-stanford)** - Stanford University\n\n-  **Email**: brycew6m@stanford.edu  \n-  **GitHub**: [@brycewang-stanford](https://github.com/brycewang-stanford)\n-  **LinkedIn**: [Connect with me](https://linkedin.com/in/brycewang)\n\n### Academic Partnerships Welcome!\n-  Course integration and teaching materials\n-  Research collaborations and citations\n-  Institutional licensing and support\n-  Student contributor programs\n\n---\n\n### \u2b50 **Love this project? Give it a star and help us reach more researchers!** \u2b50\n\n**Together, we're building the future of academic research in Python** \n\n### Disclaimer\nThe PyStataR tool is not affiliated with, endorsed by, or in any way associated with Stata or StataCorp LLC.\n\u201cStata\u201d is a registered trademark of StataCorp LLC. Any mention of it in this project is solely for academic reference and comparative functionality purposes.\nThis tool is independently developed by the author and does not copy or reuse any part of the Stata source code. It is inspired by the design of Stata's analytical features to support similar workflows in Python.\nFor any trademark or copyright concerns, please contact the author for resolution.\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "PyStataR aims to recreate and significantly enhance the top and most frequently used Stata commands in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to expand and improve upon it, leveraging Python's ecosystem to create superior research tools.",
    "version": "0.4.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/brycewang-stanford/PyStataR/issues",
        "Documentation": "https://github.com/brycewang-stanford/PyStataR/docs",
        "Homepage": "https://github.com/brycewang-stanford/PyStataR",
        "Repository": "https://github.com/brycewang-stanford/PyStataR"
    },
    "split_keywords": [
        "stata",
        " pandas",
        " econometrics",
        " statistics",
        " data-analysis",
        " tabulate",
        " egen",
        " winsor",
        " cross-tabulation",
        " pyegen",
        " pywinsor2",
        " pdtab",
        " data-manipulation",
        " winsorizing",
        " frequency-analysis",
        " outreg",
        " pyoutreg",
        " regression-tables",
        " outreg2",
        " model-output",
        " research-tools"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9a15bbaed3b614c67dd8d251524cdeb3cf55b0983a37552af9e441c481631dfe",
                "md5": "ce820917036fc9743b991dc1ababe360",
                "sha256": "811d8e7ec45cd35cb7a2914e9005e5b314bec2ec9e208bb5ccbfe33ee00a1bb3"
            },
            "downloads": -1,
            "filename": "pystatar-0.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ce820917036fc9743b991dc1ababe360",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 12543,
            "upload_time": "2025-08-02T00:19:37",
            "upload_time_iso_8601": "2025-08-02T00:19:37.913946Z",
            "url": "https://files.pythonhosted.org/packages/9a/15/bbaed3b614c67dd8d251524cdeb3cf55b0983a37552af9e441c481631dfe/pystatar-0.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3c266964fc98d8091f50f9dd1cacd69e093a0ff77d3d7b817a39bc21941127c0",
                "md5": "92256d0ab5b5866f3fb78ff39fa10f9b",
                "sha256": "69d1fc2a638b05e5f716eb656bae535c22f903e06d76739f295c4c7336eeb4a2"
            },
            "downloads": -1,
            "filename": "pystatar-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "92256d0ab5b5866f3fb78ff39fa10f9b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 25264,
            "upload_time": "2025-08-02T00:19:39",
            "upload_time_iso_8601": "2025-08-02T00:19:39.253461Z",
            "url": "https://files.pythonhosted.org/packages/3c/26/6964fc98d8091f50f9dd1cacd69e093a0ff77d3d7b817a39bc21941127c0/pystatar-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-02 00:19:39",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "brycewang-stanford",
    "github_project": "PyStataR",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pystatar"
}
        
Elapsed time: 4.11835s