pystatar


Namepystatar JSON
Version 0.0.0 PyPI version JSON
download
home_pageNone
SummaryComprehensive Python package providing Stata-equivalent commands for pandas DataFrames
upload_time2025-07-27 08:19:18
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseNone
keywords stata pandas econometrics statistics data-analysis tabulate egen reghdfe winsor cross-tabulation fixed-effects panel-data regression
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # PyStataR

[![Python Version](https://img.shields.io/pypi/pyversions/pystatar)](https://pypi.org/project/pystatar/)
[![PyPI Version](https://img.shields.io/pypi/v/pystatar)](https://pypi.org/project/pystatar/)
[![License](https://img.shields.io/pypi/l/pystatar)](https://github.com/brycewang-stanford/PyStataR/blob/main/LICENSE)
[![Downloads](https://img.shields.io/pypi/dm/pystatar)](https://pypi.org/project/pystatar/)

> **The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python** πŸš€

## Project Vision & Goals

**PyStataR** aims to recreate and significantly enhance **the top 20 most frequently used Stata commands** in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to **expand and improve** upon it, leveraging Python's ecosystem to create superior research tools.

### Why This Project Matters
- **Bridge the Gap**: Seamless transition from Stata to Python for researchers
- **Enhanced Functionality**: Each command will be significantly expanded beyond Stata's original capabilities
- **Modern Research Tools**: Built for today's data science and research needs
- **Community-Driven**: Open source development with academic researchers in mind

### Target Commands (20 Most Used in Academic Research)
βœ… **tabulate** - Cross-tabulation and frequency analysis  
βœ… **egen** - Extended data generation and manipulation  
βœ… **reghdfe** - High-dimensional fixed effects regression  
βœ… **winsor2** - Data winsorizing and trimming  
πŸ”„ **Coming Soon**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep/drop`, `generate`, `replace`, `sort`, `by`, `if/in`, `reg`, `logit`, `probit`, `ivregress`, `xtreg`

**Want to see a specific command implemented?** 
-  [Create an issue](https://github.com/brycewang-stanford/PyStataR/issues) to request a command
-  [Contribute](CONTRIBUTING.md) to help us complete this project faster
- ⭐ Star this repo to show your support!

## Core Modules Overview

### **tabulate** - Advanced Cross-tabulation and Frequency Analysis
- **Beyond Stata**: Enhanced statistical tests, multi-dimensional tables, and publication-ready output
- **Key Features**: Chi-square tests, Fisher's exact test, CramΓ©r's V, Kendall's tau, gamma coefficients
- **Use Cases**: Survey analysis, categorical data exploration, market research

### **egen** - Extended Data Generation and Manipulation  
- **Beyond Stata**: Advanced ranking algorithms, robust statistical functions, and vectorized operations
- **Key Features**: Group operations, ranking with tie-breaking, row statistics, percentile calculations
- **Use Cases**: Data preprocessing, feature engineering, panel data construction

### **reghdfe** - High-Dimensional Fixed Effects Regression
- **Beyond Stata**: Memory-efficient algorithms, advanced clustering options, and diagnostic tools
- **Key Features**: Multiple fixed effects, clustered standard errors, instrumental variables, robust diagnostics
- **Use Cases**: Panel data analysis, causal inference, economic research

### **winsor2** - Advanced Outlier Detection and Treatment
- **Beyond Stata**: Multiple detection methods, group-specific treatment, and comprehensive diagnostics
- **Key Features**: IQR-based detection, percentile methods, group-wise operations, flexible trimming
- **Use Cases**: Data cleaning, outlier analysis, robust statistical modeling

## Advanced Features & Performance

### Performance Optimizations
- **Vectorized Operations**: All functions leverage NumPy and pandas for maximum speed
- **Memory Efficiency**: Optimized for large datasets common in academic research
- **Parallel Processing**: Multi-core support for computationally intensive operations
- **Lazy Evaluation**: Smart caching and delayed computation when beneficial

### Research-Grade Features
- **Publication Ready**: LaTeX and HTML output for academic papers
- **Reproducible Research**: Comprehensive logging and version tracking
- **Missing Data Handling**: Multiple imputation and robust missing value treatment
- **Bootstrapping**: Built-in bootstrap methods for confidence intervals
- **Cross-Validation**: Integrated CV methods for model validation

## Quick Installation

```bash
pip install pystatar
```

## Comprehensive Usage Examples

### `tabulate` - Advanced Cross-tabulation

The `tabulate` module provides comprehensive frequency analysis and cross-tabulation capabilities, extending far beyond Stata's original functionality.

#### Basic One-way Tabulation
```python
import pandas as pd
```python
from pystatar import tabulate

# Create sample dataset
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,
    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,
    'income': np.random.normal(50000, 15000, 600),
    'age': np.random.randint(22, 65, 600),
    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)
})

# Simple frequency table
result = tabulate.tabulate(df, 'education')
print(result)
```

#### Advanced Two-way Cross-tabulation with Statistics
```python
# Two-way tabulation with comprehensive statistics
result = tabulate.tabulate(
    df, 
    'gender', 'education',
    chi2=True,           # Chi-square test
    exact=True,          # Fisher's exact test
    gamma=True,          # Gamma coefficient
    taub=True,           # Kendall's tau-b
    V=True,              # CramΓ©r's V
    missing=True,        # Include missing values
    row=True,            # Row percentages
    col=True,            # Column percentages
    cell=True            # Cell percentages
)

# Access different components
print("Frequency Table:")
print(result.table)
print(f"\nChi-square p-value: {result.chi2_pvalue:.4f}")
print(f"CramΓ©r's V: {result.cramers_v:.4f}")
```

#### Multi-way Tabulation
```python
# Three-way tabulation with layering
result = tabulate.tabulate(
    df, 
    'gender', 'education',
    by='industry',       # Layer by industry
    chi2=True
)

# Access results by layer
for industry, table_result in result.by_results.items():
    print(f"\n=== {industry} ===")
    print(table_result.table)
```

### `egen` - Extended Data Generation

The `egen` module provides powerful data manipulation functions that extend Stata's egen capabilities.

#### Ranking and Percentile Functions
```python
from pystatar import egen

# Advanced ranking with tie-breaking options
df['income_rank'] = egen.rank(df['income'], method='average')  # Handle ties
df['income_pctile'] = egen.xtile(df['income'], nquantiles=10)  # Deciles

# Group-specific rankings
df['rank_within_industry'] = egen.rank(df['income'], by='industry')

# Percentile calculations
df['income_90th'] = egen.pctile(df['income'], 90)
df['income_iqr'] = egen.pctile(df['income'], 75) - egen.pctile(df['income'], 25)
```

#### Row Operations
```python
# Create test scores dataset
scores_df = pd.DataFrame({
    'student': range(1, 101),
    'math': np.random.normal(75, 10, 100),
    'english': np.random.normal(80, 12, 100),
    'science': np.random.normal(78, 11, 100),
    'history': np.random.normal(82, 9, 100)
})

# Row statistics
scores_df['total_score'] = egen.rowtotal(scores_df, ['math', 'english', 'science', 'history'])
scores_df['avg_score'] = egen.rowmean(scores_df, ['math', 'english', 'science', 'history'])
scores_df['min_score'] = egen.rowmin(scores_df, ['math', 'english', 'science', 'history'])
scores_df['max_score'] = egen.rowmax(scores_df, ['math', 'english', 'science', 'history'])
scores_df['score_sd'] = egen.rowsd(scores_df, ['math', 'english', 'science', 'history'])

# Count non-missing values per row
scores_df['subjects_taken'] = egen.rownonmiss(scores_df, ['math', 'english', 'science', 'history'])
```

#### Group Statistics and Operations
```python
# Group summary statistics
df['mean_income_by_education'] = egen.mean(df['income'], by='education')
df['median_income_by_industry'] = egen.median(df['income'], by='industry')
df['sd_income_by_gender'] = egen.sd(df['income'], by='gender')

# Group identification and counting
df['education_group_size'] = egen.count(df, by='education')
df['first_in_group'] = egen.tag(df, ['education', 'gender'])  # First observation in group
df['group_sequence'] = egen.seq(df, by='education')          # Sequence within group

# Advanced group operations
df['income_rank_in_education'] = egen.rank(df['income'], by='education')
df['above_group_median'] = (df['income'] > egen.median(df['income'], by='education')).astype(int)
```

### `reghdfe` - Advanced Fixed Effects Regression

The `reghdfe` module provides state-of-the-art estimation for linear models with high-dimensional fixed effects.

#### Basic Fixed Effects Regression
```python
from pystatar import reghdfe

# Create panel dataset
np.random.seed(42)
n_firms, n_years = 100, 10
n_obs = n_firms * n_years

panel_df = pd.DataFrame({
    'firm_id': np.repeat(range(n_firms), n_years),
    'year': np.tile(range(2010, 2020), n_firms),
    'log_sales': np.random.normal(10, 1, n_obs),
    'log_employment': np.random.normal(4, 0.5, n_obs),
    'log_capital': np.random.normal(8, 0.8, n_obs),
    'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)
})

# Basic regression with firm and year fixed effects
result = reghdfe.reghdfe(
    data=panel_df,
    depvar='log_sales',
    regressors=['log_employment', 'log_capital'],
    absorb=['firm_id', 'year']
)

print(result.summary())
print(f"R-squared: {result.r2:.4f}")
print(f"Number of observations: {result.N}")
```

#### Advanced Regression with Clustering and Instruments
```python
# Add instrumental variables
panel_df['instrument1'] = np.random.normal(0, 1, n_obs)
panel_df['instrument2'] = np.random.normal(0, 1, n_obs)

# Regression with clustering and multiple fixed effects
result = reghdfe.reghdfe(
    data=panel_df,
    depvar='log_sales',
    regressors=['log_employment', 'log_capital'],
    absorb=['firm_id', 'year', 'industry'],  # Multiple fixed effects
    cluster='firm_id',                        # Clustered standard errors
    weights='employment',                     # Weighted regression
    subset=(panel_df['year'] >= 2012)        # Conditional estimation
)

# Access detailed results
print("Coefficient Table:")
print(result.coef_table)
print(f"\nFixed Effects absorbed: {result.absorbed_fe}")
print(f"Clusters: {result.n_clusters}")
```

#### Instrumental Variables with High-Dimensional FE
```python
# IV regression with fixed effects
iv_result = reghdfe.ivreghdfe(
    data=panel_df,
    depvar='log_sales',
    endogenous=['log_employment'],           # Endogenous variable
    instruments=['instrument1', 'instrument2'],  # Instruments
    exogenous=['log_capital'],               # Exogenous controls
    absorb=['firm_id', 'year'],
    cluster='firm_id'
)

print("First Stage Results:")
print(iv_result.first_stage)
print(f"\nWeak instruments test (F-stat): {iv_result.first_stage_fstat:.2f}")
print(f"Overidentification test (Hansen J): {iv_result.hansen_j:.4f}")
```

### `winsor2` - Advanced Outlier Treatment

The `winsor2` module provides comprehensive outlier detection and treatment methods.

#### Basic Winsorizing
```python
from pystatar import winsor2

# Create dataset with outliers
outlier_df = pd.DataFrame({
    'income': np.concatenate([
        np.random.normal(50000, 10000, 950),  # Normal observations
        np.random.uniform(200000, 500000, 50)  # Outliers
    ]),
    'age': np.random.randint(18, 70, 1000),
    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)
})

# Basic winsorizing at 1st and 99th percentiles
result = winsor2.winsor2(outlier_df, ['income'])
print("Original vs Winsorized:")
print(f"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}")
print(f"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}")
```

#### Group-wise Winsorizing
```python
# Winsorize within groups
result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    by='industry',          # Winsorize within each industry
    cuts=(5, 95),          # Use 5th and 95th percentiles
    suffix='_clean'        # Custom suffix
)

# Compare distributions by group
for industry in outlier_df['industry'].unique():
    mask = outlier_df['industry'] == industry
    original = outlier_df.loc[mask, 'income']
    winsorized = result.loc[mask, 'income_clean']
    print(f"\n{industry}:")
    print(f"  Original: {original.describe()}")
    print(f"  Winsorized: {winsorized.describe()}")
```

#### Trimming vs Winsorizing Comparison
```python
# Compare different outlier treatment methods
trim_result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    trim=True,              # Trim (remove) instead of winsorize
    cuts=(2.5, 97.5)       # Trim 2.5% from each tail
)

winsor_result = winsor2.winsor2(
    outlier_df, 
    ['income'],
    trim=False,             # Winsorize (cap) outliers
    cuts=(2.5, 97.5)
)

print("Treatment Comparison:")
print(f"Original N: {len(outlier_df)}")
print(f"After trimming N: {trim_result['income_tr'].notna().sum()}")
print(f"After winsorizing N: {len(winsor_result)}")
print(f"Trimmed mean: {trim_result['income_tr'].mean():.0f}")
print(f"Winsorized mean: {winsor_result['income_w'].mean():.0f}")
```

#### Advanced Outlier Detection
```python
# Multiple variable winsorizing with custom thresholds
multi_result = winsor2.winsor2(
    outlier_df,
    ['income', 'age'],
    cuts=(1, 99),           # Different cuts for different variables
    by='industry',          # Group-specific treatment
    replace=True,           # Replace original variables
    label=True              # Add descriptive labels
)

# Generate outlier indicators
outlier_df['income_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'], 
    method='iqr',           # Use IQR method
    factor=1.5              # 1.5 * IQR threshold
)

outlier_df['extreme_outlier'] = winsor2.outlier_indicator(
    outlier_df['income'],
    method='percentile',    # Use percentile method
    cuts=(1, 99)
)

print("Outlier Detection Results:")
print(f"IQR method detected {outlier_df['income_outlier'].sum()} outliers")
print(f"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers")
```

## Project Structure

```
pystatar/
β”œβ”€β”€ __init__.py              # Main package initialization
β”œβ”€β”€ tabulate/               # Cross-tabulation module
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ core.py
β”‚   β”œβ”€β”€ results.py
β”‚   └── stats.py
β”œβ”€β”€ egen/                   # Extended generation module
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── core.py
β”œβ”€β”€ reghdfe/               # High-dimensional FE regression
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ core.py
β”‚   β”œβ”€β”€ estimation.py
β”‚   └── utils.py
β”œβ”€β”€ winsor2/               # Winsorizing module
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ core.py
β”‚   └── utils.py
β”œβ”€β”€ utils/                 # Shared utilities
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── common.py
└── tests/                 # Test suite
    β”œβ”€β”€ test_tabulate.py
    β”œβ”€β”€ test_egen.py
    β”œβ”€β”€ test_reghdfe.py
    └── test_winsor2.py
```

## Key Features

- **Familiar Syntax**: Stata-like command structure and parameters
- **Pandas Integration**: Seamless integration with pandas DataFrames
- **High Performance**: Optimized implementations using pandas and NumPy
- **Comprehensive Coverage**: Most commonly used Stata commands
- **Statistical Rigor**: Proper statistical tests and robust standard errors
- **Flexible Output**: Multiple output formats and customization options
- **Missing Value Handling**: Configurable treatment of missing data

## Documentation

Each module comes with comprehensive documentation and examples:

- [**tabulate Documentation**](docs/tabulate.md) - Cross-tabulation and frequency analysis
- [**egen Documentation**](docs/egen.md) - Extended data generation functions
- [**reghdfe Documentation**](docs/reghdfe.md) - High-dimensional fixed effects regression  
- [**winsor2 Documentation**](docs/winsor2.md) - Data winsorizing and trimming

## Contributing to the Project

We're building the future of academic research tools in Python! Here's how you can help:

### Priority Commands Needed
Help us implement the remaining **16 high-priority commands**:

**Data Management**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep`, `drop`, `generate`, `replace`, `sort`

**Statistical Analysis**: `reg`, `logit`, `probit`, `ivregress`, `xtreg`, `anova`

### How to Contribute

1. **Request a Command**: [Open an issue](https://github.com/brycewang-stanford/PyStataR/issues/new) with the command you need
2. ** Implement a Command**: Check our [contribution guidelines](CONTRIBUTING.md) and submit a PR
3. ** Report Bugs**: Help us improve existing functionality
4. ** Improve Documentation**: Add examples, tutorials, or clarifications
5. ** Spread the Word**: Star the repo and share with fellow researchers

###  Recognition
All contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.

###  Academic Collaboration
We welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!

## Community & Support

- **Documentation**: [https://pystatar.readthedocs.io](docs/)
- **Discussions**: [GitHub Discussions](https://github.com/brycewang-stanford/PyStataR/discussions)
- **Issues**: [Bug Reports & Feature Requests](https://github.com/brycewang-stanford/PyStataR/issues)
- ** Email**: brycew6m@stanford.edu for academic collaborations

## Comparison with Stata

| Feature | Stata | PyStataR | Advantage |
|---------|-------|-------------------|-----------|
| **Speed** | Base performance | 2-10x faster* | Vectorized operations |
| **Memory** | Limited by system | Efficient pandas backend | Better large dataset handling |
| **Extensibility** | Ado files | Python ecosystem | Unlimited customization |
| **Cost** | $$$$ | Free & Open Source | Accessible to all researchers |
| **Integration** | Standalone | Python data science stack | Seamless workflow |
| **Output** | Limited formats | Multiple (LaTeX, HTML, etc.) | Publication ready |

*Performance comparison based on typical academic datasets (1M+ observations)

## πŸ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## πŸ™ Acknowledgments

This package builds upon the excellent work of:
- [pandas](https://pandas.pydata.org/) - The backbone of our data manipulation
- [numpy](https://numpy.org/) - Powering our numerical computations
- [scipy](https://scipy.org/) - Statistical functions and algorithms
- [statsmodels](https://www.statsmodels.org/) - Statistical modeling foundations
- [pyhdfe](https://github.com/jeffgortmaker/pyhdfe) - High-dimensional fixed effects algorithms
- The entire **Stata community** - For decades of statistical innovation that inspired this project

##  Future Roadmap

### Version 1.0 Goals (Target: End of 2025)
-  Core 4 commands implemented
-  Additional 16 high-priority commands
-  Comprehensive test suite (>95% coverage)
-  Complete documentation with tutorials
-  Performance benchmarks vs Stata

### Version 2.0 Vision (2026)
-  Machine learning integration
-  R integration for cross-platform compatibility
-  Web interface for non-programmers
-  Jupyter notebook extensions

## πŸ“ˆ Project Statistics

[![GitHub stars](https://img.shields.io/github/stars/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/network)
[![GitHub issues](https://img.shields.io/github/issues/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/issues)
[![GitHub pull requests](https://img.shields.io/github/issues-pr/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/pulls)

##  Contact & Collaboration

**Created by [Bryce Wang](https://github.com/brycewang-stanford)** - Stanford University

-  **Email**: brycew6m@stanford.edu  
-  **GitHub**: [@brycewang-stanford](https://github.com/brycewang-stanford)
-  **Academic**: Stanford Graduate School of Business
-  **LinkedIn**: [Connect with me](https://linkedin.com/in/brycewang)

### Academic Partnerships Welcome!
-  Course integration and teaching materials
-  Research collaborations and citations
-  Institutional licensing and support
-  Student contributor programs

---

### ⭐ **Love this project? Give it a star and help us reach more researchers!** ⭐

**Together, we're building the future of academic research in Python** 

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pystatar",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Bryce Wang <brycew6m@stanford.edu>",
    "keywords": "stata, pandas, econometrics, statistics, data-analysis, tabulate, egen, reghdfe, winsor, cross-tabulation, fixed-effects, panel-data, regression",
    "author": null,
    "author_email": "Bryce Wang <brycew6m@stanford.edu>",
    "download_url": "https://files.pythonhosted.org/packages/cd/c0/4b4dfb7aa9282bbe97bd02b1d7b29600146188fad3fe56919b30f8283fc9/pystatar-0.0.0.tar.gz",
    "platform": null,
    "description": "# PyStataR\n\n[![Python Version](https://img.shields.io/pypi/pyversions/pystatar)](https://pypi.org/project/pystatar/)\n[![PyPI Version](https://img.shields.io/pypi/v/pystatar)](https://pypi.org/project/pystatar/)\n[![License](https://img.shields.io/pypi/l/pystatar)](https://github.com/brycewang-stanford/PyStataR/blob/main/LICENSE)\n[![Downloads](https://img.shields.io/pypi/dm/pystatar)](https://pypi.org/project/pystatar/)\n\n> **The Ultimate Python Toolkit for Academic Research - Bringing Stata & R's Power to Python** \ud83d\ude80\n\n## Project Vision & Goals\n\n**PyStataR** aims to recreate and significantly enhance **the top 20 most frequently used Stata commands** in Python, transforming them into the most powerful and user-friendly statistical tools for academic research. Our goal is to not just replicate Stata's functionality, but to **expand and improve** upon it, leveraging Python's ecosystem to create superior research tools.\n\n### Why This Project Matters\n- **Bridge the Gap**: Seamless transition from Stata to Python for researchers\n- **Enhanced Functionality**: Each command will be significantly expanded beyond Stata's original capabilities\n- **Modern Research Tools**: Built for today's data science and research needs\n- **Community-Driven**: Open source development with academic researchers in mind\n\n### Target Commands (20 Most Used in Academic Research)\n\u2705 **tabulate** - Cross-tabulation and frequency analysis  \n\u2705 **egen** - Extended data generation and manipulation  \n\u2705 **reghdfe** - High-dimensional fixed effects regression  \n\u2705 **winsor2** - Data winsorizing and trimming  \n\ud83d\udd04 **Coming Soon**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep/drop`, `generate`, `replace`, `sort`, `by`, `if/in`, `reg`, `logit`, `probit`, `ivregress`, `xtreg`\n\n**Want to see a specific command implemented?** \n-  [Create an issue](https://github.com/brycewang-stanford/PyStataR/issues) to request a command\n-  [Contribute](CONTRIBUTING.md) to help us complete this project faster\n- \u2b50 Star this repo to show your support!\n\n## Core Modules Overview\n\n### **tabulate** - Advanced Cross-tabulation and Frequency Analysis\n- **Beyond Stata**: Enhanced statistical tests, multi-dimensional tables, and publication-ready output\n- **Key Features**: Chi-square tests, Fisher's exact test, Cram\u00e9r's V, Kendall's tau, gamma coefficients\n- **Use Cases**: Survey analysis, categorical data exploration, market research\n\n### **egen** - Extended Data Generation and Manipulation  \n- **Beyond Stata**: Advanced ranking algorithms, robust statistical functions, and vectorized operations\n- **Key Features**: Group operations, ranking with tie-breaking, row statistics, percentile calculations\n- **Use Cases**: Data preprocessing, feature engineering, panel data construction\n\n### **reghdfe** - High-Dimensional Fixed Effects Regression\n- **Beyond Stata**: Memory-efficient algorithms, advanced clustering options, and diagnostic tools\n- **Key Features**: Multiple fixed effects, clustered standard errors, instrumental variables, robust diagnostics\n- **Use Cases**: Panel data analysis, causal inference, economic research\n\n### **winsor2** - Advanced Outlier Detection and Treatment\n- **Beyond Stata**: Multiple detection methods, group-specific treatment, and comprehensive diagnostics\n- **Key Features**: IQR-based detection, percentile methods, group-wise operations, flexible trimming\n- **Use Cases**: Data cleaning, outlier analysis, robust statistical modeling\n\n## Advanced Features & Performance\n\n### Performance Optimizations\n- **Vectorized Operations**: All functions leverage NumPy and pandas for maximum speed\n- **Memory Efficiency**: Optimized for large datasets common in academic research\n- **Parallel Processing**: Multi-core support for computationally intensive operations\n- **Lazy Evaluation**: Smart caching and delayed computation when beneficial\n\n### Research-Grade Features\n- **Publication Ready**: LaTeX and HTML output for academic papers\n- **Reproducible Research**: Comprehensive logging and version tracking\n- **Missing Data Handling**: Multiple imputation and robust missing value treatment\n- **Bootstrapping**: Built-in bootstrap methods for confidence intervals\n- **Cross-Validation**: Integrated CV methods for model validation\n\n## Quick Installation\n\n```bash\npip install pystatar\n```\n\n## Comprehensive Usage Examples\n\n### `tabulate` - Advanced Cross-tabulation\n\nThe `tabulate` module provides comprehensive frequency analysis and cross-tabulation capabilities, extending far beyond Stata's original functionality.\n\n#### Basic One-way Tabulation\n```python\nimport pandas as pd\n```python\nfrom pystatar import tabulate\n\n# Create sample dataset\ndf = pd.DataFrame({\n    'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'] * 100,\n    'education': ['High School', 'College', 'Graduate', 'High School', 'College', 'Graduate'] * 100,\n    'income': np.random.normal(50000, 15000, 600),\n    'age': np.random.randint(22, 65, 600),\n    'industry': np.random.choice(['Tech', 'Finance', 'Healthcare', 'Education'], 600)\n})\n\n# Simple frequency table\nresult = tabulate.tabulate(df, 'education')\nprint(result)\n```\n\n#### Advanced Two-way Cross-tabulation with Statistics\n```python\n# Two-way tabulation with comprehensive statistics\nresult = tabulate.tabulate(\n    df, \n    'gender', 'education',\n    chi2=True,           # Chi-square test\n    exact=True,          # Fisher's exact test\n    gamma=True,          # Gamma coefficient\n    taub=True,           # Kendall's tau-b\n    V=True,              # Cram\u00e9r's V\n    missing=True,        # Include missing values\n    row=True,            # Row percentages\n    col=True,            # Column percentages\n    cell=True            # Cell percentages\n)\n\n# Access different components\nprint(\"Frequency Table:\")\nprint(result.table)\nprint(f\"\\nChi-square p-value: {result.chi2_pvalue:.4f}\")\nprint(f\"Cram\u00e9r's V: {result.cramers_v:.4f}\")\n```\n\n#### Multi-way Tabulation\n```python\n# Three-way tabulation with layering\nresult = tabulate.tabulate(\n    df, \n    'gender', 'education',\n    by='industry',       # Layer by industry\n    chi2=True\n)\n\n# Access results by layer\nfor industry, table_result in result.by_results.items():\n    print(f\"\\n=== {industry} ===\")\n    print(table_result.table)\n```\n\n### `egen` - Extended Data Generation\n\nThe `egen` module provides powerful data manipulation functions that extend Stata's egen capabilities.\n\n#### Ranking and Percentile Functions\n```python\nfrom pystatar import egen\n\n# Advanced ranking with tie-breaking options\ndf['income_rank'] = egen.rank(df['income'], method='average')  # Handle ties\ndf['income_pctile'] = egen.xtile(df['income'], nquantiles=10)  # Deciles\n\n# Group-specific rankings\ndf['rank_within_industry'] = egen.rank(df['income'], by='industry')\n\n# Percentile calculations\ndf['income_90th'] = egen.pctile(df['income'], 90)\ndf['income_iqr'] = egen.pctile(df['income'], 75) - egen.pctile(df['income'], 25)\n```\n\n#### Row Operations\n```python\n# Create test scores dataset\nscores_df = pd.DataFrame({\n    'student': range(1, 101),\n    'math': np.random.normal(75, 10, 100),\n    'english': np.random.normal(80, 12, 100),\n    'science': np.random.normal(78, 11, 100),\n    'history': np.random.normal(82, 9, 100)\n})\n\n# Row statistics\nscores_df['total_score'] = egen.rowtotal(scores_df, ['math', 'english', 'science', 'history'])\nscores_df['avg_score'] = egen.rowmean(scores_df, ['math', 'english', 'science', 'history'])\nscores_df['min_score'] = egen.rowmin(scores_df, ['math', 'english', 'science', 'history'])\nscores_df['max_score'] = egen.rowmax(scores_df, ['math', 'english', 'science', 'history'])\nscores_df['score_sd'] = egen.rowsd(scores_df, ['math', 'english', 'science', 'history'])\n\n# Count non-missing values per row\nscores_df['subjects_taken'] = egen.rownonmiss(scores_df, ['math', 'english', 'science', 'history'])\n```\n\n#### Group Statistics and Operations\n```python\n# Group summary statistics\ndf['mean_income_by_education'] = egen.mean(df['income'], by='education')\ndf['median_income_by_industry'] = egen.median(df['income'], by='industry')\ndf['sd_income_by_gender'] = egen.sd(df['income'], by='gender')\n\n# Group identification and counting\ndf['education_group_size'] = egen.count(df, by='education')\ndf['first_in_group'] = egen.tag(df, ['education', 'gender'])  # First observation in group\ndf['group_sequence'] = egen.seq(df, by='education')          # Sequence within group\n\n# Advanced group operations\ndf['income_rank_in_education'] = egen.rank(df['income'], by='education')\ndf['above_group_median'] = (df['income'] > egen.median(df['income'], by='education')).astype(int)\n```\n\n### `reghdfe` - Advanced Fixed Effects Regression\n\nThe `reghdfe` module provides state-of-the-art estimation for linear models with high-dimensional fixed effects.\n\n#### Basic Fixed Effects Regression\n```python\nfrom pystatar import reghdfe\n\n# Create panel dataset\nnp.random.seed(42)\nn_firms, n_years = 100, 10\nn_obs = n_firms * n_years\n\npanel_df = pd.DataFrame({\n    'firm_id': np.repeat(range(n_firms), n_years),\n    'year': np.tile(range(2010, 2020), n_firms),\n    'log_sales': np.random.normal(10, 1, n_obs),\n    'log_employment': np.random.normal(4, 0.5, n_obs),\n    'log_capital': np.random.normal(8, 0.8, n_obs),\n    'industry': np.repeat(np.random.choice(['Tech', 'Manufacturing', 'Services'], n_firms), n_years)\n})\n\n# Basic regression with firm and year fixed effects\nresult = reghdfe.reghdfe(\n    data=panel_df,\n    depvar='log_sales',\n    regressors=['log_employment', 'log_capital'],\n    absorb=['firm_id', 'year']\n)\n\nprint(result.summary())\nprint(f\"R-squared: {result.r2:.4f}\")\nprint(f\"Number of observations: {result.N}\")\n```\n\n#### Advanced Regression with Clustering and Instruments\n```python\n# Add instrumental variables\npanel_df['instrument1'] = np.random.normal(0, 1, n_obs)\npanel_df['instrument2'] = np.random.normal(0, 1, n_obs)\n\n# Regression with clustering and multiple fixed effects\nresult = reghdfe.reghdfe(\n    data=panel_df,\n    depvar='log_sales',\n    regressors=['log_employment', 'log_capital'],\n    absorb=['firm_id', 'year', 'industry'],  # Multiple fixed effects\n    cluster='firm_id',                        # Clustered standard errors\n    weights='employment',                     # Weighted regression\n    subset=(panel_df['year'] >= 2012)        # Conditional estimation\n)\n\n# Access detailed results\nprint(\"Coefficient Table:\")\nprint(result.coef_table)\nprint(f\"\\nFixed Effects absorbed: {result.absorbed_fe}\")\nprint(f\"Clusters: {result.n_clusters}\")\n```\n\n#### Instrumental Variables with High-Dimensional FE\n```python\n# IV regression with fixed effects\niv_result = reghdfe.ivreghdfe(\n    data=panel_df,\n    depvar='log_sales',\n    endogenous=['log_employment'],           # Endogenous variable\n    instruments=['instrument1', 'instrument2'],  # Instruments\n    exogenous=['log_capital'],               # Exogenous controls\n    absorb=['firm_id', 'year'],\n    cluster='firm_id'\n)\n\nprint(\"First Stage Results:\")\nprint(iv_result.first_stage)\nprint(f\"\\nWeak instruments test (F-stat): {iv_result.first_stage_fstat:.2f}\")\nprint(f\"Overidentification test (Hansen J): {iv_result.hansen_j:.4f}\")\n```\n\n### `winsor2` - Advanced Outlier Treatment\n\nThe `winsor2` module provides comprehensive outlier detection and treatment methods.\n\n#### Basic Winsorizing\n```python\nfrom pystatar import winsor2\n\n# Create dataset with outliers\noutlier_df = pd.DataFrame({\n    'income': np.concatenate([\n        np.random.normal(50000, 10000, 950),  # Normal observations\n        np.random.uniform(200000, 500000, 50)  # Outliers\n    ]),\n    'age': np.random.randint(18, 70, 1000),\n    'industry': np.random.choice(['Tech', 'Finance', 'Retail', 'Healthcare'], 1000)\n})\n\n# Basic winsorizing at 1st and 99th percentiles\nresult = winsor2.winsor2(outlier_df, ['income'])\nprint(\"Original vs Winsorized:\")\nprint(f\"Original: min={outlier_df['income'].min():.0f}, max={outlier_df['income'].max():.0f}\")\nprint(f\"Winsorized: min={result['income_w'].min():.0f}, max={result['income_w'].max():.0f}\")\n```\n\n#### Group-wise Winsorizing\n```python\n# Winsorize within groups\nresult = winsor2.winsor2(\n    outlier_df, \n    ['income'],\n    by='industry',          # Winsorize within each industry\n    cuts=(5, 95),          # Use 5th and 95th percentiles\n    suffix='_clean'        # Custom suffix\n)\n\n# Compare distributions by group\nfor industry in outlier_df['industry'].unique():\n    mask = outlier_df['industry'] == industry\n    original = outlier_df.loc[mask, 'income']\n    winsorized = result.loc[mask, 'income_clean']\n    print(f\"\\n{industry}:\")\n    print(f\"  Original: {original.describe()}\")\n    print(f\"  Winsorized: {winsorized.describe()}\")\n```\n\n#### Trimming vs Winsorizing Comparison\n```python\n# Compare different outlier treatment methods\ntrim_result = winsor2.winsor2(\n    outlier_df, \n    ['income'],\n    trim=True,              # Trim (remove) instead of winsorize\n    cuts=(2.5, 97.5)       # Trim 2.5% from each tail\n)\n\nwinsor_result = winsor2.winsor2(\n    outlier_df, \n    ['income'],\n    trim=False,             # Winsorize (cap) outliers\n    cuts=(2.5, 97.5)\n)\n\nprint(\"Treatment Comparison:\")\nprint(f\"Original N: {len(outlier_df)}\")\nprint(f\"After trimming N: {trim_result['income_tr'].notna().sum()}\")\nprint(f\"After winsorizing N: {len(winsor_result)}\")\nprint(f\"Trimmed mean: {trim_result['income_tr'].mean():.0f}\")\nprint(f\"Winsorized mean: {winsor_result['income_w'].mean():.0f}\")\n```\n\n#### Advanced Outlier Detection\n```python\n# Multiple variable winsorizing with custom thresholds\nmulti_result = winsor2.winsor2(\n    outlier_df,\n    ['income', 'age'],\n    cuts=(1, 99),           # Different cuts for different variables\n    by='industry',          # Group-specific treatment\n    replace=True,           # Replace original variables\n    label=True              # Add descriptive labels\n)\n\n# Generate outlier indicators\noutlier_df['income_outlier'] = winsor2.outlier_indicator(\n    outlier_df['income'], \n    method='iqr',           # Use IQR method\n    factor=1.5              # 1.5 * IQR threshold\n)\n\noutlier_df['extreme_outlier'] = winsor2.outlier_indicator(\n    outlier_df['income'],\n    method='percentile',    # Use percentile method\n    cuts=(1, 99)\n)\n\nprint(\"Outlier Detection Results:\")\nprint(f\"IQR method detected {outlier_df['income_outlier'].sum()} outliers\")\nprint(f\"Percentile method detected {outlier_df['extreme_outlier'].sum()} outliers\")\n```\n\n## Project Structure\n\n```\npystatar/\n\u251c\u2500\u2500 __init__.py              # Main package initialization\n\u251c\u2500\u2500 tabulate/               # Cross-tabulation module\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 core.py\n\u2502   \u251c\u2500\u2500 results.py\n\u2502   \u2514\u2500\u2500 stats.py\n\u251c\u2500\u2500 egen/                   # Extended generation module\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2514\u2500\u2500 core.py\n\u251c\u2500\u2500 reghdfe/               # High-dimensional FE regression\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 core.py\n\u2502   \u251c\u2500\u2500 estimation.py\n\u2502   \u2514\u2500\u2500 utils.py\n\u251c\u2500\u2500 winsor2/               # Winsorizing module\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u251c\u2500\u2500 core.py\n\u2502   \u2514\u2500\u2500 utils.py\n\u251c\u2500\u2500 utils/                 # Shared utilities\n\u2502   \u251c\u2500\u2500 __init__.py\n\u2502   \u2514\u2500\u2500 common.py\n\u2514\u2500\u2500 tests/                 # Test suite\n    \u251c\u2500\u2500 test_tabulate.py\n    \u251c\u2500\u2500 test_egen.py\n    \u251c\u2500\u2500 test_reghdfe.py\n    \u2514\u2500\u2500 test_winsor2.py\n```\n\n## Key Features\n\n- **Familiar Syntax**: Stata-like command structure and parameters\n- **Pandas Integration**: Seamless integration with pandas DataFrames\n- **High Performance**: Optimized implementations using pandas and NumPy\n- **Comprehensive Coverage**: Most commonly used Stata commands\n- **Statistical Rigor**: Proper statistical tests and robust standard errors\n- **Flexible Output**: Multiple output formats and customization options\n- **Missing Value Handling**: Configurable treatment of missing data\n\n## Documentation\n\nEach module comes with comprehensive documentation and examples:\n\n- [**tabulate Documentation**](docs/tabulate.md) - Cross-tabulation and frequency analysis\n- [**egen Documentation**](docs/egen.md) - Extended data generation functions\n- [**reghdfe Documentation**](docs/reghdfe.md) - High-dimensional fixed effects regression  \n- [**winsor2 Documentation**](docs/winsor2.md) - Data winsorizing and trimming\n\n## Contributing to the Project\n\nWe're building the future of academic research tools in Python! Here's how you can help:\n\n### Priority Commands Needed\nHelp us implement the remaining **16 high-priority commands**:\n\n**Data Management**: `summarize`, `describe`, `merge`, `reshape`, `collapse`, `keep`, `drop`, `generate`, `replace`, `sort`\n\n**Statistical Analysis**: `reg`, `logit`, `probit`, `ivregress`, `xtreg`, `anova`\n\n### How to Contribute\n\n1. **Request a Command**: [Open an issue](https://github.com/brycewang-stanford/PyStataR/issues/new) with the command you need\n2. ** Implement a Command**: Check our [contribution guidelines](CONTRIBUTING.md) and submit a PR\n3. ** Report Bugs**: Help us improve existing functionality\n4. ** Improve Documentation**: Add examples, tutorials, or clarifications\n5. ** Spread the Word**: Star the repo and share with fellow researchers\n\n###  Recognition\nAll contributors will be recognized in our documentation and release notes. Major contributors will be listed as co-authors on any academic publications about this project.\n\n###  Academic Collaboration\nWe welcome partnerships with universities and research institutions. If you're interested in using this project in your coursework or research, please reach out!\n\n## Community & Support\n\n- **Documentation**: [https://pystatar.readthedocs.io](docs/)\n- **Discussions**: [GitHub Discussions](https://github.com/brycewang-stanford/PyStataR/discussions)\n- **Issues**: [Bug Reports & Feature Requests](https://github.com/brycewang-stanford/PyStataR/issues)\n- ** Email**: brycew6m@stanford.edu for academic collaborations\n\n## Comparison with Stata\n\n| Feature | Stata | PyStataR | Advantage |\n|---------|-------|-------------------|-----------|\n| **Speed** | Base performance | 2-10x faster* | Vectorized operations |\n| **Memory** | Limited by system | Efficient pandas backend | Better large dataset handling |\n| **Extensibility** | Ado files | Python ecosystem | Unlimited customization |\n| **Cost** | $$$$ | Free & Open Source | Accessible to all researchers |\n| **Integration** | Standalone | Python data science stack | Seamless workflow |\n| **Output** | Limited formats | Multiple (LaTeX, HTML, etc.) | Publication ready |\n\n*Performance comparison based on typical academic datasets (1M+ observations)\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\ude4f Acknowledgments\n\nThis package builds upon the excellent work of:\n- [pandas](https://pandas.pydata.org/) - The backbone of our data manipulation\n- [numpy](https://numpy.org/) - Powering our numerical computations\n- [scipy](https://scipy.org/) - Statistical functions and algorithms\n- [statsmodels](https://www.statsmodels.org/) - Statistical modeling foundations\n- [pyhdfe](https://github.com/jeffgortmaker/pyhdfe) - High-dimensional fixed effects algorithms\n- The entire **Stata community** - For decades of statistical innovation that inspired this project\n\n##  Future Roadmap\n\n### Version 1.0 Goals (Target: End of 2025)\n-  Core 4 commands implemented\n-  Additional 16 high-priority commands\n-  Comprehensive test suite (>95% coverage)\n-  Complete documentation with tutorials\n-  Performance benchmarks vs Stata\n\n### Version 2.0 Vision (2026)\n-  Machine learning integration\n-  R integration for cross-platform compatibility\n-  Web interface for non-programmers\n-  Jupyter notebook extensions\n\n## \ud83d\udcc8 Project Statistics\n\n[![GitHub stars](https://img.shields.io/github/stars/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/stargazers)\n[![GitHub forks](https://img.shields.io/github/forks/brycewang-stanford/PyStataR?style=social)](https://github.com/brycewang-stanford/PyStataR/network)\n[![GitHub issues](https://img.shields.io/github/issues/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/issues)\n[![GitHub pull requests](https://img.shields.io/github/issues-pr/brycewang-stanford/PyStataR)](https://github.com/brycewang-stanford/PyStataR/pulls)\n\n##  Contact & Collaboration\n\n**Created by [Bryce Wang](https://github.com/brycewang-stanford)** - Stanford University\n\n-  **Email**: brycew6m@stanford.edu  \n-  **GitHub**: [@brycewang-stanford](https://github.com/brycewang-stanford)\n-  **Academic**: Stanford Graduate School of Business\n-  **LinkedIn**: [Connect with me](https://linkedin.com/in/brycewang)\n\n### Academic Partnerships Welcome!\n-  Course integration and teaching materials\n-  Research collaborations and citations\n-  Institutional licensing and support\n-  Student contributor programs\n\n---\n\n### \u2b50 **Love this project? Give it a star and help us reach more researchers!** \u2b50\n\n**Together, we're building the future of academic research in Python** \n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Comprehensive Python package providing Stata-equivalent commands for pandas DataFrames",
    "version": "0.0.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/brycewang-stanford/PyStataR/issues",
        "Documentation": "https://github.com/brycewang-stanford/PyStataR/docs",
        "Homepage": "https://github.com/brycewang-stanford/PyStataR",
        "Repository": "https://github.com/brycewang-stanford/PyStataR"
    },
    "split_keywords": [
        "stata",
        " pandas",
        " econometrics",
        " statistics",
        " data-analysis",
        " tabulate",
        " egen",
        " reghdfe",
        " winsor",
        " cross-tabulation",
        " fixed-effects",
        " panel-data",
        " regression"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "232cc0069fd4068c238eb99e4747b0da29de59413478e7c2bdafd47947af8feb",
                "md5": "b686a99d9d5ed074f825d483fc6e26b1",
                "sha256": "71346c0bdd36011c818fe4dc32334ac481d19a5258e1340193f7e20719f8b4a2"
            },
            "downloads": -1,
            "filename": "pystatar-0.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b686a99d9d5ed074f825d483fc6e26b1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 42822,
            "upload_time": "2025-07-27T08:19:16",
            "upload_time_iso_8601": "2025-07-27T08:19:16.780991Z",
            "url": "https://files.pythonhosted.org/packages/23/2c/c0069fd4068c238eb99e4747b0da29de59413478e7c2bdafd47947af8feb/pystatar-0.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "cdc04b4dfb7aa9282bbe97bd02b1d7b29600146188fad3fe56919b30f8283fc9",
                "md5": "c3370fe63f97066098c972e2071852f3",
                "sha256": "caf6f1df7780f92350b80f4dfddc6d1bab6dd386fc9d855a8ea9d340b72677ab"
            },
            "downloads": -1,
            "filename": "pystatar-0.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c3370fe63f97066098c972e2071852f3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 50842,
            "upload_time": "2025-07-27T08:19:18",
            "upload_time_iso_8601": "2025-07-27T08:19:18.272576Z",
            "url": "https://files.pythonhosted.org/packages/cd/c0/4b4dfb7aa9282bbe97bd02b1d7b29600146188fad3fe56919b30f8283fc9/pystatar-0.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-27 08:19:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "brycewang-stanford",
    "github_project": "PyStataR",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pystatar"
}
        
Elapsed time: 0.91140s