# PyEgen
🐍 **Python implementation of Stata's `egen` command**
PyEgen brings Stata's `egen` (extended generate) command to pandas DataFrames. It is aimed at researchers moving from Stata to Python who want the same row-wise and group-wise helpers under familiar names.
## Key Features
- **Familiar Syntax**: Function names and arguments follow Stata's `egen`
- **Pandas Integration**: Functions take pandas DataFrames and Series and return results you can assign to a new column
- **High Performance**: Implemented with pandas operations
- **Comprehensive**: Covers the most commonly used `egen` functions
- **Easy to Use**: Each function is a single call, with an optional `by` argument for grouping where supported
## Installation
```bash
pip install pyegen
```
## Quick Start
```python
import pandas as pd
import pyegen as egen
# Create sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'value1': [10, 20, 30, 40, 50, 60],
    'value2': [1, 2, 3, 4, 5, 6],
    'value3': [100, 200, 300, 400, 500, 600]
})
# Stata: egen newvar = rank(value1)
df['rank_val'] = egen.rank(df['value1'])
# Stata: egen newvar = rowmean(value1 value2 value3)
df['row_mean'] = egen.rowmean(df, ['value1', 'value2', 'value3'])
# Stata: egen newvar = rowtotal(value1 value2 value3)
df['row_total'] = egen.rowtotal(df, ['value1', 'value2', 'value3'])
# Stata: egen newvar = tag(group)
df['tag'] = egen.tag(df, ['group'])
# Stata: egen newvar = count(value1), by(group)
df['count_by_group'] = egen.count(df['value1'], by=df['group'])
```
## Available Functions
### Basic Functions
- **`rank(series, method='average')`** - Rank values (like Stata's `egen rank`)
- **`rowmean(df, columns)`** - Row-wise mean across specified columns
- **`rowtotal(df, columns)`** - Row-wise sum across specified columns
- **`rowmax(df, columns)`** - Row-wise maximum across specified columns
- **`rowmin(df, columns)`** - Row-wise minimum across specified columns
- **`rowcount(df, columns)`** - Count non-missing values across columns
- **`rowsd(df, columns)`** - Row-wise standard deviation
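
For readers who want to check results against plain pandas, the row-wise functions above correspond roughly to `axis=1` reductions. The sketch below uses only pandas; it is not PyEgen's implementation, and the `row_stats` frame and its column labels are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'value1': [10, 20, 30],
    'value2': [1, 2, 3],
    'value3': [100, 200, 300],
})
cols = ['value1', 'value2', 'value3']

# Each reduction runs across the selected columns, one result per row.
row_stats = pd.DataFrame({
    'rowmean': df[cols].mean(axis=1),
    'rowtotal': df[cols].sum(axis=1),
    'rowmax': df[cols].max(axis=1),
    'rowmin': df[cols].min(axis=1),
    'rowcount': df[cols].notna().sum(axis=1),  # number of non-missing values
    'rowsd': df[cols].std(axis=1),             # sample SD (ddof=1), as in Stata
})
print(row_stats)
```

Comparing a PyEgen result with the matching column here is a quick way to confirm that a function does what you expect on your own data.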
### Grouping Functions
- **`tag(df, columns)`** - Tag first occurrence in each group
- **`count(series, by=None)`** - Count observations (optionally by group)
- **`mean(series, by=None)`** - Group means
- **`sum(series, by=None)`** - Group sums
- **`max(series, by=None)`** - Group maxima
- **`min(series, by=None)`** - Group minima
- **`sd(series, by=None)`** - Group standard deviations
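
The grouping functions mirror Stata's `egen ..., by()` pattern, in which a group statistic is written back to every row of its group. In plain pandas that pattern is `groupby(...).transform(...)`. The following is a minimal sketch under that assumption; PyEgen's internals are not shown in this README, and the sample data is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'value': [10.0, 20.0, 30.0, np.nan, 50.0],
})

grouped = df.groupby('group')['value']

# transform() returns a Series aligned with the original rows,
# so each row receives the statistic of its own group.
df['count_by_group'] = grouped.transform('count')  # non-missing values only
df['mean_by_group'] = grouped.transform('mean')
df['sum_by_group'] = grouped.transform('sum')
df['sd_by_group'] = grouped.transform('std')       # sample SD (ddof=1)
print(df)
```

Note that `count` skips the missing value in group `B`, which matches how Stata's `egen count()` counts non-missing observations.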
### Advanced Functions
- **`seq()`** - Generate sequence numbers
- **`group(df, columns)`** - Create group identifiers
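- **`pc(series, by=None)`** - Calculate percentiles (note that Stata's own `egen pc()` returns each value as a percent of the total; percentiles in Stata are `egen pctile()`)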
- **`iqr(series, by=None)`** - Interquartile range
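
As a reference point, here is how the Stata behaviour behind these names can be written in plain pandas. This is a sketch based on Stata's documented definitions, not on PyEgen's source, so treat the mapping as an assumption and verify it against the package. pandas interpolates quantiles linearly by default, which can differ slightly from Stata's percentile rule on small groups.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'CHN', 'CHN', 'DEU', 'DEU'],
    'year': [2020, 2021, 2020, 2021, 2020, 2021],
    'gdp': [21.43, 22.32, 14.72, 17.73, 3.84, 4.26],
})

# group(): integer ids 1, 2, ... assigned in sorted order of the key
df['country_id'] = df.groupby('country').ngroup() + 1

# seq() with by(): a running counter that restarts in each group
df['seq_in_country'] = df.groupby('country').cumcount() + 1

# pc() in Stata: each value as a percent of the total
df['gdp_pct_of_total'] = 100 * df['gdp'] / df['gdp'].sum()

# iqr() with by(): 75th minus 25th percentile within each group
df['gdp_iqr_by_country'] = df.groupby('country')['gdp'].transform(
    lambda s: s.quantile(0.75) - s.quantile(0.25)
)
print(df)
```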
## 💡 Detailed Examples
### Working with Missing Values
```python
import numpy as np
# Data with missing values
df = pd.DataFrame({
    'var1': [1, 2, np.nan, 4, 5],
    'var2': [10, np.nan, 30, 40, 50],
    'var3': [100, 200, 300, np.nan, 500]
})
# Row statistics excluding missing values
df['mean_nonmissing'] = egen.rowmean(df, ['var1', 'var2', 'var3'])
df['count_nonmissing'] = egen.rowcount(df, ['var1', 'var2', 'var3'])
```
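
The edge case worth knowing is a row in which every value is missing. In Stata, `rowmean()` returns missing for such a row, while `rowtotal()` returns 0 unless its `missing` option is given. pandas behaves the same way by default, as the sketch below shows; whether PyEgen follows suit is not stated here, so this is only a pandas baseline to compare against. The data is a variant of the example above with an all-missing last row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'var1': [1, 2, np.nan, 4, np.nan],
    'var2': [10, np.nan, 30, 40, np.nan],
    'var3': [100, 200, 300, np.nan, np.nan],
})
cols = ['var1', 'var2', 'var3']

row_mean = df[cols].mean(axis=1)             # skips NaN; all-NaN row gives NaN
row_count = df[cols].notna().sum(axis=1)     # non-missing values per row
row_total = df[cols].sum(axis=1)             # NaN treated as 0; all-NaN row gives 0
row_total_strict = df[cols].sum(axis=1, min_count=1)  # all-NaN row gives NaN

print(pd.DataFrame({
    'mean': row_mean,
    'count': row_count,
    'total': row_total,
    'total_strict': row_total_strict,
}))
```

`min_count=1` plays the role of Stata's `rowtotal(..., missing)` option: the sum is only reported when at least one value is present.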
### Group Operations
```python
# Sample data with groups
df = pd.DataFrame({
    'country': ['USA', 'USA', 'CHN', 'CHN', 'DEU', 'DEU'],
    'year': [2020, 2021, 2020, 2021, 2020, 2021],
    'gdp': [21.43, 22.32, 14.72, 17.73, 3.84, 4.26],
    'population': [331, 332, 1439, 1412, 83, 83]
})
# Group-wise operations
df['mean_gdp_by_country'] = egen.mean(df['gdp'], by=df['country'])
df['country_tag'] = egen.tag(df, ['country'])
df['obs_per_country'] = egen.count(df['gdp'], by=df['country'])
```
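
A tag variable is 1 for exactly one row per group and 0 elsewhere, which makes it handy for counting groups or keeping one row per group. The pandas sketch below reproduces that idea with `duplicated()`; it assumes the first row of each group is the one tagged, as the function list describes, and is not taken from PyEgen's source.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'CHN', 'CHN', 'DEU', 'DEU'],
    'year': [2020, 2021, 2020, 2021, 2020, 2021],
})

# duplicated() is False for the first row of each key, True for repeats,
# so negating it marks the first occurrence in each group.
df['country_tag'] = (~df.duplicated(subset=['country'])).astype(int)

# Summing the tag counts distinct groups.
n_countries = int(df['country_tag'].sum())

# Filtering on the tag keeps one row per group.
one_row_per_country = df[df['country_tag'] == 1]
print(df)
```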
### Ranking and Percentiles
```python
# Ranking
df['gdp_rank'] = egen.rank(df['gdp'])  # Overall ranking
df['gdp_rank_by_year'] = egen.rank(df['gdp'], by=df['year'])  # Ranking within year (by= is not listed in the rank signature above)
# Percentiles
df['gdp_percentile'] = egen.pc(df['gdp'])
```
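
Stata's `egen rank()` ranks in ascending order and gives tied values the average of their ranks, which is also what pandas does with `method='average'`. The sketch below shows the overall and within-year ranks for the GDP data using pandas alone, as a baseline for checking PyEgen's output rather than a description of its internals.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'CHN', 'CHN', 'DEU', 'DEU'],
    'year': [2020, 2021, 2020, 2021, 2020, 2021],
    'gdp': [21.43, 22.32, 14.72, 17.73, 3.84, 4.26],
})

# Overall rank: smallest value gets rank 1
df['gdp_rank'] = df['gdp'].rank(method='average')

# Rank within each year
df['gdp_rank_by_year'] = df.groupby('year')['gdp'].rank(method='average')

# Ties share the average of the ranks they would occupy
tied = pd.Series([10, 20, 20, 30]).rank(method='average')
print(df)
print(tied.tolist())
```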
## Stata to Python Translation Guide
| Stata Command | PyEgen Equivalent |
|---------------|-------------------|
| `egen newvar = rank(var)` | `df['newvar'] = egen.rank(df['var'])` |
| `egen newvar = rowmean(var1-var3)` | `df['newvar'] = egen.rowmean(df, ['var1', 'var2', 'var3'])` |
| `egen newvar = rowtotal(var1-var3)` | `df['newvar'] = egen.rowtotal(df, ['var1', 'var2', 'var3'])` |
| `egen newvar = tag(group)` | `df['newvar'] = egen.tag(df, ['group'])` |
| `egen newvar = count(var), by(group)` | `df['newvar'] = egen.count(df['var'], by=df['group'])` |
| `egen newvar = mean(var), by(group)` | `df['newvar'] = egen.mean(df['var'], by=df['group'])` |
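
Stata's `var1-var3` varlist means every variable from `var1` through `var3` in dataset order, whereas the PyEgen calls in the table take an explicit list of column names. A small helper can build that list from a label slice. `expand_varlist` is a hypothetical name introduced for this example and is not part of PyEgen.

```python
import pandas as pd


def expand_varlist(df, first, last):
    """Return column names from `first` through `last`, inclusive, in DataFrame order."""
    # Label-based slicing with .loc includes both endpoints.
    return df.loc[:, first:last].columns.tolist()


df = pd.DataFrame({
    'id': [1, 2],
    'var1': [1.0, 2.0],
    'var2': [3.0, 4.0],
    'var3': [5.0, 6.0],
    'other': [0, 0],
})

cols = expand_varlist(df, 'var1', 'var3')
print(cols)
```

The resulting list can then be passed wherever the table shows `['var1', 'var2', 'var3']`.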
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
### Development Setup
```bash
git clone https://github.com/brycewang-stanford/pyegen.git
cd pyegen
pip install -e ".[dev]"
python -m pytest tests/
```
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- Inspired by Stata's `egen` command
- Built on the excellent pandas library
- Thanks to the open-source community for feedback and contributions
## 📞 Support
- 🐛 **Bug Reports**: [GitHub Issues](https://github.com/brycewang-stanford/pyegen/issues)
- 💡 **Feature Requests**: [GitHub Discussions](https://github.com/brycewang-stanford/pyegen/discussions)
- 📧 **Email**: brycew6m@stanford.edu
---
⭐ **If this package helps your research, please consider starring the repository!**
Raw data
{
"_id": null,
"home_page": null,
"name": "pyegen",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "stata, egen, pandas, data-analysis, econometrics, statistics",
"author": null,
"author_email": "Bryce Wang <brycew6m@stanford.edu>",
"download_url": "https://files.pythonhosted.org/packages/e1/c0/f852725a5cbc4401d777859a7dac27250eb7022d77a7df92d99430cbae2d/pyegen-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Python implementation of Stata's egen command for pandas DataFrames",
"version": "0.1.0",
"project_urls": {
"Bug Tracker": "https://github.com/brycewang-stanford/pyegen/issues",
"Documentation": "https://github.com/brycewang-stanford/pyegen/blob/main/README.md",
"Homepage": "https://github.com/brycewang-stanford/pyegen",
"Repository": "https://github.com/brycewang-stanford/pyegen"
},
"split_keywords": [
"stata",
" egen",
" pandas",
" data-analysis",
" econometrics",
" statistics"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2a50a4a97eb95b79e6a15515be8795df532a5536a52627d4424e9fd8db457d80",
"md5": "7f6995ff27494fa4dce43e6b75996ce2",
"sha256": "15f2e38cc6f5a11a8a029a18317a83865a167a0bdf3e39854804535efdcacf03"
},
"downloads": -1,
"filename": "pyegen-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7f6995ff27494fa4dce43e6b75996ce2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 7257,
"upload_time": "2025-07-25T05:14:46",
"upload_time_iso_8601": "2025-07-25T05:14:46.363720Z",
"url": "https://files.pythonhosted.org/packages/2a/50/a4a97eb95b79e6a15515be8795df532a5536a52627d4424e9fd8db457d80/pyegen-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e1c0f852725a5cbc4401d777859a7dac27250eb7022d77a7df92d99430cbae2d",
"md5": "f25da919f8bf4cdbf31ef2913a4e963d",
"sha256": "128f44859b8fe365fd99ca93863e4f326bcf7f03b357c9be67c3b8213305b328"
},
"downloads": -1,
"filename": "pyegen-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "f25da919f8bf4cdbf31ef2913a4e963d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 8554,
"upload_time": "2025-07-25T05:14:47",
"upload_time_iso_8601": "2025-07-25T05:14:47.741654Z",
"url": "https://files.pythonhosted.org/packages/e1/c0/f852725a5cbc4401d777859a7dac27250eb7022d77a7df92d99430cbae2d/pyegen-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-25 05:14:47",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "brycewang-stanford",
"github_project": "pyegen",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pyegen"
}