# PyEgen
Python implementation of Stata's `egen` command for pandas DataFrames. This package provides Stata-style data manipulation functions, making it easier for researchers to transition from Stata to Python while maintaining familiar syntax and functionality.
## Quick Start
```bash
pip install pyegen
```
```python
import pandas as pd
import numpy as np
import pyegen as egen
# Create sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15]
})
# Row-wise operations
df['first_nonmiss'] = egen.rowfirst(df, ['var1', 'var2', 'var3'])
df['row_median'] = egen.rowmedian(df, ['var1', 'var2', 'var3'])
df['missing_count'] = egen.rowmiss(df, ['var1', 'var2', 'var3'])
# Group-wise operations
df['group_mean'] = egen.mean(df['var1'], by=df['group'])
df['group_median'] = egen.median(df['var1'], by=df['group'])
df['group_rank'] = egen.rank(df['var1'], method='min')
# Utility functions
df['has_value_1_or_2'] = egen.anymatch(df, ['var1', 'var2'], [1, 2])
df['concat_vars'] = egen.concat(df, ['group', 'var1'], punct='_')
```
## Available Functions
PyEgen provides **45 functions** with **100% coverage** of Stata's `egen` functions:
### Row-wise Functions
- `rowmean()`, `rowtotal()`, `rowmax()`, `rowmin()`, `rowsd()`
- `rowfirst()`, `rowlast()`, `rowmedian()`, `rowmiss()`, `rownonmiss()`, `rowpctile()`
### Statistical Functions
- `rank()`, `count()`, `mean()`, `sum()`, `max()`, `min()`, `sd()`
- `median()`, `mode()`, `iqr()`, `kurt()`, `skew()`, `mad()`, `mdev()`
- `pc()`, `pctile()`, `std()`, `total()`
### Utility Functions
- `tag()`, `group()`, `seq()`, `anycount()`, `anymatch()`, `anyvalue()`
- `concat()`, `cut()`, `diff()`, `ends()`, `fill()`
## 🎯 Key Features
- **Complete Stata Coverage**: All 45 egen functions implemented
- **Pandas Integration**: Works seamlessly with pandas DataFrames
- **Missing Value Handling**: Consistent with Stata behavior
- **Group Operations**: Full support for by-group operations with `by` parameter
- **Type Safety**: Comprehensive input validation and error handling
- **Performance**: Optimized for large datasets
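The by-group behaviour listed above produces one value per row rather than one per group, as Stata's `egen ..., by()` does. The following is a minimal sketch of that shape in plain pandas; it is an illustration under that assumption, not PyEgen's actual implementation, and the sample data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "var1": [1.0, np.nan, 3.0, 4.0],
})

# transform() broadcasts the per-group statistic back to every row,
# so the result aligns with the original index; NaN is skipped by default.
group_mean = df.groupby("group")["var1"].transform("mean")
print(group_mean.tolist())  # [1.0, 1.0, 3.5, 3.5]
```

The row with a missing `var1` still receives its group's mean, because the statistic is computed over the non-missing values and then assigned to every member of the group.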
## 📚 Complete Function Reference
### Row-wise Functions
| Function | Stata Equivalent | Description |
|----------|------------------|-------------|
| `rowmean()` | `egen newvar = rowmean(varlist)` | Row mean |
| `rowtotal()` | `egen newvar = rowtotal(varlist)` | Row sum |
| `rowmax()` | `egen newvar = rowmax(varlist)` | Row maximum |
| `rowmin()` | `egen newvar = rowmin(varlist)` | Row minimum |
| `rowsd()` | `egen newvar = rowsd(varlist)` | Row standard deviation |
| `rowfirst()` | `egen newvar = rowfirst(varlist)` | First non-missing value |
| `rowlast()` | `egen newvar = rowlast(varlist)` | Last non-missing value |
| `rowmedian()` | `egen newvar = rowmedian(varlist)` | Row median |
| `rowmiss()` | `egen newvar = rowmiss(varlist)` | Count of missing values |
| `rownonmiss()` | `egen newvar = rownonmiss(varlist)` | Count of non-missing values |
| `rowpctile()` | `egen newvar = rowpctile(varlist), p(#)` | Row percentile |
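To make the row-wise descriptions concrete, here is a rough plain-pandas sketch of several of the quantities in the table. It uses only standard pandas calls and made-up data; PyEgen's own functions may handle edge cases (such as all-missing rows) differently.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "var1": [1.0, np.nan, 3.0],
    "var2": [np.nan, 2.0, 5.0],
    "var3": [10.0, 11.0, 12.0],
})
cols = ["var1", "var2", "var3"]

row_mean = df[cols].mean(axis=1)               # mean over non-missing values
row_miss = df[cols].isna().sum(axis=1)         # count of missing values
row_nonmiss = df[cols].notna().sum(axis=1)     # count of non-missing values
row_first = df[cols].bfill(axis=1).iloc[:, 0]  # first non-missing, left to right
row_last = df[cols].ffill(axis=1).iloc[:, -1]  # last non-missing, left to right

print(row_first.tolist())  # [1.0, 2.0, 3.0]
print(row_miss.tolist())   # [1, 1, 0]
```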
### Statistical Functions (with grouping support)
| Function | Stata Equivalent | Description |
|----------|------------------|-------------|
| `count()` | `egen newvar = count(var), by(group)` | Count observations |
| `mean()` | `egen newvar = mean(var), by(group)` | Mean |
| `sum()` | `egen newvar = sum(var), by(group)` | Sum |
| `total()` | `egen newvar = total(var), by(group)` | Total (treats missing as 0) |
| `max()` | `egen newvar = max(var), by(group)` | Maximum |
| `min()` | `egen newvar = min(var), by(group)` | Minimum |
| `sd()` | `egen newvar = sd(var), by(group)` | Standard deviation |
| `median()` | `egen newvar = median(var), by(group)` | Median |
| `mode()` | `egen newvar = mode(var), by(group)` | Mode |
| `iqr()` | `egen newvar = iqr(var), by(group)` | Interquartile range |
| `kurt()` | `egen newvar = kurt(var), by(group)` | Kurtosis |
| `skew()` | `egen newvar = skew(var), by(group)` | Skewness |
| `mad()` | `egen newvar = mad(var), by(group)` | Median absolute deviation |
| `mdev()` | `egen newvar = mdev(var), by(group)` | Mean absolute deviation |
| `pctile()` | `egen newvar = pctile(var), p(#)` | Percentile |
| `pc()` | `egen newvar = pc(var), by(group)` | Percent of total |
| `std()` | `egen newvar = std(var), by(group)` | Standardized values |
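Two of the less obvious entries, "percent of total" and "standardized values", can be written out directly. The sketch below computes them within groups using plain pandas; it assumes the sample standard deviation (`ddof=1`, the pandas default) and is not taken from PyEgen's source.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "var1": [1.0, 3.0, 2.0, 6.0],
})
g = df.groupby("group")["var1"]

# Percent of group total: 100 * x / sum(x) within each group
pct_of_total = 100 * df["var1"] / g.transform("sum")

# Standardized values: (x - group mean) / group sample standard deviation
standardized = (df["var1"] - g.transform("mean")) / g.transform("std")

print(pct_of_total.tolist())  # [25.0, 75.0, 25.0, 75.0]
```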
### Utility Functions
| Function | Stata Equivalent | Description |
|----------|------------------|-------------|
| `rank()` | `egen newvar = rank(var)` | Ranking with tie options |
| `tag()` | `egen newvar = tag(varlist)` | Tag first obs in group |
| `group()` | `egen newvar = group(varlist)` | Create group identifiers |
| `seq()` | `egen newvar = seq()` | Generate sequences |
| `anycount()` | `egen newvar = anycount(varlist), v(values)` | Count matching values |
| `anymatch()` | `egen newvar = anymatch(varlist), v(values)` | Check for matches |
| `anyvalue()` | `egen newvar = anyvalue(var), v(values)` | Return matching values |
| `concat()` | `egen newvar = concat(varlist), punct()` | Concatenate variables |
| `cut()` | `egen newvar = cut(var), group(#)` | Create categorical from continuous |
| `diff()` | `egen newvar = diff(varlist)` | Check if variables differ |
| `ends()` | `egen newvar = ends(strvar), head\|last\|tail` | Extract string parts |
| `fill()` | `egen newvar = fill(numlist)` | Create repeating patterns |
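For readers unfamiliar with `tag()` and `group()`, the following plain-pandas sketch shows the kind of output the descriptions above refer to. The variable names and data are hypothetical, and the within-group counter is added only for comparison; none of this is PyEgen's implementation.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({"group": ["B", "A", "B", "C", "A"]})

# 1 for the first row seen in each group, 0 otherwise
tag = (~df.duplicated(subset=["group"])).astype(int)

# Integer identifier per group, numbered 1..k in sorted key order
group_id = df.groupby("group").ngroup() + 1

# Running counter within each group, starting at 1
seq_in_group = df.groupby("group").cumcount() + 1

print(tag.tolist())       # [1, 1, 0, 1, 0]
print(group_id.tolist())  # [2, 1, 2, 3, 1]
```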
## 💡 Migration Recommendation
**For new projects**, we recommend the unified **PyStataR** package, which provides a comprehensive suite of Stata-equivalent commands:
```bash
pip install py-stata-commands
```
```python
from py_stata_commands import egen
df['rank_var'] = egen.rank(df['income'])
```
### Why Consider PyStataR?
- **Single installation** for all Stata-equivalent commands (tabulate, egen, reghdfe, winsor2)
- **Consistent API** across all modules
- **Enhanced documentation** and examples
- **Active development** and long-term support
**PyStataR Repository:** https://github.com/brycewang-stanford/PyStataR
## Documentation & Examples
For comprehensive examples and function documentation, see:
- [Complete Function Reference](egen_demo_en.ipynb)
- [Stata-to-PyEgen Mapping](egen_demo_en.ipynb#8-stata-to-python-conversion-reference-table)
## 📊 Function Coverage Status
- ✅ Row-wise functions: 11/11 (100%)
- ✅ Statistical functions: 17/17 (100%)
- ✅ Utility functions: 12/12 (100%)
- ✅ String functions: 2/2 (100%)
- ✅ Sequence functions: 2/2 (100%)
**Total: 45/45 functions (100% coverage)**
## 🧪 Testing
```bash
# Run tests
python -m pytest tests/
# Run specific test
python -m pytest tests/test_core.py
```
## 🔧 Project Status
**PyEgen will continue to be maintained** for existing users, but new feature development will primarily focus on PyStataR. This ensures:
- ✅ Bug fixes and compatibility updates for PyEgen
- ✅ Stable API for existing codebases
- 🚀 Enhanced features and new capabilities in PyStataR
## Installation & Requirements
```bash
pip install pyegen
```
**Requirements:**
- Python 3.7+
- pandas >= 1.3.0
- numpy >= 1.20.0
## 🤝 Contributing
We welcome contributions! For major changes, please consider contributing to [PyStataR](https://github.com/brycewang-stanford/PyStataR) for maximum impact.
## 🔗 Stata Documentation Reference
This implementation follows the official Stata documentation for egen:
- [Stata 18 egen documentation](https://www.stata.com/manuals/d/egen.pdf)
## 📄 License
This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
## 🔗 Related Projects
- **[PyStataR](https://github.com/brycewang-stanford/PyStataR)** - Unified Stata-equivalent commands and R functions (recommended for new projects)
- **[StatsPAI](https://github.com/brycewang-stanford/StatsPAI/)** - StatsPAI = Stats + Econometrics + ML + AI + LLMs