pyegen

| Field | Value |
|-------|-------|
| Name | pyegen |
| Version | 0.2.4 |
| Summary | Python implementation of Stata's egen command for pandas DataFrames |
| Upload time | 2025-07-30 23:18:29 |
| Requires Python | >=3.7 |
| License | MIT |
| Keywords | stata, egen, pandas, data-analysis, econometrics, statistics |
| Requirements | pandas, numpy |

# PyEgen

[![PyPI version](https://img.shields.io/pypi/v/pyegen.svg)](https://pypi.org/project/pyegen/)
[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Downloads](https://img.shields.io/pypi/dm/pyegen)](https://pypi.org/project/pyegen/)

Python implementation of Stata's `egen` command for pandas DataFrames. This package provides Stata-style data manipulation functions, making it easier for researchers to transition from Stata to Python while maintaining familiar syntax and functionality.

## Quick Start

```bash
pip install pyegen
```

```python
import pandas as pd
import numpy as np
import pyegen as egen

# Create sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15]
})

# Row-wise operations
df['first_nonmiss'] = egen.rowfirst(df, ['var1', 'var2', 'var3'])
df['row_median'] = egen.rowmedian(df, ['var1', 'var2', 'var3'])
df['missing_count'] = egen.rowmiss(df, ['var1', 'var2', 'var3'])

# Group-wise operations
df['group_mean'] = egen.mean(df['var1'], by=df['group'])
df['group_median'] = egen.median(df['var1'], by=df['group'])

# Ranking over the whole column (no `by` grouping is passed here)
df['var1_rank'] = egen.rank(df['var1'], method='min')

# Utility functions
df['has_value_1_or_2'] = egen.anymatch(df, ['var1', 'var2'], [1, 2])
df['concat_vars'] = egen.concat(df, ['group', 'var1'], punct='_')
```
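
To make clear what the row-wise calls above are expected to return, here is a pandas-only sketch of the same three computations. It does not use PyEgen and is not its implementation. It assumes that `rowfirst`, `rowmedian`, and `rowmiss` follow Stata's documented behaviour of skipping missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'var1': [1, np.nan, 3, 4, 5, 6],
    'var2': [np.nan, 2, 5, 6, 7, 8],
    'var3': [10, 11, 12, 13, 14, 15],
})
cols = ['var1', 'var2', 'var3']

# First non-missing value per row: back-fill across columns, then take the first column.
first_nonmiss = df[cols].bfill(axis=1).iloc[:, 0]

# Row median over the non-missing values only.
row_median = df[cols].median(axis=1, skipna=True)

# Number of missing values in each row.
missing_count = df[cols].isna().sum(axis=1)

print(pd.DataFrame({
    'first_nonmiss': first_nonmiss,
    'row_median': row_median,
    'missing_count': missing_count,
}))
```

In the first row, `var2` is missing, so the median is taken over `1` and `10` and the missing count is 1.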

## Available Functions

PyEgen reports **45 functions** and **100% coverage** of Stata's `egen` capabilities. The functions documented in this README are grouped as follows:

### Row-wise Functions
- `rowmean()`, `rowtotal()`, `rowmax()`, `rowmin()`, `rowsd()`
- `rowfirst()`, `rowlast()`, `rowmedian()`, `rowmiss()`, `rownonmiss()`, `rowpctile()`

### Statistical Functions
- `count()`, `mean()`, `sum()`, `max()`, `min()`, `sd()`
- `median()`, `mode()`, `iqr()`, `kurt()`, `skew()`, `mad()`, `mdev()`
- `pc()`, `pctile()`, `std()`, `total()`

### Utility Functions
- `rank()`, `tag()`, `group()`, `seq()`, `anycount()`, `anymatch()`, `anyvalue()`
- `concat()`, `cut()`, `diff()`, `ends()`, `fill()`

## 🎯 Key Features

- **Complete Stata Coverage**: All 45 egen functions implemented
- **Pandas Integration**: Works seamlessly with pandas DataFrames  
- **Missing Value Handling**: Consistent with Stata behavior
- **Group Operations**: Full support for by-group operations with `by` parameter
- **Type Safety**: Comprehensive input validation and error handling
- **Performance**: Optimized for large datasets
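
As an illustration of the by-group and missing-value behaviour described in this list, the sketch below reproduces a Stata-style `egen ... , by(group)` result with plain pandas. It is an assumption about the intended semantics, not PyEgen's code: the statistic is computed per group, missing values are skipped, and the result is written back to every row of the group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'var1': [1, np.nan, 3, 4, 5, 6],
})

# Group mean broadcast back to each row; the missing value in group A is ignored,
# and the row holding it still receives the group mean.
group_mean = df.groupby('group')['var1'].transform('mean')

# Number of non-missing observations per group, again one value per row.
group_count = df.groupby('group')['var1'].transform('count')

print(df.assign(group_mean=group_mean, group_count=group_count))
```

The key design point is `transform`, which keeps the original row count instead of collapsing to one row per group, matching how `egen` adds a new variable to the existing dataset.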

## 📚 Complete Function Reference

### Row-wise Functions
| Function | Stata Equivalent | Description |
|----------|------------------|-------------|
| `rowmean()` | `egen newvar = rowmean(varlist)` | Row mean |
| `rowtotal()` | `egen newvar = rowtotal(varlist)` | Row sum |
| `rowmax()` | `egen newvar = rowmax(varlist)` | Row maximum |
| `rowmin()` | `egen newvar = rowmin(varlist)` | Row minimum |
| `rowsd()` | `egen newvar = rowsd(varlist)` | Row standard deviation |
| `rowfirst()` | `egen newvar = rowfirst(varlist)` | First non-missing value |
| `rowlast()` | `egen newvar = rowlast(varlist)` | Last non-missing value |
| `rowmedian()` | `egen newvar = rowmedian(varlist)` | Row median |
| `rowmiss()` | `egen newvar = rowmiss(varlist)` | Count of missing values |
| `rownonmiss()` | `egen newvar = rownonmiss(varlist)` | Count of non-missing values |
| `rowpctile()` | `egen newvar = rowpctile(varlist), p(#)` | Row percentile |
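
The row-wise functions differ mainly in how they treat missing values. The pandas-only sketch below shows the Stata conventions that the table refers to: `rowtotal` treats missing as 0, while `rowmean` averages only the non-missing values and is itself missing when a row has none. The column names are made up for the example, and the code is not taken from PyEgen.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, np.nan, np.nan],
    'y': [2.0, 4.0, np.nan],
    'z': [3.0, np.nan, np.nan],
})

# rowtotal: missing values contribute 0, so an all-missing row sums to 0.
row_total = df.sum(axis=1)

# rowmean: mean of the non-missing values; NaN when every value is missing.
row_mean = df.mean(axis=1)

# rownonmiss: how many values in the row are present.
row_nonmiss = df.notna().sum(axis=1)

print(pd.DataFrame({'total': row_total, 'mean': row_mean, 'nonmiss': row_nonmiss}))
```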

### Statistical Functions (with grouping support)
| Function | Stata Equivalent | Description |
|----------|------------------|-------------|
| `count()` | `egen newvar = count(var), by(group)` | Count observations |
| `mean()` | `egen newvar = mean(var), by(group)` | Mean |
| `sum()` | `egen newvar = sum(var), by(group)` | Sum |
| `total()` | `egen newvar = total(var), by(group)` | Total (treats missing as 0) |
| `max()` | `egen newvar = max(var), by(group)` | Maximum |
| `min()` | `egen newvar = min(var), by(group)` | Minimum |
| `sd()` | `egen newvar = sd(var), by(group)` | Standard deviation |
| `median()` | `egen newvar = median(var), by(group)` | Median |
| `mode()` | `egen newvar = mode(var), by(group)` | Mode |
| `iqr()` | `egen newvar = iqr(var), by(group)` | Interquartile range |
| `kurt()` | `egen newvar = kurt(var), by(group)` | Kurtosis |
| `skew()` | `egen newvar = skew(var), by(group)` | Skewness |
| `mad()` | `egen newvar = mad(var), by(group)` | Median absolute deviation |
| `mdev()` | `egen newvar = mdev(var), by(group)` | Mean absolute deviation |
| `pctile()` | `egen newvar = pctile(var), p(#)` | Percentile |
| `pc()` | `egen newvar = pc(var), by(group)` | Percent of total |
| `std()` | `egen newvar = std(var), by(group)` | Standardized values |
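
For readers checking results against Stata, here is a pandas-only sketch of three of the grouped statistics in the table: `total()`, `pc()`, and `median()`, each with `by(group)`. The `sales` column and the data are hypothetical, and the code illustrates the expected semantics rather than PyEgen's internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'sales': [10.0, 30.0, 5.0, np.nan],
})
grouped = df.groupby('group')['sales']

# total(): group sum with missing values treated as 0, repeated on every row.
group_total = grouped.transform('sum')

# pc(): each value as a percentage (0-100) of its group total.
pct_of_total = 100 * df['sales'] / group_total

# median(): group median over the non-missing values.
group_median = grouped.transform('median')

print(df.assign(total=group_total, pc=pct_of_total, median=group_median))
```

The row with a missing `sales` value still receives the group total and median, but its percentage stays missing.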

### Utility Functions
| Function | Stata Equivalent | Description |
|----------|------------------|-------------|
| `rank()` | `egen newvar = rank(var)` | Ranking with tie options |
| `tag()` | `egen newvar = tag(varlist)` | Tag first obs in group |
| `group()` | `egen newvar = group(varlist)` | Create group identifiers |
| `seq()` | `egen newvar = seq()` | Generate sequences |
| `anycount()` | `egen newvar = anycount(varlist), v(values)` | Count matching values |
| `anymatch()` | `egen newvar = anymatch(varlist), v(values)` | Check for matches |
| `anyvalue()` | `egen newvar = anyvalue(var), v(values)` | Return matching values |
| `concat()` | `egen newvar = concat(varlist), punct()` | Concatenate variables |
| `cut()` | `egen newvar = cut(var), group(#)` | Create categorical from continuous |
| `diff()` | `egen newvar = diff(varlist)` | Check if variables differ |
| `ends()` | `egen newvar = ends(strvar), head\|last\|tail` | Extract string parts |
| `fill()` | `egen newvar = fill(numlist)` | Create repeating patterns |
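
Several of the utility functions have short pandas counterparts. The sketch below shows one plausible equivalent each for `tag()`, `group()`, `anymatch()`, and `concat()`. The `state` and `year` columns are invented for the example, and the code assumes Stata's conventions: `tag` marks the first observation of each group with 1, and `group` numbers groups from 1 in sorted order.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['TX', 'CA', 'TX', 'CA', 'NY'],
    'year': [2020, 2020, 2020, 2021, 2020],
})

# tag(state): 1 on the first row of each distinct state, 0 elsewhere.
tag = (~df.duplicated(subset=['state'])).astype(int)

# group(state): consecutive integer ids starting at 1, ordered by sorted key.
group_id = df.groupby(['state']).ngroup() + 1

# anymatch(year), values(2021): 1 if any listed column equals one of the values.
any_match = df[['year']].isin([2021]).any(axis=1).astype(int)

# concat(state year), punct(_): join the columns as strings with a separator.
concat = df['state'].astype(str) + '_' + df['year'].astype(str)

print(df.assign(tag=tag, group_id=group_id, any_match=any_match, concat=concat))
```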

## 💡 Migration Recommendation

**For new projects**, we recommend the unified **PyStataR** package, which provides a suite of Stata-equivalent commands:

```bash
pip install py-stata-commands
```

```python
from py_stata_commands import egen
df['rank_var'] = egen.rank(df['income'])
```

### Why Consider PyStataR?

- **Single installation** for all Stata-equivalent commands (tabulate, egen, reghdfe, winsor2)
- **Consistent API** across all modules  
- **Enhanced documentation** and examples
- **Active development** and long-term support

**PyStataR Repository:** https://github.com/brycewang-stanford/PyStataR

## Documentation & Examples

For comprehensive examples and function documentation, see:
- [Complete Function Reference](egen_demo_en.ipynb)
- [Stata-to-PyEgen Mapping](egen_demo_en.ipynb#8-stata-to-python-conversion-reference-table)

## 📊 Function Coverage Status

- ✅ Row-wise functions: 11/11 (100%)
- ✅ Statistical functions: 17/17 (100%)
- ✅ Utility functions: 12/12 (100%)
- ✅ String functions: 2/2 (100%)
- ✅ Sequence functions: 2/2 (100%)

**Total: 45/45 functions (100% coverage)**

## 🧪 Testing

```bash
# Run tests
python -m pytest tests/

# Run specific test
python -m pytest tests/test_core.py
```

## 🔧 Project Status

**PyEgen will continue to be maintained** for existing users, but new feature development will primarily focus on PyStataR. This ensures:
- ✅ Bug fixes and compatibility updates for PyEgen
- ✅ Stable API for existing codebases
- 🚀 Enhanced features and new capabilities in PyStataR

## Installation & Requirements

```bash
pip install pyegen
```

**Requirements:**
- Python 3.7+
- pandas >= 1.3.0
- numpy >= 1.20.0
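
If you are unsure whether an existing environment meets these minimums, a quick check like the one below can help. It is a minimal sketch: `version_tuple` is a hypothetical helper written for this example, and it compares only the leading numeric part of each version string.

```python
import re
import sys

import numpy as np
import pandas as pd


def version_tuple(text):
    """Return the leading 'major.minor[.patch]' part of a version string as a tuple of ints."""
    match = re.match(r'(\d+)\.(\d+)(?:\.(\d+))?', text)
    if match is None:
        raise ValueError(f'unrecognised version string: {text!r}')
    return tuple(int(part or 0) for part in match.groups())


# Minimum versions copied from the requirements list above.
checks = {
    'python': (sys.version_info[:3], (3, 7, 0)),
    'pandas': (version_tuple(pd.__version__), (1, 3, 0)),
    'numpy': (version_tuple(np.__version__), (1, 20, 0)),
}

for name, (found, required) in checks.items():
    status = 'ok' if found >= required else 'too old'
    print(f'{name}: {found} (needs >= {required}) -> {status}')
```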

## 🤝 Contributing

We welcome contributions! For major changes, please consider contributing to [PyStataR](https://github.com/brycewang-stanford/PyStataR) for maximum impact.

## 🔗 Stata Documentation Reference

This implementation follows the official Stata documentation for egen:
- [Stata 18 egen documentation](https://www.stata.com/manuals/d/egen.pdf)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Related Projects

- **[PyStataR](https://github.com/brycewang-stanford/PyStataR)** - Unified Stata-equivalent commands and R functions (recommended for new projects)
- **[StatsPAI](https://github.com/brycewang-stanford/StatsPAI/)** - StatsPAI = Stats + Econometrics + ML + AI + LLMs

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pyegen",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "stata, egen, pandas, data-analysis, econometrics, statistics",
    "author": null,
    "author_email": "Bryce Wang <brycew6m@stanford.edu>",
    "download_url": "https://files.pythonhosted.org/packages/b0/c3/7d08d47d4f86c6217a1dcea86a89dc57ff5a33fda7685850f862035aa6c3/pyegen-0.2.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python implementation of Stata's egen command for pandas DataFrames",
    "version": "0.2.4",
    "project_urls": {
        "Bug Tracker": "https://github.com/brycewang-stanford/pyegen/issues",
        "Documentation": "https://github.com/brycewang-stanford/pyegen/blob/main/README.md",
        "Homepage": "https://github.com/brycewang-stanford/pyegen",
        "Repository": "https://github.com/brycewang-stanford/pyegen"
    },
    "split_keywords": [
        "stata",
        " egen",
        " pandas",
        " data-analysis",
        " econometrics",
        " statistics"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "79080678696ed31dabb32ebfa46604367247fbb2134965a146e107eb1e99f1c9",
                "md5": "758385433093a4b6bfff35c7eab6e8a9",
                "sha256": "e590c7146c548b8888df9251aef3ba1eed6d92053315360b1cb1ab46d50b3d37"
            },
            "downloads": -1,
            "filename": "pyegen-0.2.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "758385433093a4b6bfff35c7eab6e8a9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 11737,
            "upload_time": "2025-07-30T23:18:28",
            "upload_time_iso_8601": "2025-07-30T23:18:28.768561Z",
            "url": "https://files.pythonhosted.org/packages/79/08/0678696ed31dabb32ebfa46604367247fbb2134965a146e107eb1e99f1c9/pyegen-0.2.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b0c37d08d47d4f86c6217a1dcea86a89dc57ff5a33fda7685850f862035aa6c3",
                "md5": "b067719a1c3bd13d35599e2ae95af747",
                "sha256": "a23af21c794e8d451089e20d46aca0364de89059e87b477d9de5ec2a7a9a6d60"
            },
            "downloads": -1,
            "filename": "pyegen-0.2.4.tar.gz",
            "has_sig": false,
            "md5_digest": "b067719a1c3bd13d35599e2ae95af747",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 18916,
            "upload_time": "2025-07-30T23:18:29",
            "upload_time_iso_8601": "2025-07-30T23:18:29.869858Z",
            "url": "https://files.pythonhosted.org/packages/b0/c3/7d08d47d4f86c6217a1dcea86a89dc57ff5a33fda7685850f862035aa6c3/pyegen-0.2.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-30 23:18:29",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "brycewang-stanford",
    "github_project": "pyegen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        }
    ],
    "lcname": "pyegen"
}
        