pyegen

| Field | Value |
|-------|-------|
| Name | pyegen |
| Version | 0.1.0 |
| Summary | Python implementation of Stata's egen command for pandas DataFrames |
| Upload time | 2025-07-25 05:14:47 |
| Requires Python | >=3.7 |
| License | MIT |
| Keywords | stata, egen, pandas, data-analysis, econometrics, statistics |
| Requirements | No requirements were recorded. |

# PyEgen

[![PyPI version](https://badge.fury.io/py/pyegen.svg)](https://badge.fury.io/py/pyegen)
[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

🐍 **Python implementation of Stata's `egen` command**

PyEgen brings the power and convenience of Stata's `egen` (extended generate) command to Python pandas DataFrames. If you're a researcher transitioning from Stata to Python, this package will make your data manipulation tasks much more familiar and efficient.

## Key Features

- **Familiar Syntax**: Stata-like syntax for data manipulation
- **Pandas Integration**: Seamless integration with pandas DataFrames  
- **High Performance**: Optimized implementations using pandas operations
- **Comprehensive**: Covers most commonly used `egen` functions
- **Easy to Use**: Simple, intuitive API

## Installation

```bash
pip install pyegen
```

## Quick Start

```python
import pandas as pd
import pyegen as egen

# Create sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C', 'C'],
    'value1': [10, 20, 30, 40, 50, 60],
    'value2': [1, 2, 3, 4, 5, 6],
    'value3': [100, 200, 300, 400, 500, 600]
})

# Stata: egen newvar = rank(value1)
df['rank_val'] = egen.rank(df['value1'])

# Stata: egen newvar = rowmean(value1 value2 value3)
df['row_mean'] = egen.rowmean(df, ['value1', 'value2', 'value3'])

# Stata: egen newvar = rowtotal(value1 value2 value3)
df['row_total'] = egen.rowtotal(df, ['value1', 'value2', 'value3'])

# Stata: egen newvar = tag(group)
df['tag'] = egen.tag(df, ['group'])

# Stata: egen newvar = count(value1), by(group)
df['count_by_group'] = egen.count(df['value1'], by=df['group'])
```

## Available Functions

### Basic Functions
- **`rank(series, method='average')`** - Rank values (like Stata's `egen rank`)
- **`rowmean(df, columns)`** - Row-wise mean across specified columns
- **`rowtotal(df, columns)`** - Row-wise sum across specified columns
- **`rowmax(df, columns)`** - Row-wise maximum across specified columns
- **`rowmin(df, columns)`** - Row-wise minimum across specified columns
- **`rowcount(df, columns)`** - Count non-missing values across columns
- **`rowsd(df, columns)`** - Row-wise standard deviation
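
To make the row-wise semantics concrete, here is a minimal sketch of what these functions compute, written in plain pandas rather than with PyEgen. It is not the package's implementation. The column names are made up, and the use of the sample standard deviation (pandas' default `ddof=1`) for `rowsd` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing cells
df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'b': [3.0, np.nan, 6.0],
    'c': [5.0, 8.0, 9.0],
})
cols = ['a', 'b', 'c']

# pandas skips NaN by default in all of these row-wise reductions
row_stats = pd.DataFrame({
    'rowmean': df[cols].mean(axis=1),
    'rowtotal': df[cols].sum(axis=1),
    'rowmax': df[cols].max(axis=1),
    'rowmin': df[cols].min(axis=1),
    'rowcount': df[cols].count(axis=1),  # number of non-missing cells
    'rowsd': df[cols].std(axis=1),       # sample SD (ddof=1), assumed
})
print(row_stats)
```

Each statistic is computed over the non-missing cells of a row, so the second row's mean is the average of 2 and 8 only.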

### Grouping Functions
- **`tag(df, columns)`** - Tag first occurrence in each group
- **`count(series, by=None)`** - Count observations (optionally by group)
- **`mean(series, by=None)`** - Group means
- **`sum(series, by=None)`** - Group sums
- **`max(series, by=None)`** - Group maxima
- **`min(series, by=None)`** - Group minima
- **`sd(series, by=None)`** - Group standard deviations
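
The grouping functions return one value per row, with the group statistic repeated for every member of the group. The following sketch reproduces that behavior with `groupby(...).transform(...)` in plain pandas. It illustrates the expected results and is not taken from PyEgen's source; the data and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical data; the None becomes NaN and is skipped by the statistics
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'B'],
    'value': [10.0, 20.0, 30.0, None, 50.0],
})

g = df.groupby('group')['value']
result = df.assign(
    grp_count=g.transform('count'),  # non-missing observations per group
    grp_mean=g.transform('mean'),
    grp_sum=g.transform('sum'),
    grp_max=g.transform('max'),
    grp_min=g.transform('min'),
    grp_sd=g.transform('std'),       # sample SD (ddof=1)
    # tag: 1 on the first row of each group, 0 elsewhere
    tag=(~df.duplicated(subset=['group'])).astype(int),
)
print(result)
```

`transform` keeps the original row order and length, which is what makes the result directly assignable as a new column.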

### Advanced Functions
- **`seq()`** - Generate sequence numbers
- **`group(df, columns)`** - Create group identifiers
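- **`pc(series, by=None)`** - Calculate percentiles (note that Stata's own `egen pc()` returns each value as a percentage of the total, while `pctile()` computes percentiles)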
- **`iqr(series, by=None)`** - Interquartile range
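
As a rough guide to what these functions correspond to, the sketch below builds similar columns in plain pandas. It is an illustration under stated assumptions, not PyEgen's code: the `pc` line follows Stata's percentage-of-total definition, the group ids are numbered from 1 in sorted key order, and the IQR uses pandas' default linear-interpolation quantiles, which can differ slightly from Stata's. All names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['east', 'west', 'east', 'west'],
    'sales': [10.0, 20.0, 30.0, 40.0],
})

# seq: running observation number starting at 1
df['seq'] = range(1, len(df) + 1)

# group: one integer id per distinct key, starting at 1
df['region_id'] = df.groupby('region').ngroup() + 1

# pc (Stata definition): each value as a percentage of the column total
df['pct_of_total'] = 100 * df['sales'] / df['sales'].sum()

# iqr by group: 75th minus 25th percentile, repeated for each group member
df['iqr_by_region'] = df.groupby('region')['sales'].transform(
    lambda s: s.quantile(0.75) - s.quantile(0.25)
)
print(df)
```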

## 💡 Detailed Examples

### Working with Missing Values
```python
import numpy as np

# Data with missing values
df = pd.DataFrame({
    'var1': [1, 2, np.nan, 4, 5],
    'var2': [10, np.nan, 30, 40, 50],
    'var3': [100, 200, 300, np.nan, 500]
})

# Row statistics excluding missing values
df['mean_nonmissing'] = egen.rowmean(df, ['var1', 'var2', 'var3'])
df['count_nonmissing'] = egen.rowcount(df, ['var1', 'var2', 'var3'])
```
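
One edge case worth checking is a row in which every column is missing. In Stata, `rowmean` returns missing for such a row, while `rowtotal` returns 0 unless the `missing` option is given. The sketch below shows both conventions in plain pandas; which one PyEgen's `rowtotal` follows is not stated here, so treat the mapping as an assumption and verify it against your installed version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'var1': [1.0, np.nan, np.nan],
    'var2': [10.0, 20.0, np.nan],   # last row is entirely missing
})
cols = ['var1', 'var2']

mean_skipna = df[cols].mean(axis=1)                 # NaN for the all-missing row
total_zero = df[cols].sum(axis=1)                   # 0.0 for the all-missing row
total_missing = df[cols].sum(axis=1, min_count=1)   # NaN for the all-missing row
n_nonmissing = df[cols].count(axis=1)               # 0 for the all-missing row

print(pd.DataFrame({
    'mean': mean_skipna,
    'total_zero': total_zero,
    'total_missing': total_missing,
    'n': n_nonmissing,
}))
```

The `min_count=1` argument is what switches pandas from the "missing counts as zero" convention to the "all missing stays missing" convention.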

### Group Operations
```python
# Sample data with groups
df = pd.DataFrame({
    'country': ['USA', 'USA', 'CHN', 'CHN', 'DEU', 'DEU'],
    'year': [2020, 2021, 2020, 2021, 2020, 2021],
    'gdp': [21.43, 22.32, 14.72, 17.73, 3.84, 4.26],
    'population': [331, 332, 1439, 1412, 83, 83]
})

# Group-wise operations
df['mean_gdp_by_country'] = egen.mean(df['gdp'], by=df['country'])
df['country_tag'] = egen.tag(df, ['country'])
df['obs_per_country'] = egen.count(df['gdp'], by=df['country'])
```

### Ranking and Percentiles
```python
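# Ranking
df['gdp_rank'] = egen.rank(df['gdp'])  # Overall ranking
# Ranking within year. Note: `by` is not part of the rank() signature listed
# under Basic Functions; if your installed version rejects it, the plain
# pandas equivalent is df.groupby('year')['gdp'].rank()
df['gdp_rank_by_year'] = egen.rank(df['gdp'], by=df['year'])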

# Percentiles
df['gdp_percentile'] = egen.pc(df['gdp'])
```
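
How ties are handled matters when comparing results with Stata. The sketch below uses pandas' own `rank` to show three conventions: average ranks for ties (the default `method='average'` listed above), and two variants that, as an assumption, correspond to Stata's `track` (lowest value ranked 1) and `field` (highest value ranked 1) options. It also shows ranking within groups. This is plain pandas, not PyEgen's implementation, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021],
    'score': [5.0, 7.0, 7.0, 3.0, 9.0],
})

# Ties share the average of the ranks they occupy
df['rank_avg'] = df['score'].rank(method='average')

# Ties share the lowest rank they occupy; smallest value gets rank 1
df['rank_track'] = df['score'].rank(method='min')

# Same tie rule, but the largest value gets rank 1
df['rank_field'] = df['score'].rank(method='min', ascending=False)

# Average ranks computed separately within each year
df['rank_in_year'] = df.groupby('year')['score'].rank(method='average')
print(df)
```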

## Stata to Python Translation Guide

| Stata Command | PyEgen Equivalent |
|---------------|-------------------|
| `egen newvar = rank(var)` | `df['newvar'] = egen.rank(df['var'])` |
| `egen newvar = rowmean(var1-var3)` | `df['newvar'] = egen.rowmean(df, ['var1', 'var2', 'var3'])` |
| `egen newvar = rowtotal(var1-var3)` | `df['newvar'] = egen.rowtotal(df, ['var1', 'var2', 'var3'])` |
| `egen newvar = tag(group)` | `df['newvar'] = egen.tag(df, ['group'])` |
| `egen newvar = count(var), by(group)` | `df['newvar'] = egen.count(df['var'], by=df['group'])` |
| `egen newvar = mean(var), by(group)` | `df['newvar'] = egen.mean(df['var'], by=df['group'])` |

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup
```bash
git clone https://github.com/brycewang-stanford/pyegen.git
cd pyegen
pip install -e ".[dev]"
python -m pytest tests/
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Inspired by Stata's `egen` command
- Built on the excellent pandas library
- Thanks to the open-source community for feedback and contributions

## 📞 Support

- 🐛 **Bug Reports**: [GitHub Issues](https://github.com/brycewang-stanford/pyegen/issues)
- 💡 **Feature Requests**: [GitHub Discussions](https://github.com/brycewang-stanford/pyegen/discussions)
- 📧 **Email**: brycew6m@stanford.edu

---

⭐ **If this package helps your research, please consider starring the repository!**

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pyegen",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "stata, egen, pandas, data-analysis, econometrics, statistics",
    "author": null,
    "author_email": "Bryce Wang <brycew6m@stanford.edu>",
    "download_url": "https://files.pythonhosted.org/packages/e1/c0/f852725a5cbc4401d777859a7dac27250eb7022d77a7df92d99430cbae2d/pyegen-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python implementation of Stata's egen command for pandas DataFrames",
    "version": "0.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/brycewang-stanford/pyegen/issues",
        "Documentation": "https://github.com/brycewang-stanford/pyegen/blob/main/README.md",
        "Homepage": "https://github.com/brycewang-stanford/pyegen",
        "Repository": "https://github.com/brycewang-stanford/pyegen"
    },
    "split_keywords": [
        "stata",
        " egen",
        " pandas",
        " data-analysis",
        " econometrics",
        " statistics"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2a50a4a97eb95b79e6a15515be8795df532a5536a52627d4424e9fd8db457d80",
                "md5": "7f6995ff27494fa4dce43e6b75996ce2",
                "sha256": "15f2e38cc6f5a11a8a029a18317a83865a167a0bdf3e39854804535efdcacf03"
            },
            "downloads": -1,
            "filename": "pyegen-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7f6995ff27494fa4dce43e6b75996ce2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 7257,
            "upload_time": "2025-07-25T05:14:46",
            "upload_time_iso_8601": "2025-07-25T05:14:46.363720Z",
            "url": "https://files.pythonhosted.org/packages/2a/50/a4a97eb95b79e6a15515be8795df532a5536a52627d4424e9fd8db457d80/pyegen-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e1c0f852725a5cbc4401d777859a7dac27250eb7022d77a7df92d99430cbae2d",
                "md5": "f25da919f8bf4cdbf31ef2913a4e963d",
                "sha256": "128f44859b8fe365fd99ca93863e4f326bcf7f03b357c9be67c3b8213305b328"
            },
            "downloads": -1,
            "filename": "pyegen-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f25da919f8bf4cdbf31ef2913a4e963d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 8554,
            "upload_time": "2025-07-25T05:14:47",
            "upload_time_iso_8601": "2025-07-25T05:14:47.741654Z",
            "url": "https://files.pythonhosted.org/packages/e1/c0/f852725a5cbc4401d777859a7dac27250eb7022d77a7df92d99430cbae2d/pyegen-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-25 05:14:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "brycewang-stanford",
    "github_project": "pyegen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pyegen"
}
        