prepo

| Field | Value |
| --- | --- |
| Name | prepo |
| Version | 0.2.0 |
| Home page | https://github.com/erikhox/prepo |
| Summary | A Python package with automated data type detection, KNN imputation, outlier removal, and multiple scaling methods using type-safe enum architecture |
| Upload time | 2025-07-15 14:19:00 |
| Maintainer | None |
| Docs URL | None |
| Author | Erik Hoxhaj |
| Requires Python | >=3.9 |
| License | MIT |
| Keywords | pandas, preprocessing, data-science, feature-engineering, machine-learning, automation, type-detection, knn-imputation, scaling, outlier-detection, cli, polars, pyarrow |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# Prepo

A Python package for preprocessing pandas DataFrames, with a focus on automatic data type detection, cleaning, and scaling.

## Installation

```bash
pip install prepo
```

## Usage

```python
import pandas as pd
from prepo import FeaturePreProcessor

# Create a processor instance
processor = FeaturePreProcessor()

# Load your data
df = pd.read_csv('data/raw/your_data.csv')

# Process the data
processed_df = processor.process(
    df, 
    drop_na=True,           # Drop rows with missing values
    scaler_type='standard', # Scale numeric features using standard scaling
    remove_outlier=True     # Remove outliers
)

# Save the processed data
processed_df.to_csv('data/processed/processed_data.csv', index=False)
```
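To make the `scaler_type='standard'` and `remove_outlier=True` options more concrete, here is a minimal, dependency-free sketch of what such steps usually mean. It is not taken from prepo's source: standard scaling is the usual z-score transform, while the IQR rule used for outlier removal is an assumption, and the helper names `standard_scale` and `remove_outliers_iqr` are hypothetical.

```python
import statistics


def standard_scale(values):
    """Z-score scaling: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    if std == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumption for illustration; prepo may use another method.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


ages = [22, 25, 27, 30, 31, 35, 38, 120]
kept = remove_outliers_iqr(ages)   # filter extreme values first
scaled = standard_scale(kept)      # then scale what remains
```

Filtering before scaling keeps a single extreme value from inflating the mean and standard deviation that the scaler relies on.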

## Data Type Detection

The package automatically detects the following data types:

- **temporal**: Date and time columns
- **binary**: Columns with only two unique values
- **percentage**: Columns with values between 0 and 1, or columns with names containing "perc", "rating", etc.
- **price**: Columns with names containing "price", "cost", "revenue", etc.
- **id**: Columns with names starting or ending with "id"
- **numeric**: General numeric columns
- **string**: Short text columns
- **text**: Long text columns
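
The rules above can be sketched as a small classifier. This is an illustration only, not prepo's implementation: the function name `detect_type`, the order of the checks, the exact name fragments, and the 50-character cutoff between "string" and "text" are all assumptions. It works on a column name and a plain list of values so that it runs without pandas.

```python
from datetime import date, datetime

# Name fragments and the length cutoff are illustrative assumptions,
# not values taken from prepo's source.
PERCENT_HINTS = ("perc", "rating")
PRICE_HINTS = ("price", "cost", "revenue")
TEXT_LENGTH_CUTOFF = 50  # hypothetical boundary between "string" and "text"


def detect_type(name, values):
    """Classify one column from its name and its non-missing values."""
    lowered = name.lower()
    present = [v for v in values if v is not None]

    if present and all(isinstance(v, (date, datetime)) for v in present):
        return "temporal"
    if lowered.startswith("id") or lowered.endswith("id"):
        return "id"
    if len(set(present)) == 2:
        return "binary"

    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        if any(hint in lowered for hint in PRICE_HINTS):
            return "price"
        if any(hint in lowered for hint in PERCENT_HINTS) or all(
            0 <= v <= 1 for v in present
        ):
            return "percentage"
        return "numeric"

    # Everything else is treated as text, split by the longest value.
    longest = max((len(str(v)) for v in present), default=0)
    return "text" if longest > TEXT_LENGTH_CUTOFF else "string"


columns = {
    "signup_date": [date(2024, 1, 1), date(2024, 2, 1)],
    "user_id": [1, 2, 3],
    "is_active": [True, False, True],
    "discount_perc": [5, 10, 20],
    "unit_price": [9.99, 14.5, 3.0],
    "height_cm": [170, 182, 165],
    "city": ["Oslo", "Rome", "Lima"],
    "review": ["x" * 80, "short", "mid"],
}
detected = {name: detect_type(name, values) for name, values in columns.items()}
```

One caveat of plain prefix and suffix matching is that names such as "paid" or "valid" also end in "id", so a real detector would likely need a stricter test for identifier columns.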

## Project Structure

```
prepo/
├── data/               # Data directory
│   ├── raw/            # Raw data files
│   ├── processed/      # Processed data files
│   └── test/           # Test data files
├── src/                # Source code
│   └── prepo/          # Main package
│       ├── __init__.py        # Package initialization
│       └── preprocessor.py    # Core preprocessing functionality
├── tests/              # Test directory
│   ├── __init__.py     # Test package initialization
│   └── test_preprocessor.py  # Tests for preprocessor
├── examples/           # Example scripts
│   └── basic_usage.py  # Basic usage example
├── README.md           # Project documentation
├── LICENSE             # License information
└── setup.py            # Package installation script
```

## Demo

[preposc.streamlit.app](https://preposc.streamlit.app/)

## License

This project is licensed under the MIT License; see the LICENSE file for details.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/erikhox/prepo",
    "name": "prepo",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Erik Hoxhaj <erik.hoxhaj@outlook.com>",
    "keywords": "pandas, preprocessing, data-science, feature-engineering, machine-learning, automation, type-detection, knn-imputation, scaling, outlier-detection, cli, polars, pyarrow",
    "author": "Erik Hoxhaj",
    "author_email": "Erik Hoxhaj <erik.hoxhaj@outlook.com>",
    "download_url": "https://files.pythonhosted.org/packages/df/dd/5513eeb4fc457be103e7f62a60348f8a2dbce643c88f0c27512e706ec7e2/prepo-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python package with automated data type detection, KNN imputation, outlier removal, and multiple scaling methods using type-safe enum architecture",
    "version": "0.2.0",
    "project_urls": {
        "Bug Reports": "https://github.com/erikhox/prepo/issues",
        "Changelog": "https://github.com/erikhox/prepo/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/erikhox/prepo#readme",
        "Homepage": "https://github.com/erikhox/prepo",
        "Source": "https://github.com/erikhox/prepo"
    },
    "split_keywords": [
        "pandas",
        " preprocessing",
        " data-science",
        " feature-engineering",
        " machine-learning",
        " automation",
        " type-detection",
        " knn-imputation",
        " scaling",
        " outlier-detection",
        " cli",
        " polars",
        " pyarrow"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d0f15755bae595308cc46120e819a9a108aab7e86da207c1dc64c79299043cce",
                "md5": "fa4cafc5ad7d89295283b6fbe36f5306",
                "sha256": "06e9d07455b4d98385e3d4e1bb63036c369743f3ed33c8fedce6e787592103d5"
            },
            "downloads": -1,
            "filename": "prepo-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fa4cafc5ad7d89295283b6fbe36f5306",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 16015,
            "upload_time": "2025-07-15T14:18:58",
            "upload_time_iso_8601": "2025-07-15T14:18:58.832034Z",
            "url": "https://files.pythonhosted.org/packages/d0/f1/5755bae595308cc46120e819a9a108aab7e86da207c1dc64c79299043cce/prepo-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "dfdd5513eeb4fc457be103e7f62a60348f8a2dbce643c88f0c27512e706ec7e2",
                "md5": "536812f370bb6c623447ec74b51071c1",
                "sha256": "f3eb1226512eec06b9b74d8bc9be741659eaa319b2b06007011bbf3f829a9d0e"
            },
            "downloads": -1,
            "filename": "prepo-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "536812f370bb6c623447ec74b51071c1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 23896,
            "upload_time": "2025-07-15T14:19:00",
            "upload_time_iso_8601": "2025-07-15T14:19:00.044491Z",
            "url": "https://files.pythonhosted.org/packages/df/dd/5513eeb4fc457be103e7f62a60348f8a2dbce643c88f0c27512e706ec7e2/prepo-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-15 14:19:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "erikhox",
    "github_project": "prepo",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "prepo"
}
        