# Prepo
A Python package for preprocessing pandas DataFrames, with a focus on automatic data type detection, cleaning, and scaling.
## Installation
```bash
pip install prepo
```
## Usage
```python
import pandas as pd
from prepo import FeaturePreProcessor
# Create a processor instance
processor = FeaturePreProcessor()
# Load your data
df = pd.read_csv('data/raw/your_data.csv')
# Process the data
processed_df = processor.process(
    df,
    drop_na=True,            # Drop rows with missing values
    scaler_type='standard',  # Scale numeric features using standard scaling
    remove_outlier=True      # Remove outliers
)
# Save the processed data
processed_df.to_csv('data/processed/processed_data.csv', index=False)
```
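The README does not say how `remove_outlier=True` or `scaler_type='standard'` work internally. As a rough illustration of what such steps commonly do, the sketch below uses plain pandas with the 1.5 × IQR rule for outliers and z-score scaling. The helper names `drop_iqr_outliers` and `standard_scale`, and the choice of the IQR rule, are assumptions made for this example and may differ from what prepo actually does.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within k * IQR of the quartiles (hypothetical helper)."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        spread = q3 - q1
        # Rows with a missing value in `col` fail this check and are dropped too.
        keep &= df[col].between(q1 - k * spread, q3 + k * spread)
    return df[keep]


def standard_scale(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Rescale each column to zero mean and unit variance (hypothetical helper)."""
    scaled = df.copy()
    for col in columns:
        # Population standard deviation (ddof=0); assumes the column is not constant.
        scaled[col] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    return scaled


if __name__ == "__main__":
    frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
    cleaned = drop_iqr_outliers(frame, ["x"])
    print(standard_scale(cleaned, ["x"]))
```

Removing outliers before scaling keeps extreme values from distorting the mean and standard deviation used for the rescale.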
## Data Type Detection
The package automatically detects the following data types:
- **temporal**: Date and time columns
- **binary**: Columns with only two unique values
- **percentage**: Columns with values between 0 and 1, or columns with names containing "perc", "rating", etc.
- **price**: Columns with names containing "price", "cost", "revenue", etc.
- **id**: Columns with names ending or starting with "id"
- **numeric**: General numeric columns
- **string**: Short text columns
- **text**: Long text columns
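The rules above can be sketched as a single function that inspects a column's name and values. This is only an illustration: the function name `guess_column_type`, the order in which the rules are checked, and the 50-character cutoff between `string` and `text` are assumptions, not prepo's actual implementation.

```python
import pandas as pd


def guess_column_type(series: pd.Series, text_length_cutoff: int = 50) -> str:
    """Guess a column type from its name and values (hypothetical sketch)."""
    name = str(series.name).lower()
    values = series.dropna()

    if pd.api.types.is_datetime64_any_dtype(series):
        return "temporal"
    # Name-based check; note that it also matches names such as "paid".
    if name.startswith("id") or name.endswith("id"):
        return "id"
    if values.nunique() == 2:
        return "binary"
    if pd.api.types.is_numeric_dtype(series):
        if any(key in name for key in ("price", "cost", "revenue")):
            return "price"
        in_unit_range = (not values.empty) and bool(values.between(0, 1).all())
        if any(key in name for key in ("perc", "rating")) or in_unit_range:
            return "percentage"
        return "numeric"
    # Anything left is treated as text; the cutoff is an assumed threshold.
    longest = values.astype(str).str.len().max() if not values.empty else 0
    return "text" if longest > text_length_cutoff else "string"


if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "signup_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
            "user_id": [1, 2, 3],
            "is_active": [True, False, True],
            "unit_price": [9.99, 19.5, 4.0],
            "city": ["Oslo", "Lima", "Pune"],
        }
    )
    for column in df.columns:
        print(column, "->", guess_column_type(df[column]))
```

Because several rules can match the same column (a two-valued column named `price_flag`, for example), the order of the checks decides the result, so any real detector has to pick a precedence.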
## Project Structure
```
prepo/
├── data/                        # Data directory
│   ├── raw/                     # Raw data files
│   ├── processed/               # Processed data files
│   └── test/                    # Test data files
├── src/                         # Source code
│   └── prepo/                   # Main package
│       ├── __init__.py          # Package initialization
│       └── preprocessor.py      # Core preprocessing functionality
├── tests/                       # Test directory
│   ├── __init__.py              # Test package initialization
│   └── test_preprocessor.py     # Tests for preprocessor
├── examples/                    # Example scripts
│   └── basic_usage.py           # Basic usage example
├── README.md                    # Project documentation
├── LICENSE                      # License information
└── setup.py                     # Package installation script
```
## Demo
[preposc.streamlit.app](https://preposc.streamlit.app/)
## License
This project is licensed under the MIT License - see the LICENSE file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/erikhox/prepo",
"name": "prepo",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Erik Hoxhaj <erik.hoxhaj@outlook.com>",
"keywords": "pandas, preprocessing, data-science, feature-engineering, machine-learning, automation, type-detection, knn-imputation, scaling, outlier-detection, cli, polars, pyarrow",
"author": "Erik Hoxhaj",
"author_email": "Erik Hoxhaj <erik.hoxhaj@outlook.com>",
"download_url": "https://files.pythonhosted.org/packages/df/dd/5513eeb4fc457be103e7f62a60348f8a2dbce643c88f0c27512e706ec7e2/prepo-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python package with automated data type detection, KNN imputation, outlier removal, and multiple scaling methods using type-safe enum architecture",
"version": "0.2.0",
"project_urls": {
"Bug Reports": "https://github.com/erikhox/prepo/issues",
"Changelog": "https://github.com/erikhox/prepo/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/erikhox/prepo#readme",
"Homepage": "https://github.com/erikhox/prepo",
"Source": "https://github.com/erikhox/prepo"
},
"split_keywords": [
"pandas",
" preprocessing",
" data-science",
" feature-engineering",
" machine-learning",
" automation",
" type-detection",
" knn-imputation",
" scaling",
" outlier-detection",
" cli",
" polars",
" pyarrow"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "d0f15755bae595308cc46120e819a9a108aab7e86da207c1dc64c79299043cce",
"md5": "fa4cafc5ad7d89295283b6fbe36f5306",
"sha256": "06e9d07455b4d98385e3d4e1bb63036c369743f3ed33c8fedce6e787592103d5"
},
"downloads": -1,
"filename": "prepo-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fa4cafc5ad7d89295283b6fbe36f5306",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 16015,
"upload_time": "2025-07-15T14:18:58",
"upload_time_iso_8601": "2025-07-15T14:18:58.832034Z",
"url": "https://files.pythonhosted.org/packages/d0/f1/5755bae595308cc46120e819a9a108aab7e86da207c1dc64c79299043cce/prepo-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "dfdd5513eeb4fc457be103e7f62a60348f8a2dbce643c88f0c27512e706ec7e2",
"md5": "536812f370bb6c623447ec74b51071c1",
"sha256": "f3eb1226512eec06b9b74d8bc9be741659eaa319b2b06007011bbf3f829a9d0e"
},
"downloads": -1,
"filename": "prepo-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "536812f370bb6c623447ec74b51071c1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 23896,
"upload_time": "2025-07-15T14:19:00",
"upload_time_iso_8601": "2025-07-15T14:19:00.044491Z",
"url": "https://files.pythonhosted.org/packages/df/dd/5513eeb4fc457be103e7f62a60348f8a2dbce643c88f0c27512e706ec7e2/prepo-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-15 14:19:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "erikhox",
"github_project": "prepo",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "prepo"
}