# 🧹 ScrubPy – AI-Powered Data Cleaning Made Simple
[PyPI version](https://badge.fury.io/py/scrubpy)
[Python](https://www.python.org/downloads/)
[License: MIT](https://opensource.org/licenses/MIT)
> 🚀 **Transform messy data into clean, analysis-ready datasets with AI assistance, an interactive web interface, and powerful CLI tools.**
---
## ✨ What is ScrubPy?
ScrubPy is an advanced data cleaning toolkit that combines AI-powered intelligence with user-friendly interfaces. Whether you're a data scientist, analyst, or researcher, ScrubPy helps you clean and understand your datasets faster than ever.
### 🎯 Key Highlights
- **🤖 AI-Powered**: LLM integration for intelligent data cleaning suggestions
- **🌐 Web Interface**: Modern Streamlit-based GUI with drag-and-drop
- **💬 Chat Assistant**: Interactive AI guide for data cleaning workflows
- **⚡ CLI Tools**: Rich terminal interface with progress indicators
- **📊 Smart Analysis**: Automated EDA with quality scoring
- **📋 Professional Reports**: Generate PDF reports and insights
---
## Quick Start
### Installation
Install ScrubPy with a single command:
```bash
pip install scrubpy
```
### Usage Options
**🌐 Web Interface** (Recommended for beginners):
```bash
scrubpy-web
```
**💬 AI Chat Assistant**:
```bash
scrubpy-chat your_data.csv
```
**⚡ CLI Interface**:
```bash
scrubpy
```
---
## 🔧 Features
### 🤖 AI-Powered Intelligence
- **Smart Suggestions**: AI analyzes your data and recommends cleaning steps
- **Natural Language Processing**: Advanced text cleaning and normalization
- **Quality Scoring**: Automatic data quality assessment and insights
- **Pattern Recognition**: Detect and fix common data issues automatically
### 🌐 Modern Interfaces
- **Web App**: Drag-and-drop file upload with real-time previews
- **Chat Assistant**: Conversational AI guide for cleaning workflows
- **Rich CLI**: Beautiful terminal interface with progress bars and colors
### 📊 Advanced Analytics
- **Smart EDA**: Automated exploratory data analysis
- **Quality Reports**: Comprehensive data quality assessments
- **Visual Insights**: Generate correlation heatmaps and distributions
- **PDF Exports**: Professional reports for stakeholders
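
To make the idea of a quality assessment concrete, here is a minimal sketch in plain pandas. It is not ScrubPy's scoring logic: `quality_summary` is a hypothetical helper, and the two ratios it reports (completeness and uniqueness) are only assumed examples of what such a report might contain.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Return simple data-quality indicators (hypothetical helper, not ScrubPy's scorer)."""
    total_cells = df.size
    missing_cells = int(df.isna().sum().sum())
    duplicate_rows = int(df.duplicated().sum())
    # Share of cells that hold a value
    completeness = 1.0 if total_cells == 0 else 1 - missing_cells / total_cells
    # Share of rows that are not exact repeats of an earlier row
    uniqueness = 1.0 if len(df) == 0 else 1 - duplicate_rows / len(df)
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": missing_cells,
        "duplicate_rows": duplicate_rows,
        "completeness": round(completeness, 3),
        "uniqueness": round(uniqueness, 3),
    }


sample = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", None],
    "age": [31, 45, 45, 28],
})
summary = quality_summary(sample)
print(summary)
```

Two plain ratios are easy to explain to stakeholders, but a real report would likely weight many more checks.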
### 🔧 Powerful Cleaning Tools
- **Missing Values**: Smart imputation strategies
- **Duplicates**: Advanced duplicate detection and removal
- **Outliers**: Statistical outlier detection and handling
- **Data Types**: Intelligent type inference and conversion
- **Text Processing**: Advanced text standardization and cleaning
- **Validation**: Email, phone number, and custom validation rules
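
For readers who want to see what these steps look like underneath, the following is a rough sketch of the same kinds of operations written directly in pandas. It does not use or reproduce ScrubPy's internals. The sample data, the median fill, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical messy input
df = pd.DataFrame({
    "city": [" Paris", "paris ", "Berlin", "Berlin", "Rome", "Oslo"],
    "price": [10.0, 10.0, None, 12.0, 11.0, 500.0],
})

# Text processing: trim whitespace and normalise capitalisation
df["city"] = df["city"].str.strip().str.title()

# Missing values: fill numeric gaps with the column median
df["price"] = df["price"].fillna(df["price"].median())

# Duplicates: drop rows that repeat an earlier row exactly
df = df.drop_duplicates().reset_index(drop=True)

# Outliers: keep rows within 1.5 * IQR of the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range].reset_index(drop=True)

print(cleaned)
```

The order matters here: text is standardised before de-duplication so that `" Paris"` and `"paris "` are recognised as the same value.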
---
## Usage Examples
### Command Line Interface
```bash
# Launch interactive CLI
scrubpy
# Quick clean with default settings
scrubpy clean data.csv --output cleaned_data.csv
# Generate quality report
scrubpy analyze data.csv --report
```
### Python API
```python
import scrubpy as sp
# Load and analyze data
df = sp.load_data('messy_data.csv')
quality_score = sp.analyze_quality(df)
# AI-powered cleaning
cleaned_df = sp.smart_clean(df, ai_suggestions=True)
# Export results
sp.export_data(cleaned_df, 'cleaned_data.csv')
sp.generate_report(df, cleaned_df, 'quality_report.pdf')
```
---
## 🛠️ Installation & Setup
### System Requirements
- **Python**: 3.8 or higher
- **OS**: Windows, macOS, Linux
- **RAM**: 2GB minimum (4GB recommended for large datasets)
### Installing from PyPI
```bash
# Install latest stable version
pip install scrubpy
# Install with the AI extras (quotes stop shells such as zsh from expanding the brackets)
pip install "scrubpy[ai]"
# Install with the development extras
pip install "scrubpy[dev]"
```
### Verify Installation
```bash
scrubpy --version
scrubpy-web --help
scrubpy-chat --help
```
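
The installation can also be checked from Python. This sketch uses only the standard library (`importlib.metadata`, available from Python 3.8); `installed_version` is a hypothetical helper name, not part of ScrubPy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string when ScrubPy is installed, otherwise None
print(installed_version("scrubpy"))
```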
---
## 🏗️ Architecture
ScrubPy is built with modern Python practices and modular design:
```
scrubpy/
├── core.py              # Core data processing engine
├── cli.py               # Rich CLI interface
├── chat_assistant.py    # AI chat interface
├── quality_analyzer.py  # Data quality assessment
├── llm_utils.py         # AI/LLM integration
├── eda_analysis.py      # Exploratory data analysis
├── validation.py        # Data validation rules
├── web/                 # Streamlit web interface
├── config/              # Configuration templates
└── utils/               # Utility functions
```
---
## 🤝 Contributing
We welcome contributions! See our [Contributing Guide](docs/CONTRIBUTING.md) for details.
### Development Setup
```bash
git clone https://github.com/dhanushranga1/scrubpy.git
cd scrubpy
# Editable install with the development extras (quoted for zsh compatibility)
pip install -e ".[dev]"
```
### Running Tests
```bash
pytest tests/
```
---
## 📈 Performance
ScrubPy is optimized for performance:
- **Memory Efficient**: Processes large datasets with minimal RAM usage
- **Fast Processing**: Vectorized operations with pandas and numpy
- **Streaming Support**: Handle datasets larger than memory
- **Parallel Processing**: Multi-core support for intensive operations
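
As an illustration of the streaming idea, the sketch below processes a CSV in fixed-size chunks with pandas' `chunksize` option, so only one chunk is held in memory at a time. This shows the general technique rather than how ScrubPy implements it. The inline CSV text and the chunk size of 2 are assumptions for the demo.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path to a large file
csv_text = "id,value\n1,10\n2,\n3,30\n4,40\n5,\n"

rows_seen = 0
missing_values = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    rows_seen += len(chunk)
    missing_values += int(chunk["value"].isna().sum())

print(f"rows: {rows_seen}, missing values: {missing_values}")
```

Aggregates such as counts and sums combine cleanly across chunks. Operations that need the whole column at once, such as an exact median, require a different approach.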
---
## 🎯 Use Cases
### Data Science Workflows
- **EDA**: Quick exploratory data analysis
- **Preprocessing**: Clean data before ML pipelines
- **Quality Assessment**: Validate data quality metrics
### Business Analytics
- **CRM Data**: Clean customer databases
- **Sales Data**: Process transaction records
- **Survey Data**: Clean and standardize responses
### Research & Academia
- **Dataset Preparation**: Clean research datasets
- **Statistical Analysis**: Prepare data for statistical tests
- **Report Generation**: Create professional data quality reports
---
## 🔗 Links
- **Documentation**: [Full documentation and API reference](docs/)
- **GitHub**: [Source code and issues](https://github.com/dhanushranga1/scrubpy)
- **PyPI**: [Package on Python Package Index](https://pypi.org/project/scrubpy/)
- **Changelog**: [Release history and updates](CHANGELOG.md)
---
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## 🙏 Acknowledgments
ScrubPy is built on the shoulders of giants:
- **pandas** & **numpy** for data processing
- **Streamlit** for the beautiful web interface
- **Typer** & **Rich** for the modern CLI experience
- **scikit-learn** for machine learning utilities
---
**Made with ❤️ for the data community**
---
## 🌐 Folder Structure
```
scrubpy/
├── cli.py                      # Main CLI interface
├── core.py                     # Core cleaning logic
├── preview.py                  # Preview operations before applying
├── profiling.py                # Dataset profiling & suggestions
└── export_profiling_report.py  # Export detailed profiling reports
```
---
## 🛠️ Requirements
- Python 3.8+
- pandas
- numpy
- typer
- rich
- InquirerPy
- scipy
---
## ✨ What’s Next?
We plan to add smart visual exports, column intelligence, and eventually ML-powered cleaning.
---
## 🎉 Why This Exists
Sometimes you just need a quick tool to clean and inspect your data without writing boilerplate pandas code. ScrubPy helps you do that, even if you're not a data wizard.
---
Made with ❤️ by a student learning to make tools that help others.
Raw data
{
"_id": null,
"home_page": "https://github.com/Dhanushranga1/scrubpy",
"name": "scrubpy",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "ScrubPy Team <support@scrubpy.dev>",
"keywords": "data cleaning, data preprocessing, pandas, AI, machine learning, data science",
"author": "Dhanush Ranga",
"author_email": "ScrubPy Team <support@scrubpy.dev>",
"download_url": "https://files.pythonhosted.org/packages/c2/48/5841f7e7a4dda110ae98233a8c26b35626273f0b5d0c3c28e01b77ff2a83/scrubpy-2.0.0.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "MIT",
"summary": "AI-powered data cleaning assistant with multiple interfaces",
"version": "2.0.0",
"project_urls": {
"Bug Reports": "https://github.com/Dhanushranga1/scrubpy/issues",
"Changelog": "https://github.com/Dhanushranga1/scrubpy/blob/main/CHANGELOG.md",
"Documentation": "https://scrubpy.readthedocs.io",
"Homepage": "https://github.com/Dhanushranga1/scrubpy",
"Repository": "https://github.com/Dhanushranga1/scrubpy.git"
},
"split_keywords": [
"data cleaning",
" data preprocessing",
" pandas",
" ai",
" machine learning",
" data science"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "5df5096c6bb3d9538ad5fea312c245c44cea8689470ba97fdd2357f1a162059c",
"md5": "cecb335bd3088496495d69ec0527a5f3",
"sha256": "311afb8511d141d44d762a35d5ae7d962f94816ad32229b1c8c150e24962bd93"
},
"downloads": -1,
"filename": "scrubpy-2.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cecb335bd3088496495d69ec0527a5f3",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 484439,
"upload_time": "2025-10-11T10:28:46",
"upload_time_iso_8601": "2025-10-11T10:28:46.658105Z",
"url": "https://files.pythonhosted.org/packages/5d/f5/096c6bb3d9538ad5fea312c245c44cea8689470ba97fdd2357f1a162059c/scrubpy-2.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "c2485841f7e7a4dda110ae98233a8c26b35626273f0b5d0c3c28e01b77ff2a83",
"md5": "3c546a04383f519f652e029f2a37ee08",
"sha256": "9bbab02090717b2a6a7e10305fe15412b4f89dbf8bbcc76f44a6456c97d1217d"
},
"downloads": -1,
"filename": "scrubpy-2.0.0.tar.gz",
"has_sig": false,
"md5_digest": "3c546a04383f519f652e029f2a37ee08",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 616622,
"upload_time": "2025-10-11T10:28:52",
"upload_time_iso_8601": "2025-10-11T10:28:52.620778Z",
"url": "https://files.pythonhosted.org/packages/c2/48/5841f7e7a4dda110ae98233a8c26b35626273f0b5d0c3c28e01b77ff2a83/scrubpy-2.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-11 10:28:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Dhanushranga1",
"github_project": "scrubpy",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "scrubpy"
}