scrubpy


- Name: scrubpy
- Version: 2.0.0
- Home page: https://github.com/Dhanushranga1/scrubpy
- Summary: AI-powered data cleaning assistant with multiple interfaces
- Upload time: 2025-10-11 10:28:52
- Author: Dhanush Ranga
- Requires Python: >=3.8
- License: MIT
- Keywords: data cleaning, data preprocessing, pandas, AI, machine learning, data science
- Requirements: none recorded

# 🧹 ScrubPy – AI-Powered Data Cleaning Made Simple

[![PyPI version](https://badge.fury.io/py/scrubpy.svg)](https://badge.fury.io/py/scrubpy)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> 🚀 **Transform messy data into clean, analysis-ready datasets with AI assistance, an interactive web interface, and powerful CLI tools.**

---

## ✨ What is ScrubPy?

ScrubPy is a data cleaning toolkit that combines LLM-assisted suggestions with three interfaces: a web app, a chat assistant, and a CLI. It is aimed at data scientists, analysts, and researchers who want to clean and inspect datasets without writing boilerplate pandas code.

### 🎯 Key Highlights
- **🤖 AI-Powered**: LLM integration for intelligent data cleaning suggestions
- **🌐 Web Interface**: Modern Streamlit-based GUI with drag-and-drop
- **💬 Chat Assistant**: Interactive AI guide for data cleaning workflows
- **⚡ CLI Tools**: Rich terminal interface with progress indicators
- **📊 Smart Analysis**: Automated EDA with quality scoring
- **📋 Professional Reports**: Generate PDF reports and insights

---

## Quick Start

### Installation

Install ScrubPy with a single command:

```bash
pip install scrubpy
```

### Usage Options

**🌐 Web Interface** (recommended for beginners):
```bash
scrubpy-web
```

**💬 AI Chat Assistant**:
```bash
scrubpy-chat your_data.csv
```

**⚡ CLI Interface**:
```bash
scrubpy
```

---

## 🔧 Features

### 🤖 AI-Powered Intelligence
- **Smart Suggestions**: AI analyzes your data and recommends cleaning steps
- **Natural Language Processing**: Advanced text cleaning and normalization
- **Quality Scoring**: Automatic data quality assessment and insights
- **Pattern Recognition**: Detect and fix common data issues automatically
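
The source does not say how the quality score is computed. As a rough illustration only, one common approach combines completeness (share of non-missing cells) with uniqueness (share of distinct rows). The function name `quality_score`, the equal weighting, and the list of missing-value markers below are assumptions made for this sketch, not ScrubPy's implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Values treated as "missing" in this sketch (an assumption, not ScrubPy's list).
MISSING = {None, "", "NA", "N/A", "null"}


def quality_score(rows: List[Row]) -> float:
    """Return a 0-100 score: the mean of completeness and uniqueness."""
    if not rows:
        return 0.0

    # Completeness: fraction of cells that hold a non-missing value.
    total_cells = sum(len(row) for row in rows)
    filled_cells = sum(
        1 for row in rows for value in row.values() if value not in MISSING
    )
    completeness = filled_cells / total_cells if total_cells else 0.0

    # Uniqueness: fraction of rows that are distinct from one another.
    distinct_rows = {tuple(sorted(row.items())) for row in rows}
    uniqueness = len(distinct_rows) / len(rows)

    return 100 * (completeness + uniqueness) / 2


rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Ada", "email": "ada@example.com"},  # exact duplicate
    {"name": "Bob", "email": ""},                 # missing email
    {"name": "Cy", "email": "cy@example.com"},
]
print(quality_score(rows))  # 7/8 cells filled, 3/4 rows distinct
```

A real scorer would likely also weigh type consistency and outliers; the point here is only that a single number can summarize several independent checks.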

### 🌐 Modern Interfaces
- **Web App**: Drag-and-drop file upload with real-time previews
- **Chat Assistant**: Conversational AI guide for cleaning workflows
- **Rich CLI**: Beautiful terminal interface with progress bars and colors

### 📊 Advanced Analytics
- **Smart EDA**: Automated exploratory data analysis
- **Quality Reports**: Comprehensive data quality assessments
- **Visual Insights**: Generate correlation heatmaps and distributions
- **PDF Exports**: Professional reports for stakeholders
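
A correlation heatmap is a colored rendering of a pairwise correlation matrix. The sketch below computes that matrix with the standard library only, to show what the numbers behind such a plot are. The helper names `pearson` and `correlation_matrix` and the sample columns are illustrative assumptions and are not part of ScrubPy.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise correlations: the values a heatmap would color."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 58.0, 66.0, 74.0],  # rises linearly with height
    "errors": [9.0, 7.0, 4.0, 2.0],         # falls as height rises
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```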

### 🔧 Powerful Cleaning Tools
- **Missing Values**: Smart imputation strategies
- **Duplicates**: Advanced duplicate detection and removal
- **Outliers**: Statistical outlier detection and handling
- **Data Types**: Intelligent type inference and conversion
- **Text Processing**: Advanced text standardization and cleaning
- **Validation**: Email, phone number, and custom validation rules
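
To make three of the steps above concrete, here is a minimal standard-library sketch of duplicate removal, median imputation, and outlier flagging with Tukey's IQR rule. The function names, the choice of the median, and the 1.5 multiplier are assumptions for illustration; the source does not state which strategies ScrubPy applies.

```python
import statistics
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def drop_duplicate_rows(rows: List[Row]) -> List[Row]:
    """Keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows: List[Row], column: str) -> List[Row]:
    """Replace missing (None) values in one column with the column median."""
    present = [row[column] for row in rows if row[column] is not None]
    median = statistics.median(present)
    return [
        {**row, column: median if row[column] is None else row[column]}
        for row in rows
    ]


def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


rows = [
    {"id": 1, "age": 31.0},
    {"id": 1, "age": 31.0},   # duplicate row
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 29.0},
    {"id": 4, "age": 35.0},
    {"id": 5, "age": 240.0},  # likely a data-entry error
]
cleaned = impute_median(drop_duplicate_rows(rows), "age")
ages = [row["age"] for row in cleaned]
print(ages)
print(iqr_outliers(ages))
```

The median is used rather than the mean because a single extreme value such as `240.0` would otherwise pull the imputed value upward.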

---

## Usage Examples

### Command Line Interface
```bash
# Launch interactive CLI
scrubpy

# Quick clean with default settings
scrubpy clean data.csv --output cleaned_data.csv

# Generate quality report
scrubpy analyze data.csv --report
```

### Python API
```python
import scrubpy as sp

# Load and analyze data
df = sp.load_data('messy_data.csv')
quality_score = sp.analyze_quality(df)

# AI-powered cleaning
cleaned_df = sp.smart_clean(df, ai_suggestions=True)

# Export results
sp.export_data(cleaned_df, 'cleaned_data.csv')
sp.generate_report(df, cleaned_df, 'quality_report.pdf')
```

---

## 🛠️ Installation & Setup

### System Requirements
- **Python**: 3.8 or higher
- **OS**: Windows, macOS, Linux
- **RAM**: 2GB minimum (4GB recommended for large datasets)

### Installing from PyPI
```bash
# Install latest stable version
pip install scrubpy

# Install with all AI features (quoted so shells such as zsh do not expand the brackets)
pip install "scrubpy[ai]"

# Install with the development extras
pip install "scrubpy[dev]"
```

### Verify Installation
```bash
scrubpy --version
scrubpy-web --help
scrubpy-chat --help
```
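
The same check can be made from Python, which is handy inside notebooks or setup scripts. This sketch relies only on the standard-library `importlib.metadata` module (available from Python 3.8); the helper name `installed_version` is an assumption for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("scrubpy")
if version is None:
    print("scrubpy is not installed in this environment")
else:
    print(f"scrubpy {version} is installed")
```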

---

## 🏗️ Architecture

ScrubPy is built with modern Python practices and modular design:

```
scrubpy/
├── core.py              # Core data processing engine
├── cli.py               # Rich CLI interface
├── chat_assistant.py    # AI chat interface
├── quality_analyzer.py  # Data quality assessment
├── llm_utils.py         # AI/LLM integration
├── eda_analysis.py      # Exploratory data analysis
├── validation.py        # Data validation rules
├── web/                 # Streamlit web interface
├── config/              # Configuration templates
└── utils/               # Utility functions
```

---

## 🤝 Contributing

We welcome contributions! See our [Contributing Guide](docs/CONTRIBUTING.md) for details.

### Development Setup
```bash
git clone https://github.com/dhanushranga1/scrubpy.git
cd scrubpy
pip install -e ".[dev]"
```

### Running Tests
```bash
pytest tests/
```

---

## 📈 Performance

ScrubPy is optimized for performance:
- **Memory Efficient**: Processes large datasets with minimal RAM usage
- **Fast Processing**: Vectorized operations with pandas and numpy
- **Streaming Support**: Handle datasets larger than memory
- **Parallel Processing**: Multi-core support for intensive operations
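
Handling a dataset larger than memory usually means reading it in fixed-size chunks and keeping only a running result. The sketch below shows that general pattern with the standard-library `csv` module; it is not ScrubPy's streaming code, and the names `read_in_chunks` and `count_missing` are hypothetical.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterable, Iterator, List, TextIO


def read_in_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunk_size` rows so only one chunk is in memory."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_missing(chunks: Iterable[List[Dict[str, str]]], column: str) -> int:
    """Count empty or whitespace-only cells in one column across all chunks."""
    return sum(
        1 for chunk in chunks for row in chunk if not (row[column] or "").strip()
    )


# An in-memory file stands in for a large CSV on disk.
sample = io.StringIO("id,city\n1,Pune\n2,\n3,Delhi\n4,  \n5,Kochi\n")
missing = count_missing(read_in_chunks(sample, chunk_size=2), "city")
print(missing)
```

With a real file, `open(path, newline="")` would replace the `io.StringIO` object, and the chunk size would be tuned to the available RAM.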

---

## 🎯 Use Cases

### Data Science Workflows
- **EDA**: Quick exploratory data analysis
- **Preprocessing**: Clean data before ML pipelines  
- **Quality Assessment**: Validate data quality metrics

### Business Analytics
- **CRM Data**: Clean customer databases
- **Sales Data**: Process transaction records
- **Survey Data**: Clean and standardize responses
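
For survey and CRM data, standardizing usually means normalizing case, whitespace, and Unicode form, then mapping variants onto canonical labels, plus a basic syntactic check on fields such as email. The sketch below is illustrative only: the `CANONICAL` mapping, the deliberately loose email pattern, and both function names are assumptions, not ScrubPy's rules.

```python
import re
import unicodedata
from typing import Dict, Optional

# Hypothetical mapping from free-text answers to canonical labels.
CANONICAL: Dict[str, str] = {
    "y": "yes", "yes": "yes", "yeah": "yes",
    "n": "no", "no": "no", "nope": "no",
}

# Loose shape check: something@something.something, with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def standardize_response(raw: str) -> Optional[str]:
    """Normalize Unicode, whitespace, and case, then map to a canonical label."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return CANONICAL.get(text)  # None marks an answer that needs manual review


def looks_like_email(value: str) -> bool:
    """Syntactic check only; it does not prove the address exists."""
    return EMAIL_PATTERN.match(value.strip()) is not None


print([standardize_response(r) for r in ["  YES ", "Nope", "y", "maybe"]])
print([looks_like_email(e) for e in ["ada@example.com", "bob@example", "not an email"]])
```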

### Research & Academia
- **Dataset Preparation**: Clean research datasets
- **Statistical Analysis**: Prepare data for statistical tests
- **Report Generation**: Create professional data quality reports

---

## 🔗 Links

- **Documentation**: [Full documentation and API reference](docs/)
- **GitHub**: [Source code and issues](https://github.com/dhanushranga1/scrubpy)
- **PyPI**: [Package on Python Package Index](https://pypi.org/project/scrubpy/)
- **Changelog**: [Release history and updates](CHANGELOG.md)

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

ScrubPy is built on the shoulders of giants:
- **pandas** & **numpy** for data processing
- **Streamlit** for the beautiful web interface
- **Typer** & **Rich** for the modern CLI experience
- **scikit-learn** for machine learning utilities

---

**Made with ❤️ for the data community**

---

## 🌐 Folder Structure
```
scrubpy/
├── cli.py                     # Main CLI interface
├── core.py                    # Core cleaning logic
├── preview.py                 # Preview operations before applying
├── profiling.py               # Dataset profiling & suggestions
├── export_profiling_report.py # Export detailed profiling reports
```

---

## 🛠️ Requirements
- Python 3.8+
- pandas
- numpy
- typer
- rich
- InquirerPy
- scipy

---

## ✨ What’s Next?
We plan to add smart visual exports, column intelligence, and eventually ML-powered cleaning.

---

## 🎉 Why This Exists
Sometimes you just need a quick tool to clean and inspect your data without writing boilerplate pandas code. ScrubPy helps you do that, even if you're not a data wizard.

---

## 📚 License
MIT

---

Made with ❤️ by a student learning to make tools that help others.


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Dhanushranga1/scrubpy",
    "name": "scrubpy",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "ScrubPy Team <support@scrubpy.dev>",
    "keywords": "data cleaning, data preprocessing, pandas, AI, machine learning, data science",
    "author": "Dhanush Ranga",
    "author_email": "ScrubPy Team <support@scrubpy.dev>",
    "download_url": "https://files.pythonhosted.org/packages/c2/48/5841f7e7a4dda110ae98233a8c26b35626273f0b5d0c3c28e01b77ff2a83/scrubpy-2.0.0.tar.gz",
    "platform": "any",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "AI-powered data cleaning assistant with multiple interfaces",
    "version": "2.0.0",
    "project_urls": {
        "Bug Reports": "https://github.com/Dhanushranga1/scrubpy/issues",
        "Changelog": "https://github.com/Dhanushranga1/scrubpy/blob/main/CHANGELOG.md",
        "Documentation": "https://scrubpy.readthedocs.io",
        "Homepage": "https://github.com/Dhanushranga1/scrubpy",
        "Repository": "https://github.com/Dhanushranga1/scrubpy.git"
    },
    "split_keywords": [
        "data cleaning",
        " data preprocessing",
        " pandas",
        " ai",
        " machine learning",
        " data science"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5df5096c6bb3d9538ad5fea312c245c44cea8689470ba97fdd2357f1a162059c",
                "md5": "cecb335bd3088496495d69ec0527a5f3",
                "sha256": "311afb8511d141d44d762a35d5ae7d962f94816ad32229b1c8c150e24962bd93"
            },
            "downloads": -1,
            "filename": "scrubpy-2.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cecb335bd3088496495d69ec0527a5f3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 484439,
            "upload_time": "2025-10-11T10:28:46",
            "upload_time_iso_8601": "2025-10-11T10:28:46.658105Z",
            "url": "https://files.pythonhosted.org/packages/5d/f5/096c6bb3d9538ad5fea312c245c44cea8689470ba97fdd2357f1a162059c/scrubpy-2.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c2485841f7e7a4dda110ae98233a8c26b35626273f0b5d0c3c28e01b77ff2a83",
                "md5": "3c546a04383f519f652e029f2a37ee08",
                "sha256": "9bbab02090717b2a6a7e10305fe15412b4f89dbf8bbcc76f44a6456c97d1217d"
            },
            "downloads": -1,
            "filename": "scrubpy-2.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3c546a04383f519f652e029f2a37ee08",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 616622,
            "upload_time": "2025-10-11T10:28:52",
            "upload_time_iso_8601": "2025-10-11T10:28:52.620778Z",
            "url": "https://files.pythonhosted.org/packages/c2/48/5841f7e7a4dda110ae98233a8c26b35626273f0b5d0c3c28e01b77ff2a83/scrubpy-2.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-11 10:28:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Dhanushranga1",
    "github_project": "scrubpy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "scrubpy"
}
        