# Cleaning Agent
Intelligent data cleaning agent for automated data quality improvement.
## 🚀 Features
- **Automated Data Quality Analysis**: Detect missing values, duplicates, outliers, and data type inconsistencies
- **Intelligent Cleaning Strategies**: AI-powered decision making for optimal cleaning approaches
- **LLM-Driven Cleaning**: Leverage Large Language Models to automatically generate and execute Python code for complex data cleaning tasks
- **Multiple Data Format Support**: CSV, Excel, JSON, Parquet, and pandas DataFrames
- **Comprehensive Reporting**: Detailed cleaning reports with metrics and recommendations
- **Configurable Parameters**: Customize cleaning behavior and thresholds
- **Command Line Interface**: Easy-to-use CLI for batch processing
- **Python API**: Simple integration into existing workflows (see the sketch below)
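
A minimal sketch of what Python API integration might look like, assuming a `CleaningAgent` class with a `clean` method that accepts a pandas DataFrame and returns a response dictionary shaped like the one in the Output and Reporting section below. The import path, class name, and method name are assumptions, not the package's documented API.

```python
# Hedged usage sketch -- CleaningAgent and clean() are assumed names,
# not confirmed against the package's public API.
import pandas as pd
from cleaning_agent import CleaningAgent  # assumed import path

df = pd.DataFrame({
    "age": [25, None, 31, 31, 200],           # missing value and an outlier
    "city": ["NY", "SF", None, "SF", "SF"],   # missing categorical value
})

agent = CleaningAgent()          # assumed constructor
response = agent.clean(df)       # assumed entry point

if response["success"]:
    cleaned_df = response["cleaned_data"]
    print(response["report"]["quality_metrics"])
```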
## 🏗️ Architecture
The Cleaning Agent follows a modular architecture:
```
CleaningAgent
├── DataQualityAnalyzer # Analyzes data quality and detects issues
├── CleaningValidator # Validates cleaned data and provides assessment
├── Configuration # Manages agent settings and parameters
└── Models # Data structures for requests, responses, and reports
```
## Data Quality Metrics
- **Overall Quality Score**: 0-1 scale based on multiple factors
- **Missing Value Analysis**: Per-column missing value statistics
- **Duplicate Analysis**: Duplicate row counts and percentages
- **Data Type Analysis**: Column data type distribution
- **Uniqueness Analysis**: Unique value counts per column
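
The analyses above map naturally onto plain pandas operations. The sketch below is illustrative only and is not the agent's internal scoring logic.

```python
import pandas as pd

def basic_quality_metrics(df: pd.DataFrame) -> dict:
    """Illustrative per-column quality statistics computed with pandas."""
    return {
        "missing_pct_per_column": df.isna().mean().to_dict(),   # missing value analysis
        "duplicate_rows": int(df.duplicated().sum()),            # duplicate analysis
        "duplicate_pct": float(df.duplicated().mean()),
        "dtypes": df.dtypes.astype(str).to_dict(),                # data type analysis
        "unique_counts": df.nunique().to_dict(),                  # uniqueness analysis
    }
```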
## 🔍 Supported Data Quality Issues
### Missing Values
- **Detection**: Automatic identification of columns with missing data
- **Handling**: Smart imputation strategies (median for numerical, mode for categorical); see the sketch below
- **Thresholds**: Configurable missing value percentage limits
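
A minimal pandas sketch of the median/mode imputation described above; the agent may select different strategies per column.

```python
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and non-numeric columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out
```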
### Duplicate Rows
- **Detection**: Identifies exact and near-duplicate rows
- **Removal**: Configurable duplicate removal strategies (sketched below)
- **Analysis**: Reports duplicate patterns and impact
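
Exact duplicate detection and removal can be expressed directly with pandas primitives, as in the sketch below; near-duplicate matching would require additional similarity logic not shown here.

```python
import pandas as pd

def drop_exact_duplicates(df: pd.DataFrame, keep: str = "first") -> tuple[pd.DataFrame, int]:
    """Drop exact duplicate rows and report how many were removed."""
    n_removed = int(df.duplicated(keep=keep).sum())
    return df.drop_duplicates(keep=keep), n_removed
```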
### Data Type Inconsistencies
- **Detection**: Identifies columns with mixed or inappropriate data types
- **Standardization**: Converts data types for consistency (sketched below)
- **Validation**: Ensures data type appropriateness
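
One common standardization step is coercing object columns whose values are mostly numeric. The sketch below is hedged: the threshold and the coercion rule are illustrative choices, not the agent's documented behavior.

```python
import pandas as pd

def standardize_numeric_columns(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Convert object columns to numeric when most non-null values parse cleanly."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        non_null = out[col].notna().sum()
        if non_null == 0:
            continue
        converted = pd.to_numeric(out[col], errors="coerce")
        if converted.notna().sum() / non_null >= threshold:
            out[col] = converted
    return out
```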
### Outliers
- **Detection**: Statistical outlier detection using IQR method
- **Handling**: Configurable outlier treatment (capping, removal, investigation); see the sketch below
- **Impact Assessment**: Reports outlier impact on data quality
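
The IQR method flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; a minimal sketch of detection and capping for a single numeric column:

```python
import pandas as pd

def cap_outliers_iqr(values: pd.Series, factor: float = 1.5) -> pd.Series:
    """Cap a numeric Series at the Tukey fences derived from the IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return values.clip(lower=lower, upper=upper)
```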
## Developer Setup and Testing
### Setup Instructions
1. Clone the repository and check out the feature branch:
```bash
git clone https://github.com/stepfnAI/cleaning_agent.git
cd cleaning_agent
git checkout review
```
2. Install uv (if not already installed):
```bash
# Option A: Using the standalone installer (recommended for macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Option B: Using pip (if uv is already in an existing environment)
pip install uv
```
3. Create and activate a virtual environment:
```bash
uv venv --python=3.10 venv
source venv/bin/activate
```
4. Install the project in editable mode with development dependencies:
```bash
uv pip install -e ".[dev]"
```
5. Clone and set up the sfn_blueprint dependency:
```bash
cd ..
git clone https://github.com/stepfnAI/sfn_blueprint.git
cd sfn_blueprint
source ../cleaning_agent/venv/bin/activate
git checkout dev
uv pip install -e .
cd ../cleaning_agent
```
6. Set your OpenAI API key:
```bash
export OPENAI_API_KEY='your-api-key-here'
```
### Example
1. Run the example script:
```bash
python examples/basic_usage.py
```
### Running Tests
1. Run the test suite:
```bash
# Run all tests
pytest tests/ -s
# Run specific test files
pytest tests/test_agent.py -s
pytest tests/test_context_integration.py -s
pytest tests/test_execution_validation.py -s
pytest tests/test_llm_driven_cleaning.py -s
pytest tests/test_llm_driven_cleaning_with_sql.py -s
```
#### Test Structure
```
tests/
├── test_agent.py # Agent functionality tests
├── test_context_integration.py # Context integration tests
├── test_execution_validation.py # Execution validation tests
├── test_llm_driven_cleaning.py # LLM-driven cleaning tests
└── test_llm_driven_cleaning_with_sql.py # SQL cleaning tests
```
#### Test Dependencies
The following development dependencies are installed with the `[dev]` extra:
- `pytest>=7.0.0` - Test framework
- `pytest-cov>=4.0.0` - Coverage reporting
- `black>=23.0.0` - Code formatting
- `isort>=5.12.0` - Import sorting
- `flake8>=6.0.0` - Linting
- `mypy>=1.0.0` - Type checking
## 📊 Output and Reporting
### Cleaning Response
```python
{
"success": True,
"cleaned_data": DataFrame,
"report": {
"report_id": "uuid",
"timestamp": "2024-01-01T00:00:00Z",
"data_summary": {
"original_shape": (1000, 10),
"cleaned_shape": (950, 10),
"rows_removed": 50,
"columns_processed": 10
},
"issues_detected": [...],
"cleaning_operations": [...],
"quality_metrics": {
"original_quality_score": 0.65,
"final_quality_score": 0.89,
"improvement": 0.24
},
"recommendations": [...],
"execution_time": 2.34
},
"message": "Data cleaning completed successfully",
"errors": [],
"metadata": {...}
}
```
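
A short sketch of consuming this response; the field names come from the structure above, and the helper function itself is purely illustrative.

```python
def summarize_response(response: dict) -> None:
    """Print key fields from a cleaning response dict shaped like the example above."""
    if not response["success"]:
        for err in response["errors"]:
            print("Cleaning failed:", err)
        return
    report = response["report"]
    summary = report["data_summary"]
    metrics = report["quality_metrics"]
    print(f"Shape: {summary['original_shape']} -> {summary['cleaned_shape']}")
    print(f"Quality: {metrics['original_quality_score']:.2f} -> "
          f"{metrics['final_quality_score']:.2f} (+{metrics['improvement']:.2f})")
```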
## Additional Information
- **Python Version**: 3.10+
- **Dependencies**: Managed through `pyproject.toml`
- **Code Style**: Follows PEP 8 with Black formatting