cleaning-agent


| Field | Value |
| --- | --- |
| Name | cleaning-agent |
| Version | 0.1.0 |
| Summary | Intelligent data cleaning agent for automated data quality improvement |
| Upload time | 2025-10-15 07:49:25 |
| Requires Python | >=3.10 |
| License | MIT |
| Keywords | data-cleaning, data-quality, ai-agent, machine-learning, data-preprocessing |

# Cleaning Agent

Intelligent data cleaning agent for automated data quality improvement.

## 🚀 Features

- **Automated Data Quality Analysis**: Detect missing values, duplicates, outliers, and data type inconsistencies
- **Intelligent Cleaning Strategies**: AI-powered decision making for optimal cleaning approaches
- **LLM-Driven Cleaning**: Leverage Large Language Models to automatically generate and execute Python code for complex data cleaning tasks.
- **Multiple Data Format Support**: CSV, Excel, JSON, Parquet, and pandas DataFrames
- **Comprehensive Reporting**: Detailed cleaning reports with metrics and recommendations
- **Configurable Parameters**: Customize cleaning behavior and thresholds
- **Command Line Interface**: Easy-to-use CLI for batch processing
- **Python API**: Simple integration into existing workflows
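
The multi-format support listed above can be pictured as a small dispatcher that picks a pandas reader from the file extension. This is only a sketch: `load_table` and `READERS` are hypothetical names, not part of the package's documented API, and the Excel and Parquet readers need optional pandas dependencies (such as `openpyxl` or `pyarrow`) to be installed.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader function.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_table(source):
    """Return a DataFrame from a DataFrame or from a supported file path."""
    if isinstance(source, pd.DataFrame):
        # Already in memory: pass it through unchanged.
        return source
    path = Path(source)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    return reader(path)
```

Calling `load_table("data.csv")` would read the file with `pd.read_csv`, while an unknown extension raises a `ValueError` before any I/O happens.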

## 🏗️ Architecture

The Cleaning Agent follows a modular architecture:

```
CleaningAgent
├── DataQualityAnalyzer    # Analyzes data quality and detects issues
├── CleaningValidator      # Validates cleaned data and provides assessment
├── Configuration          # Manages agent settings and parameters
└── Models                 # Data structures for requests, responses, and reports
```

## Data Quality Metrics
- **Overall Quality Score**: 0-1 scale based on multiple factors
- **Missing Value Analysis**: Per-column missing value statistics
- **Duplicate Analysis**: Duplicate row counts and percentages
- **Data Type Analysis**: Column data type distribution
- **Uniqueness Analysis**: Unique value counts per column
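
Most of the metrics above map onto one-line pandas expressions. The sketch below shows one way to compute them. `basic_quality_metrics` is a hypothetical helper, not the package's `DataQualityAnalyzer`, and it leaves out the overall quality score because the README does not specify how that score is weighted.

```python
import pandas as pd


def basic_quality_metrics(df: pd.DataFrame) -> dict:
    """Compute simple per-column and per-table quality statistics."""
    n_rows = len(df)
    duplicate_rows = int(df.duplicated().sum())
    return {
        # Count and percentage of missing values per column
        "missing_count": df.isna().sum().to_dict(),
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        # Exact duplicate rows (every occurrence after the first)
        "duplicate_rows": duplicate_rows,
        "duplicate_pct": round(100 * duplicate_rows / n_rows, 2) if n_rows else 0.0,
        # Number of columns that share each dtype
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        # Number of distinct non-null values per column
        "unique_counts": df.nunique().to_dict(),
    }


sample = pd.DataFrame(
    {
        "id": [1, 2, 2, 4],
        "city": ["Paris", "Rome", "Rome", None],
    }
)
metrics = basic_quality_metrics(sample)
print(metrics["missing_pct"])
print(metrics["duplicate_rows"])
```

In this sample, one of the four rows repeats an earlier row and one `city` value is missing, so both the duplicate percentage and the `city` missing percentage come out at 25.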

## 🔍 Supported Data Quality Issues

### Missing Values
- **Detection**: Automatic identification of columns with missing data
- **Handling**: Smart imputation strategies (median for numerical, mode for categorical)
- **Thresholds**: Configurable missing value percentage limits
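
The imputation rule described above (median for numerical columns, mode for categorical ones) can be sketched with plain pandas. `impute_missing` is a hypothetical function written for illustration; the agent's own implementation and its threshold handling may differ.

```python
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill gaps with the median (numeric columns) or the mode (other columns)."""
    result = df.copy()
    for column in result.columns:
        series = result[column]
        if not series.isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(series):
            fill_value = series.median()
        else:
            modes = series.mode(dropna=True)
            if modes.empty:
                continue  # column is entirely missing; leave it untouched
            fill_value = modes.iloc[0]
        result[column] = series.fillna(fill_value)
    return result


raw = pd.DataFrame(
    {
        "age": [20.0, None, 40.0, 60.0],
        "color": ["red", "blue", None, "red"],
    }
)
filled = impute_missing(raw)
print(filled)
```

The median is preferred over the mean here because it is not pulled toward extreme values, which matters when the same column may also contain outliers.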

### Duplicate Rows
- **Detection**: Identifies exact and near-duplicate rows
- **Removal**: Configurable duplicate removal strategies
- **Analysis**: Reports duplicate patterns and impact
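
Exact duplicates are straightforward to find with `DataFrame.duplicated`. Near-duplicates need a definition, and the README does not give one. The sketch below assumes that two rows are near-duplicates when they match after trimming whitespace and lowercasing text values. Both that rule and the name `find_duplicates` are illustrative assumptions.

```python
import pandas as pd


def normalize_text(value):
    """Trim and lowercase strings; leave every other value unchanged."""
    return value.strip().lower() if isinstance(value, str) else value


def find_duplicates(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Return boolean masks for exact and near-duplicate rows."""
    exact = df.duplicated(keep="first")
    # Compare rows again after normalizing text in every column.
    normalized = df.apply(lambda col: col.map(normalize_text))
    near = normalized.duplicated(keep="first") & ~exact
    return exact, near


people = pd.DataFrame(
    {
        "name": ["Ann", "ann ", "Bob", "Bob"],
        "city": ["Oslo", "oslo", "Rome", "Rome"],
    }
)
exact, near = find_duplicates(people)
deduplicated = people[~(exact | near)]
print(deduplicated)
```

Keeping the two masks separate makes it possible to report exact and near-duplicate counts individually before deciding which rows to drop.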

### Data Type Inconsistencies
- **Detection**: Identifies columns with mixed or inappropriate data types
- **Standardization**: Converts data types for consistency
- **Validation**: Ensures data type appropriateness
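
A common case of type inconsistency is a numeric column stored as text because a few entries are placeholders such as `n/a`. The sketch below converts a column only when most of its values parse as numbers. The helper names and the 0.75 cut-off are assumptions chosen for illustration, not values taken from the package.

```python
import pandas as pd


def numeric_share(series: pd.Series) -> float:
    """Fraction of non-null values that can be parsed as numbers."""
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")
    return float(parsed.notna().mean())


def standardize_numeric(series: pd.Series, min_share: float = 0.75) -> pd.Series:
    """Convert to a numeric dtype when enough values parse; otherwise return as-is."""
    if numeric_share(series) >= min_share:
        # Unparseable entries become NaN instead of raising an error.
        return pd.to_numeric(series, errors="coerce")
    return series


prices = pd.Series(["10", "12.5", "n/a", "7", "3"])
names = pd.Series(["a", "b", "3"])
converted = standardize_numeric(prices)
unchanged = standardize_numeric(names)
print(converted)
```

Values that fail to parse turn into missing values, so a conversion like this is best followed by the missing-value handling described earlier.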

### Outliers
- **Detection**: Statistical outlier detection using IQR method
- **Handling**: Configurable outlier treatment (capping, removal, investigation)
- **Impact Assessment**: Reports outlier impact on data quality
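
The IQR method flags values that fall more than `k` times the interquartile range below the first quartile or above the third quartile, with `k = 1.5` as the conventional choice. The functions below are a minimal sketch of detection and capping; their names are hypothetical and the agent's configuration options are not shown.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences of the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def flag_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside the fences."""
    lower, upper = iqr_bounds(series, k)
    return (series < lower) | (series > upper)


def cap_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the fences instead of removing rows."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])
print(iqr_bounds(values))
print(flag_outliers(values).tolist())
print(cap_outliers(values).tolist())
```

Capping keeps the row count unchanged, while removal (`values[~flag_outliers(values)]`) shrinks the dataset; which one is appropriate depends on whether the extreme values are errors or genuine observations.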


## Developer Setup and Testing

### Setup Instructions

1. Clone the repository and check out the `review` branch:
   ```bash
   git clone https://github.com/stepfnAI/cleaning_agent.git 
   cd cleaning_agent
   git checkout review
   ```

2. Install uv (if not already installed):
   ```bash
   # Option A: Using the standalone installer (recommended for macOS/Linux)
   curl -LsSf https://astral.sh/uv/install.sh | sh
   
   # Option B: Using pip (installs uv into the currently active Python environment)
   pip install uv
   ```

3. Create and activate a virtual environment:
   ```bash
   uv venv --python=3.10 venv
   source venv/bin/activate
   ```

4. Install the project in editable mode with development dependencies:
   ```bash
   uv pip install -e ".[dev]"
   ```

5. Clone and set up the sfn_blueprint dependency:
   ```bash
   cd ..
   git clone https://github.com/stepfnAI/sfn_blueprint.git
   cd sfn_blueprint
   source ../cleaning_agent/venv/bin/activate
   git checkout dev
   uv pip install -e .
   cd ../cleaning_agent
   ```

6. Set your OpenAI API key:
   ```bash
   export OPENAI_API_KEY='your-api-key-here'
   ```
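
Before running the examples or the LLM-driven tests, it can help to confirm that the key from step 6 is visible to Python. The check below is a small sketch; `missing_settings` is a hypothetical helper and not part of the package.

```python
import os
from collections.abc import Mapping


def missing_settings(
    env: Mapping[str, str],
    required: tuple[str, ...] = ("OPENAI_API_KEY",),
) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_settings(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary.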

### Example 

1. Run the example script:
   ```bash
   python examples/basic_usage.py
   ```


### Running Tests

1. Run the test suite:
   ```bash
   # Run all tests
   pytest tests/ -s
   
   # Run specific test files
   pytest tests/test_agent.py -s
   pytest tests/test_context_integration.py -s 
   pytest tests/test_execution_validation.py -s 
   pytest tests/test_llm_driven_cleaning.py -s
   pytest tests/test_llm_driven_cleaning_with_sql.py -s
   ```

#### Test Structure

```
tests/
├── test_agent.py                                        # Agent functionality tests
├── test_context_integration.py                          # Context integration tests
├── test_execution_validation.py                         # Execution validation tests
├── test_llm_driven_cleaning.py                          # LLM-driven cleaning tests
└── test_llm_driven_cleaning_with_sql.py                 # SQL cleaning tests
```

#### Test Dependencies
The following development dependencies are installed with the `dev` extra (`uv pip install -e ".[dev]"`):
- `pytest>=7.0.0` - Test framework
- `pytest-cov>=4.0.0` - Coverage reporting
- `black>=23.0.0` - Code formatting
- `isort>=5.12.0` - Import sorting
- `flake8>=6.0.0` - Linting
- `mypy>=1.0.0` - Type checking

## 📊 Output and Reporting

### Cleaning Response
```python
{
    "success": True,
    "cleaned_data": DataFrame,
    "report": {
        "report_id": "uuid",
        "timestamp": "2024-01-01T00:00:00Z",
        "data_summary": {
            "original_shape": (1000, 10),
            "cleaned_shape": (950, 10),
            "rows_removed": 50,
            "columns_processed": 10
        },
        "issues_detected": [...],
        "cleaning_operations": [...],
        "quality_metrics": {
            "original_quality_score": 0.65,
            "final_quality_score": 0.89,
            "improvement": 0.24
        },
        "recommendations": [...],
        "execution_time": 2.34
    },
    "message": "Data cleaning completed successfully",
    "errors": [],
    "metadata": {...}
}
```
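
A caller will typically check `success` first and then read the report fields. The sketch below assumes the response is available as a plain dictionary with the keys shown above. `summarize_response` is a hypothetical helper, and the example dictionary reuses the illustrative values from the structure above.

```python
def summarize_response(response: dict) -> str:
    """Build a one-line summary from a cleaning response dictionary."""
    if not response.get("success"):
        errors = "; ".join(str(e) for e in response.get("errors", []))
        return f"Cleaning failed: {errors or 'unknown error'}"
    report = response["report"]
    summary = report["data_summary"]
    metrics = report["quality_metrics"]
    rows_before = summary["original_shape"][0]
    rows_after = summary["cleaned_shape"][0]
    return (
        f"{rows_before} -> {rows_after} rows "
        f"({summary['rows_removed']} removed), "
        f"quality {metrics['original_quality_score']:.2f} -> "
        f"{metrics['final_quality_score']:.2f} "
        f"in {report['execution_time']:.2f}s"
    )


example = {
    "success": True,
    "report": {
        "data_summary": {
            "original_shape": (1000, 10),
            "cleaned_shape": (950, 10),
            "rows_removed": 50,
        },
        "quality_metrics": {
            "original_quality_score": 0.65,
            "final_quality_score": 0.89,
        },
        "execution_time": 2.34,
    },
    "errors": [],
}
line = summarize_response(example)
print(line)
```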

## Additional Information

- **Python Version**: 3.10+
- **Dependencies**: Managed through `pyproject.toml`
- **Code Style**: Follows PEP 8 with Black formatting

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "cleaning-agent",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "StepFunction AI <team@stepfunction.ai>",
    "keywords": "data-cleaning, data-quality, ai-agent, machine-learning, data-preprocessing",
    "author": null,
    "author_email": "StepFunction AI <team@stepfunction.ai>",
    "download_url": "https://files.pythonhosted.org/packages/5b/a9/dfcc55f8fd3e9c25f88593a07db4dad001c8c024c6d2d40ea3db12a313e0/cleaning_agent-0.1.0.tar.gz",
    "platform": null,
    "description": "(README text omitted here; it duplicates the Cleaning Agent README shown above)",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Intelligent data cleaning agent for automated data quality improvement",
    "version": "0.1.0",
    "project_urls": {
        "Changelog": "https://github.com/stepfnAI/cleaning-agent/blob/main/CHANGELOG.md",
        "Documentation": "https://cleaning-agent.readthedocs.io",
        "Homepage": "https://github.com/stepfnAI/cleaning-agent",
        "Issues": "https://github.com/stepfnAI/cleaning-agent/issues",
        "Repository": "https://github.com/stepfnAI/cleaning-agent.git"
    },
    "split_keywords": [
        "data-cleaning",
        " data-quality",
        " ai-agent",
        " machine-learning",
        " data-preprocessing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "729e4fdc03bbbb13bf5bf67f3c1f903a2ddf7c2bbb76076ca845968f088fe30b",
                "md5": "71140d77b7bf48b46c0e3b1cc84557d1",
                "sha256": "1f50b02ca9f90ee98f03b6d89be6220feb1910164c41537150d7adad5c11b9b6"
            },
            "downloads": -1,
            "filename": "cleaning_agent-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "71140d77b7bf48b46c0e3b1cc84557d1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 40878,
            "upload_time": "2025-10-15T07:49:23",
            "upload_time_iso_8601": "2025-10-15T07:49:23.714648Z",
            "url": "https://files.pythonhosted.org/packages/72/9e/4fdc03bbbb13bf5bf67f3c1f903a2ddf7c2bbb76076ca845968f088fe30b/cleaning_agent-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5ba9dfcc55f8fd3e9c25f88593a07db4dad001c8c024c6d2d40ea3db12a313e0",
                "md5": "b8e6f3c51f462e1c9f9d0e6a8cfaa33e",
                "sha256": "7b72780a2b09f6827c93020b8e42eb3424e13d003ef8c4221850e2dba5b584f7"
            },
            "downloads": -1,
            "filename": "cleaning_agent-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "b8e6f3c51f462e1c9f9d0e6a8cfaa33e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 48854,
            "upload_time": "2025-10-15T07:49:25",
            "upload_time_iso_8601": "2025-10-15T07:49:25.130570Z",
            "url": "https://files.pythonhosted.org/packages/5b/a9/dfcc55f8fd3e9c25f88593a07db4dad001c8c024c6d2d40ea3db12a313e0/cleaning_agent-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-15 07:49:25",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "stepfnAI",
    "github_project": "cleaning-agent",
    "github_not_found": true,
    "lcname": "cleaning-agent"
}
        