datatidy

Name: datatidy
Version: 1.0.1
Summary: A powerful, configuration-driven data processing and cleaning package
Home page: https://github.com/wwd1015/datatidy
Author: DataTidy Development Team
License: MIT
Requires Python: >=3.8
Keywords: data processing, data cleaning, ETL, configuration-driven, pandas, data transformation
Requirements: pandas, PyYAML, sqlalchemy, openpyxl, snowflake-sqlalchemy, psycopg2-binary, pymysql, jsonschema
Upload time: 2025-07-29 01:42:49
            <div align="center">
  <img src="assets/datatidy-logo-pypi.png" alt="DataTidy Logo" width="300">
  
  <h3>Configuration-Driven Data Processing Made Simple</h3>
  
  [![PyPI version](https://badge.fury.io/py/datatidy.svg)](https://pypi.org/project/datatidy/)
  [![Python versions](https://img.shields.io/pypi/pyversions/datatidy.svg)](https://pypi.org/project/datatidy/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Downloads](https://pepy.tech/badge/datatidy)](https://pepy.tech/project/datatidy)
</div>

# DataTidy

A powerful, configuration-driven data processing and cleaning package for Python. DataTidy lets you define complex data transformations, validations, and cleaning steps through simple YAML configuration files.

## 🚀 Key Features

- **🔧 Configuration-Driven**: Define all transformations in YAML - no code required
- **📊 Multiple Data Sources**: CSV, Excel, databases (PostgreSQL, MySQL, Snowflake, etc.)
- **🔗 Multi-Input Joins**: Combine data from multiple sources with flexible join operations
- **⚡ Advanced Operations**: Map/reduce/filter with lambda functions and chained operations
- **🧠 Dependency Resolution**: Automatic execution order planning for complex transformations
- **📈 Time Series Support**: Lag operations and rolling window calculations
- **🛡️ Safe Expressions**: Secure evaluation with whitelist-based security
- **🎯 Data Validation**: Comprehensive validation rules with detailed error reporting
- **⚙️ CLI Interface**: Easy-to-use command-line tools for batch processing
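
For intuition, execution-order planning for derived columns can be done with a topological sort. The sketch below is an illustration of the general technique, not DataTidy's actual internals, and the column names are made up:

```python
# Illustrative sketch of dependency resolution for derived columns
# (an assumption about the approach, not DataTidy's code).
from collections import deque

def execution_order(deps):
    """Kahn's algorithm: order columns so each one's dependencies come first."""
    # Only count dependencies that are themselves derived columns.
    indegree = {c: sum(1 for p in deps[c] if p in deps) for c in deps}
    children = {c: [] for c in deps}
    for c, parents in deps.items():
        for p in parents:
            if p in deps:
                children[p].append(c)
    queue = deque(c for c in deps if indegree[c] == 0)
    order = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for child in children[c]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(deps):
        raise ValueError("circular dependency between columns")
    return order

# Hypothetical columns: bmi must be computed before risk_level, which uses it.
deps = {"bmi": ["weight", "height"], "risk_level": ["bmi", "age"]}
print(execution_order(deps))  # ['bmi', 'risk_level']
```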


## Installation

```bash
pip install datatidy
```

For development installation:
```bash
git clone https://github.com/wwd1015/datatidy.git
cd datatidy
pip install -e ".[dev]"
```

## Quick Start

### 1. Create a sample configuration

```bash
datatidy sample config.yaml
```

### 2. Process your data

```bash
datatidy process config.yaml -i input.csv -o output.csv
```

### 3. Or use programmatically

```python
from datatidy import DataTidy

# Initialize with configuration
dt = DataTidy('config.yaml')

# Process data
result = dt.process_data('input.csv')

# Save result
dt.process_and_save('output.csv', 'input.csv')
```

## Configuration Structure

DataTidy uses YAML configuration files to define data processing pipelines:

```yaml
input:
  type: csv                    # csv, excel, database
  source: "data/input.csv"     # file path or SQL query
  options:
    encoding: utf-8
    delimiter: ","

output:
  columns:
    user_id:
      source: "id"             # Source column name
      type: int                # Data type conversion
      validation:
        required: true
        min_value: 1
    
    full_name:
      source: "name"
      type: string
      transformation: "str.title()"  # Python expression
      validation:
        required: true
        min_length: 2
        max_length: 100
    
    age_group:
      transformation: "'adult' if age >= 18 else 'minor'"
      type: string
      validation:
        allowed_values: ["adult", "minor"]

  filters:
    - condition: "age >= 0"
      action: keep

  sort:
    - column: user_id
      ascending: true

global_settings:
  ignore_errors: false
  max_errors: 100
```
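
Because the configuration is plain YAML, it can be parsed and inspected with PyYAML (already a DataTidy dependency) before handing it to the library. A minimal sketch with a trimmed-down config; whether the two top-level sections are strictly required is an assumption mirroring the structure above:

```python
import yaml  # PyYAML, a DataTidy dependency

# Trimmed-down version of the configuration shown above.
CONFIG_TEXT = """
input:
  type: csv
  source: "data/input.csv"
output:
  columns:
    user_id:
      source: "id"
      type: int
"""

config = yaml.safe_load(CONFIG_TEXT)
missing = [key for key in ("input", "output") if key not in config]
if missing:
    raise ValueError(f"missing top-level sections: {missing}")
print(sorted(config["output"]["columns"]))  # ['user_id']
```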

## Examples

### Basic CSV Processing

```python
from datatidy import DataTidy

config = {
    "input": {
        "type": "csv",
        "source": "users.csv"
    },
    "output": {
        "columns": {
            "clean_name": {
                "source": "name",
                "transformation": "str.strip().title()",
                "type": "string"
            },
            "age_category": {
                "transformation": "'senior' if age > 65 else ('adult' if age >= 18 else 'minor')",
                "type": "string"
            }
        }
    }
}

dt = DataTidy()
dt.load_config(config)
result = dt.process_data()
print(result)
```

### Database Processing

```yaml
input:
  type: database
  source: 
    query: "SELECT * FROM users WHERE active = true"
    connection_string: "postgresql://user:pass@localhost/db"

output:
  columns:
    user_email:
      source: "email"
      type: string
      validation:
        pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
    
    signup_date:
      source: "created_at"
      type: datetime
      format: "%Y-%m-%d"
```
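
Conceptually, a database input like this amounts to building a SQLAlchemy engine from the connection string and running the query with pandas (both are DataTidy dependencies) — a sketch of the idea, not necessarily how DataTidy's reader is implemented. An in-memory SQLite database stands in for PostgreSQL so the example is runnable:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the PostgreSQL connection string above.
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE users (email TEXT, active INTEGER)"))
    conn.execute(text(
        "INSERT INTO users VALUES ('a@example.com', 1), ('b@example.com', 0)"
    ))

# pandas accepts a SQLAlchemy engine directly.
df = pd.read_sql("SELECT * FROM users WHERE active = 1", engine)
print(df["email"].tolist())  # ['a@example.com']
```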

### Excel Processing with Complex Transformations

```yaml
input:
  type: excel
  source:
    path: "sales_data.xlsx"
    sheet_name: "Q1_Sales"
    options:
      header: 0
      skiprows: 2

output:
  columns:
    revenue_category:
      transformation: |
        'high' if revenue > 100000 else (
          'medium' if revenue > 50000 else 'low'
        )
      validation:
        allowed_values: ["high", "medium", "low"]
    
    formatted_date:
      source: "sale_date"
      type: datetime
      format: "%Y-%m-%d"
    
    clean_product_name:
      source: "product"
      transformation: "str.strip().upper().replace('_', ' ')"
      validation:
        min_length: 1
        max_length: 50

  filters:
    - condition: "revenue > 0"
      action: keep
    - condition: "product != 'DELETED'"
      action: keep
```
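
For intuition, the two `keep` filters above behave like boolean masks in pandas (illustrative only; DataTidy evaluates the condition strings itself):

```python
import pandas as pd

# Toy rows; the last two are dropped by the filters above.
df = pd.DataFrame({
    "revenue": [120000, -50, 60000],
    "product": ["WIDGET_A", "WIDGET_B", "DELETED"],
})
kept = df[(df["revenue"] > 0) & (df["product"] != "DELETED")]
print(kept["product"].tolist())  # ['WIDGET_A']
```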

## Command Line Usage

### Process Data
```bash
# Basic processing
datatidy process config.yaml

# With input/output files
datatidy process config.yaml -i input.csv -o output.csv

# Ignore validation errors
datatidy process config.yaml --ignore-errors
```

### Validate Configuration
```bash
datatidy validate config.yaml
```

### Create Sample Configuration
```bash
datatidy sample my_config.yaml
```

## Expression System

DataTidy includes a safe expression parser that supports:

### Basic Operations
- Arithmetic: `+`, `-`, `*`, `/`, `//`, `%`, `**`
- Comparison: `==`, `!=`, `<`, `<=`, `>`, `>=`
- Logical: `and`, `or`, `not`
- Membership: `in`, `not in`

### Functions
- Type conversion: `str()`, `int()`, `float()`, `bool()`
- Math: `abs()`, `max()`, `min()`, `round()`
- String methods: `upper()`, `lower()`, `strip()`, `replace()`, etc.
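
A toy version of such a whitelist-based evaluator can be built with the standard-library `ast` module. This illustrates the general technique, not DataTidy's actual implementation, and function/method calls are omitted for brevity:

```python
import ast

# Node types the toy evaluator accepts; anything else is rejected.
ALLOWED = (
    ast.Expression, ast.Constant, ast.Name, ast.Load,
    ast.BinOp, ast.UnaryOp, ast.BoolOp, ast.Compare, ast.IfExp,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.FloorDiv, ast.Mod, ast.Pow,
    ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    ast.And, ast.Or, ast.Not, ast.USub, ast.In, ast.NotIn,
    ast.Tuple, ast.List,
)

def safe_eval(expr, variables):
    """Evaluate `expr` with `variables` in scope, rejecting unsafe syntax."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    # Empty builtins so only the whitelisted expression forms can run.
    return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}}, dict(variables))

print(safe_eval("'adult' if age >= 18 else 'minor'", {"age": 21}))  # adult
```

An expression such as `__import__('os').system('rm -rf /')` is rejected before evaluation, because `ast.Call` and `ast.Attribute` are not on the allow-list.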

### Examples
```yaml
transformations:
  # Conditional expressions
  status: "'active' if last_login_days < 30 else 'inactive'"
  
  # String operations
  clean_name: "name.strip().title()"
  
  # Mathematical calculations
  bmi: "weight / (height / 100) ** 2"
  
  # Complex conditions
  risk_level: |
    'high' if (age > 65 and income < 30000) else (
      'medium' if age > 40 else 'low'
    )
```

## Validation Rules

DataTidy supports comprehensive validation:

```yaml
validation:
  required: true              # Value must be present
  nullable: false             # Null values are rejected
  min_value: 0               # Minimum numeric value
  max_value: 100             # Maximum numeric value
  min_length: 2              # Minimum string length
  max_length: 50             # Maximum string length
  pattern: "^[A-Za-z]+$"     # Regex pattern
  allowed_values: ["A", "B"] # Whitelist of values
```
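
As an illustration (an assumption about behavior, not DataTidy's real validator), rules like these can be checked per value roughly as follows:

```python
import re

def check(value, rules):
    """Return a list of rule violations for a single value (toy version)."""
    errors = []
    if value is None:
        if rules.get("required") or rules.get("nullable") is False:
            errors.append("value is null")
        return errors
    if "min_value" in rules and value < rules["min_value"]:
        errors.append(f"{value} < min_value {rules['min_value']}")
    if "max_value" in rules and value > rules["max_value"]:
        errors.append(f"{value} > max_value {rules['max_value']}")
    if "min_length" in rules and len(str(value)) < rules["min_length"]:
        errors.append("too short")
    if "max_length" in rules and len(str(value)) > rules["max_length"]:
        errors.append("too long")
    if "pattern" in rules and not re.match(rules["pattern"], str(value)):
        errors.append("pattern mismatch")
    if "allowed_values" in rules and value not in rules["allowed_values"]:
        errors.append("not in allowed_values")
    return errors

print(check("Ab", {"min_length": 2, "pattern": "^[A-Za-z]+$"}))  # []
print(check(None, {"required": True}))                           # ['value is null']
```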

## Error Handling

```python
dt = DataTidy('config.yaml')
result = dt.process_data('input.csv')

# Check for errors
if dt.has_errors():
    for error in dt.get_errors():
        print(f"Error: {error['message']}")
```

## API Reference

### DataTidy Class

#### Methods
- `load_config(config)`: Load configuration from file or dict
- `process_data(data=None)`: Process data according to configuration
- `process_and_save(output_path, data=None)`: Process and save data
- `get_errors()`: Get list of processing errors
- `has_errors()`: Check if errors occurred

### Configuration Schema

See [Configuration Reference](docs/configuration.md) for complete schema documentation.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Changelog

### Version 0.1.0
- Initial release
- Basic CSV, Excel, and database support
- Safe expression engine
- Comprehensive validation system
- CLI interface

            
