<div align="center">
<img src="assets/datatidy-logo-pypi.png" alt="DataTidy Logo" width="300">
<h3>Configuration-Driven Data Processing Made Simple</h3>
[](https://pypi.org/project/datatidy/)
[](https://pypi.org/project/datatidy/)
[](https://opensource.org/licenses/MIT)
[](https://pepy.tech/project/datatidy)
</div>
# DataTidy
A powerful, configuration-driven data processing and cleaning package for Python. DataTidy allows you to define complex data transformations, validations, and cleanings through simple YAML configuration files.
## 🚀 Key Features
- **🔧 Configuration-Driven**: Define all transformations in YAML - no code required
- **📊 Multiple Data Sources**: CSV, Excel, databases (PostgreSQL, MySQL, Snowflake, etc.)
- **🔗 Multi-Input Joins**: Combine data from multiple sources with flexible join operations
- **⚡ Advanced Operations**: Map/reduce/filter with lambda functions and chained operations
- **🧠 Dependency Resolution**: Automatic execution order planning for complex transformations
- **📈 Time Series Support**: Lag operations and rolling window calculations
- **🛡️ Safe Expressions**: Secure evaluation with whitelist-based security
- **🎯 Data Validation**: Comprehensive validation rules with detailed error reporting
- **⚙️ CLI Interface**: Easy-to-use command-line tools for batch processing
## Installation
```bash
pip install datatidy
```
For development installation:
```bash
git clone https://github.com/your-repo/datatidy.git
cd datatidy
pip install -e ".[dev]"
```
## Quick Start
### 1. Create a sample configuration
```bash
datatidy sample config.yaml
```
### 2. Process your data
```bash
datatidy process config.yaml -i input.csv -o output.csv
```
### 3. Or use programmatically
```python
from datatidy import DataTidy
# Initialize with configuration
dt = DataTidy('config.yaml')
# Process data
result = dt.process_data('input.csv')
# Save result
dt.process_and_save('output.csv', 'input.csv')
```
## Configuration Structure
DataTidy uses YAML configuration files to define data processing pipelines:
```yaml
input:
type: csv # csv, excel, database
source: "data/input.csv" # file path or SQL query
options:
encoding: utf-8
delimiter: ","
output:
columns:
user_id:
source: "id" # Source column name
type: int # Data type conversion
validation:
required: true
min_value: 1
full_name:
source: "name"
type: string
transformation: "str.title()" # Python expression
validation:
required: true
min_length: 2
max_length: 100
age_group:
transformation: "'adult' if age >= 18 else 'minor'"
type: string
validation:
allowed_values: ["adult", "minor"]
filters:
- condition: "age >= 0"
action: keep
sort:
- column: user_id
ascending: true
global_settings:
ignore_errors: false
max_errors: 100
```
## Examples
### Basic CSV Processing
```python
from datatidy import DataTidy
config = {
"input": {
"type": "csv",
"source": "users.csv"
},
"output": {
"columns": {
"clean_name": {
"source": "name",
"transformation": "str.strip().title()",
"type": "string"
},
"age_category": {
"transformation": "'senior' if age > 65 else ('adult' if age >= 18 else 'minor')",
"type": "string"
}
}
}
}
dt = DataTidy()
dt.load_config(config)
result = dt.process_data()
print(result)
```
### Database Processing
```yaml
input:
type: database
source:
query: "SELECT * FROM users WHERE active = true"
connection_string: "postgresql://user:pass@localhost/db"
output:
columns:
user_email:
source: "email"
type: string
validation:
pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
signup_date:
source: "created_at"
type: datetime
format: "%Y-%m-%d"
```
### Excel Processing with Complex Transformations
```yaml
input:
type: excel
source:
path: "sales_data.xlsx"
sheet_name: "Q1_Sales"
options:
header: 0
skiprows: 2
output:
columns:
revenue_category:
transformation: |
'high' if revenue > 100000 else (
'medium' if revenue > 50000 else 'low'
)
validation:
allowed_values: ["high", "medium", "low"]
formatted_date:
source: "sale_date"
type: datetime
format: "%Y-%m-%d"
clean_product_name:
source: "product"
transformation: "str.strip().upper().replace('_', ' ')"
validation:
min_length: 1
max_length: 50
filters:
- condition: "revenue > 0"
action: keep
- condition: "product != 'DELETED'"
action: keep
```
## Command Line Usage
### Process Data
```bash
# Basic processing
datatidy process config.yaml
# With input/output files
datatidy process config.yaml -i input.csv -o output.csv
# Ignore validation errors
datatidy process config.yaml --ignore-errors
```
### Validate Configuration
```bash
datatidy validate config.yaml
```
### Create Sample Configuration
```bash
datatidy sample my_config.yaml
```
## Expression System
DataTidy includes a safe expression parser that supports:
### Basic Operations
- Arithmetic: `+`, `-`, `*`, `/`, `//`, `%`, `**`
- Comparison: `==`, `!=`, `<`, `<=`, `>`, `>=`
- Logical: `and`, `or`, `not`
- Membership: `in`, `not in`
### Functions
- Type conversion: `str()`, `int()`, `float()`, `bool()`
- Math: `abs()`, `max()`, `min()`, `round()`
- String methods: `upper()`, `lower()`, `strip()`, `replace()`, etc.
### Examples
```yaml
transformations:
# Conditional expressions
status: "'active' if last_login_days < 30 else 'inactive'"
# String operations
clean_name: "name.strip().title()"
# Mathematical calculations
bmi: "weight / (height / 100) ** 2"
# Complex conditions
risk_level: |
'high' if (age > 65 and income < 30000) else (
'medium' if age > 40 else 'low'
)
```
## Validation Rules
DataTidy supports comprehensive validation:
```yaml
validation:
required: true # Field must not be null
nullable: false # Field cannot be null
min_value: 0 # Minimum numeric value
max_value: 100 # Maximum numeric value
min_length: 2 # Minimum string length
max_length: 50 # Maximum string length
pattern: "^[A-Za-z]+$" # Regex pattern
allowed_values: ["A", "B"] # Whitelist of values
```
## Error Handling
```python
dt = DataTidy('config.yaml')
result = dt.process_data('input.csv')
# Check for errors
if dt.has_errors():
for error in dt.get_errors():
print(f"Error: {error['message']}")
```
## API Reference
### DataTidy Class
#### Methods
- `load_config(config)`: Load configuration from file or dict
- `process_data(data=None)`: Process data according to configuration
- `process_and_save(output_path, data=None)`: Process and save data
- `get_errors()`: Get list of processing errors
- `has_errors()`: Check if errors occurred
### Configuration Schema
See [Configuration Reference](docs/configuration.md) for complete schema documentation.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## License
MIT License - see LICENSE file for details.
## Changelog
### Version 0.1.0
- Initial release
- Basic CSV, Excel, and database support
- Safe expression engine
- Comprehensive validation system
- CLI interface
Raw data
{
"_id": null,
"home_page": "https://github.com/wwd1015/datatidy",
"name": "datatidy",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "DataTidy Development Team <maintainer@example.com>",
"keywords": "data processing, data cleaning, ETL, configuration-driven, pandas, data transformation",
"author": "DataTidy Development Team",
"author_email": "DataTidy Development Team <maintainer@example.com>",
"download_url": "https://files.pythonhosted.org/packages/ad/6f/831d3cc847cb48d4bfd3c80a40b13bce50fed11a7a81a87bbc7595f64463/datatidy-1.0.1.tar.gz",
"platform": null,
"description": "<div align=\"center\">\n <img src=\"assets/datatidy-logo-pypi.png\" alt=\"DataTidy Logo\" width=\"300\">\n \n <h3>Configuration-Driven Data Processing Made Simple</h3>\n \n [](https://pypi.org/project/datatidy/)\n [](https://pypi.org/project/datatidy/)\n [](https://opensource.org/licenses/MIT)\n [](https://pepy.tech/project/datatidy)\n</div>\n\n# DataTidy\n\nA powerful, configuration-driven data processing and cleaning package for Python. DataTidy allows you to define complex data transformations, validations, and cleanings through simple YAML configuration files.\n\n## \ud83d\ude80 Key Features\n\n- **\ud83d\udd27 Configuration-Driven**: Define all transformations in YAML - no code required\n- **\ud83d\udcca Multiple Data Sources**: CSV, Excel, databases (PostgreSQL, MySQL, Snowflake, etc.)\n- **\ud83d\udd17 Multi-Input Joins**: Combine data from multiple sources with flexible join operations\n- **\u26a1 Advanced Operations**: Map/reduce/filter with lambda functions and chained operations\n- **\ud83e\udde0 Dependency Resolution**: Automatic execution order planning for complex transformations\n- **\ud83d\udcc8 Time Series Support**: Lag operations and rolling window calculations\n- **\ud83d\udee1\ufe0f Safe Expressions**: Secure evaluation with whitelist-based security\n- **\ud83c\udfaf Data Validation**: Comprehensive validation rules with detailed error reporting\n- **\u2699\ufe0f CLI Interface**: Easy-to-use command-line tools for batch processing\n\n\n## Installation\n\n```bash\npip install datatidy\n```\n\nFor development installation:\n```bash\ngit clone https://github.com/your-repo/datatidy.git\ncd datatidy\npip install -e \".[dev]\"\n```\n\n## Quick Start\n\n### 1. Create a sample configuration\n\n```bash\ndatatidy sample config.yaml\n```\n\n### 2. Process your data\n\n```bash\ndatatidy process config.yaml -i input.csv -o output.csv\n```\n\n### 3. Or use programmatically\n\n```python\nfrom datatidy import DataTidy\n\n# Initialize with configuration\ndt = DataTidy('config.yaml')\n\n# Process data\nresult = dt.process_data('input.csv')\n\n# Save result\ndt.process_and_save('output.csv', 'input.csv')\n```\n\n## Configuration Structure\n\nDataTidy uses YAML configuration files to define data processing pipelines:\n\n```yaml\ninput:\n type: csv # csv, excel, database\n source: \"data/input.csv\" # file path or SQL query\n options:\n encoding: utf-8\n delimiter: \",\"\n\noutput:\n columns:\n user_id:\n source: \"id\" # Source column name\n type: int # Data type conversion\n validation:\n required: true\n min_value: 1\n \n full_name:\n source: \"name\"\n type: string\n transformation: \"str.title()\" # Python expression\n validation:\n required: true\n min_length: 2\n max_length: 100\n \n age_group:\n transformation: \"'adult' if age >= 18 else 'minor'\"\n type: string\n validation:\n allowed_values: [\"adult\", \"minor\"]\n\n filters:\n - condition: \"age >= 0\"\n action: keep\n\n sort:\n - column: user_id\n ascending: true\n\nglobal_settings:\n ignore_errors: false\n max_errors: 100\n```\n\n## Examples\n\n### Basic CSV Processing\n\n```python\nfrom datatidy import DataTidy\n\nconfig = {\n \"input\": {\n \"type\": \"csv\",\n \"source\": \"users.csv\"\n },\n \"output\": {\n \"columns\": {\n \"clean_name\": {\n \"source\": \"name\",\n \"transformation\": \"str.strip().title()\",\n \"type\": \"string\"\n },\n \"age_category\": {\n \"transformation\": \"'senior' if age > 65 else ('adult' if age >= 18 else 'minor')\",\n \"type\": \"string\"\n }\n }\n }\n}\n\ndt = DataTidy()\ndt.load_config(config)\nresult = dt.process_data()\nprint(result)\n```\n\n### Database Processing\n\n```yaml\ninput:\n type: database\n source: \n query: \"SELECT * FROM users WHERE active = true\"\n connection_string: \"postgresql://user:pass@localhost/db\"\n\noutput:\n columns:\n user_email:\n source: \"email\"\n type: string\n validation:\n pattern: \"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\\\.[a-zA-Z]{2,}$\"\n \n signup_date:\n source: \"created_at\"\n type: datetime\n format: \"%Y-%m-%d\"\n```\n\n### Excel Processing with Complex Transformations\n\n```yaml\ninput:\n type: excel\n source:\n path: \"sales_data.xlsx\"\n sheet_name: \"Q1_Sales\"\n options:\n header: 0\n skiprows: 2\n\noutput:\n columns:\n revenue_category:\n transformation: |\n 'high' if revenue > 100000 else (\n 'medium' if revenue > 50000 else 'low'\n )\n validation:\n allowed_values: [\"high\", \"medium\", \"low\"]\n \n formatted_date:\n source: \"sale_date\"\n type: datetime\n format: \"%Y-%m-%d\"\n \n clean_product_name:\n source: \"product\"\n transformation: \"str.strip().upper().replace('_', ' ')\"\n validation:\n min_length: 1\n max_length: 50\n\n filters:\n - condition: \"revenue > 0\"\n action: keep\n - condition: \"product != 'DELETED'\"\n action: keep\n```\n\n## Command Line Usage\n\n### Process Data\n```bash\n# Basic processing\ndatatidy process config.yaml\n\n# With input/output files\ndatatidy process config.yaml -i input.csv -o output.csv\n\n# Ignore validation errors\ndatatidy process config.yaml --ignore-errors\n```\n\n### Validate Configuration\n```bash\ndatatidy validate config.yaml\n```\n\n### Create Sample Configuration\n```bash\ndatatidy sample my_config.yaml\n```\n\n## Expression System\n\nDataTidy includes a safe expression parser that supports:\n\n### Basic Operations\n- Arithmetic: `+`, `-`, `*`, `/`, `//`, `%`, `**`\n- Comparison: `==`, `!=`, `<`, `<=`, `>`, `>=`\n- Logical: `and`, `or`, `not`\n- Membership: `in`, `not in`\n\n### Functions\n- Type conversion: `str()`, `int()`, `float()`, `bool()`\n- Math: `abs()`, `max()`, `min()`, `round()`\n- String methods: `upper()`, `lower()`, `strip()`, `replace()`, etc.\n\n### Examples\n```yaml\ntransformations:\n # Conditional expressions\n status: \"'active' if last_login_days < 30 else 'inactive'\"\n \n # String operations\n clean_name: \"name.strip().title()\"\n \n # Mathematical calculations\n bmi: \"weight / (height / 100) ** 2\"\n \n # Complex conditions\n risk_level: |\n 'high' if (age > 65 and income < 30000) else (\n 'medium' if age > 40 else 'low'\n )\n```\n\n## Validation Rules\n\nDataTidy supports comprehensive validation:\n\n```yaml\nvalidation:\n required: true # Field must not be null\n nullable: false # Field cannot be null\n min_value: 0 # Minimum numeric value\n max_value: 100 # Maximum numeric value\n min_length: 2 # Minimum string length\n max_length: 50 # Maximum string length\n pattern: \"^[A-Za-z]+$\" # Regex pattern\n allowed_values: [\"A\", \"B\"] # Whitelist of values\n```\n\n## Error Handling\n\n```python\ndt = DataTidy('config.yaml')\nresult = dt.process_data('input.csv')\n\n# Check for errors\nif dt.has_errors():\n for error in dt.get_errors():\n print(f\"Error: {error['message']}\")\n```\n\n## API Reference\n\n### DataTidy Class\n\n#### Methods\n- `load_config(config)`: Load configuration from file or dict\n- `process_data(data=None)`: Process data according to configuration\n- `process_and_save(output_path, data=None)`: Process and save data\n- `get_errors()`: Get list of processing errors\n- `has_errors()`: Check if errors occurred\n\n### Configuration Schema\n\nSee [Configuration Reference](docs/configuration.md) for complete schema documentation.\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests\n5. Submit a pull request\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Changelog\n\n### Version 0.1.0\n- Initial release\n- Basic CSV, Excel, and database support\n- Safe expression engine\n- Comprehensive validation system\n- CLI interface\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A powerful, configuration-driven data processing and cleaning package",
"version": "1.0.1",
"project_urls": {
"Bug Reports": "https://github.com/wwd1015/datatidy/issues",
"Changelog": "https://github.com/wwd1015/datatidy/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/wwd1015/datatidy#readme",
"Homepage": "https://github.com/wwd1015/datatidy",
"Repository": "https://github.com/wwd1015/datatidy"
},
"split_keywords": [
"data processing",
" data cleaning",
" etl",
" configuration-driven",
" pandas",
" data transformation"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "70a9669fc57207757b87f497703a59bdbdfdb7e6b801329cdba2ecd64faba26d",
"md5": "07082f9eb41acefd95c9453e5b3f1178",
"sha256": "36061ebb8269a842ec3195364dbd31044a3d79e97d6c48787d30e72d03fe95f9"
},
"downloads": -1,
"filename": "datatidy-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "07082f9eb41acefd95c9453e5b3f1178",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 28587,
"upload_time": "2025-07-29T01:42:48",
"upload_time_iso_8601": "2025-07-29T01:42:48.255610Z",
"url": "https://files.pythonhosted.org/packages/70/a9/669fc57207757b87f497703a59bdbdfdb7e6b801329cdba2ecd64faba26d/datatidy-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ad6f831d3cc847cb48d4bfd3c80a40b13bce50fed11a7a81a87bbc7595f64463",
"md5": "4b6cfc2d0530264ecc46ce1abdaf088b",
"sha256": "f1a4580ca54fca8b0250eb996aca5b38e3d81dade035f1fb16577a6bf35a88e0"
},
"downloads": -1,
"filename": "datatidy-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "4b6cfc2d0530264ecc46ce1abdaf088b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 74249,
"upload_time": "2025-07-29T01:42:49",
"upload_time_iso_8601": "2025-07-29T01:42:49.567841Z",
"url": "https://files.pythonhosted.org/packages/ad/6f/831d3cc847cb48d4bfd3c80a40b13bce50fed11a7a81a87bbc7595f64463/datatidy-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-29 01:42:49",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "wwd1015",
"github_project": "datatidy",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pandas",
"specs": [
[
">=",
"1.5.0"
]
]
},
{
"name": "PyYAML",
"specs": [
[
">=",
"6.0"
]
]
},
{
"name": "sqlalchemy",
"specs": [
[
">=",
"1.4.0"
]
]
},
{
"name": "openpyxl",
"specs": [
[
">=",
"3.0.0"
]
]
},
{
"name": "snowflake-sqlalchemy",
"specs": [
[
">=",
"1.4.0"
]
]
},
{
"name": "psycopg2-binary",
"specs": [
[
">=",
"2.9.0"
]
]
},
{
"name": "pymysql",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "jsonschema",
"specs": [
[
">=",
"4.0.0"
]
]
}
],
"lcname": "datatidy"
}