forklift-etl

Name: forklift-etl
Version: 0.1.4
Summary: A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support
Upload time: 2025-10-19 14:38:11
Homepage: https://github.com/cornyhorse/forklift
Requires Python: >=3.8
License: MIT (Copyright (c) 2025 cornyhorse)
Keywords: csv, data-processing, etl, excel, parquet, pyarrow, s3, schema-generation, validation
Requirements: pyarrow, charset-normalizer, jsonschema, openpyxl, xlrd, sqlalchemy, chardet, boto3, smart-open, python-dateutil, pytz, pytest, pytest-cov, ruff, mypy, build, twine, mattstash, python-dotenv
# Forklift

A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support.

![Forklift Logo](FORKLIFT.png)

## Overview

Forklift is a comprehensive data processing tool that provides:

- **High-performance data import** with PyArrow streaming for CSV, Excel, FWF, and SQL sources
- **Intelligent schema generation** that analyzes your data and creates standardized schema definitions  
- **Robust validation** with configurable error handling and constraint validation
- **S3 streaming support** for both input and output operations
- **Multiple output formats** including Parquet, with comprehensive metadata and manifests

## Key Features

### 🚀 **Data Import & Processing**
- Stream large files efficiently with PyArrow
- Support for CSV, Excel, Fixed-Width Files (FWF), and SQL sources
- Configurable batch processing with memory optimization
- Comprehensive validation with detailed error reporting
- S3 integration for cloud-native workflows

### 🔍 **Schema Generation**
- **Intelligent schema inference** from data analysis
- **Privacy-first approach** - no sensitive sample data included by default
- **Multiple file format support** - CSV, Excel, Parquet
- **Flexible output options** - stdout, file, or clipboard
- **Standards-compliant schemas** following JSON Schema with Forklift extensions

### 🛡️ **Validation & Quality**
- JSON Schema validation with custom extensions
- Primary key inference and enforcement
- Constraint validation (unique, not-null, primary key)
- Data type validation and conversion
- Configurable error handling modes (fail-fast, fail-complete, bad-rows)
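
Since Forklift schemas follow standard JSON Schema, ordinary JSON Schema tooling can work with them. As a rough illustration of the baseline format (the field names below are invented, and Forklift's own extension keywords for constraints are not shown; see the Schema Standards doc), a plain schema can be validated with the `jsonschema` library that Forklift already depends on:

```python
import jsonschema

# Illustrative schema using only standard JSON Schema keywords.
# Forklift layers its constraint extensions (primary key, unique,
# not-null) on top of this format; see docs/SCHEMA_STANDARDS.md.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "signup_date": {"type": "string", "format": "date"},
    },
    "required": ["id", "email"],
}

# Passes silently on success, raises ValidationError on failure.
jsonschema.validate({"id": 1, "email": "a@example.com"}, schema)
```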

## Installation

```bash
pip install forklift-etl
```

### Optional Dependencies

```bash
# For Excel support
pip install openpyxl

# For clipboard functionality
pip install pyperclip
```

## Quick Start

### Data Import

```python
from forklift import import_csv

# Import CSV to Parquet with validation
results = import_csv(
    source="data.csv",
    destination="./output/",
    schema_path="schema.json"
)

print("Import completed successfully!")
```
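
The same call should reach the cloud as well. The CLI examples below pass `s3://` URIs directly, and assuming `import_csv` accepts them for both `source` and `destination` (an assumption, not confirmed here), an S3-to-S3 run would look like:

```python
from forklift import import_csv

# Assumes s3:// URIs are accepted here, mirroring the
# `forklift ingest s3://...` CLI example further below.
results = import_csv(
    source="s3://my-bucket/raw/data.csv",
    destination="s3://my-bucket/curated/",
    schema_path="schema.json"
)
```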

### Schema Generation

```python
import forklift

# Generate schema from CSV (analyzes entire file by default)
schema = forklift.generate_schema_from_csv("data.csv")

# Generate with limited row analysis
schema = forklift.generate_schema_from_csv("data.csv", nrows=1000)

# Save schema to file
forklift.generate_and_save_schema(
    input_path="data.csv",
    output_path="schema.json",
    file_type="csv"
)

# Generate with primary key inference
schema = forklift.generate_schema_from_csv(
    "data.csv", 
    infer_primary_key_from_metadata=True
)
```
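
The returned schema can be reviewed or hand-tuned before use. Assuming `generate_schema_from_csv` returns a plain dict (an assumption; the exact return type is in the API Reference), the standard library is enough:

```python
import json

# Pretty-print the inferred schema for review.
print(json.dumps(schema, indent=2))

# Hand-tune, then persist it for use with import_csv or the CLI.
with open("schema.json", "w") as fh:
    json.dump(schema, fh, indent=2)
```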

### Reading Data for Analysis

```python
import forklift

# Read CSV into DataFrame for analysis
df = forklift.read_csv("data.csv")

# Read Excel with specific sheet
df = forklift.read_excel("data.xlsx", sheet_name="Sheet1")

# Read Fixed-Width File with schema
df = forklift.read_fwf("data.txt", schema_path="fwf_schema.json")
```
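
These readers are described as returning a DataFrame; assuming that means a pandas DataFrame (an assumption, not confirmed above), the usual quick profiling applies before generating a schema:

```python
# Quick profiling, assuming df is a pandas DataFrame.
print(df.shape)   # rows x columns
print(df.dtypes)  # inferred column types
print(df.head())  # first few rows
```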

## CLI Usage

### Data Import

```bash
# Import CSV with schema validation
forklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json

# Import from S3
forklift ingest s3://bucket/data.csv --dest s3://bucket/output/ --input-kind csv

# Import Excel file
forklift ingest data.xlsx --dest ./output/ --input-kind excel --sheet "Sheet1"

# Import Fixed-Width File
forklift ingest data.txt --dest ./output/ --input-kind fwf --fwf-spec schema.json
```

### Schema Generation

```bash
# Generate schema from CSV (analyzes entire file by default)
forklift generate-schema data.csv --file-type csv

# Generate with limited row analysis
forklift generate-schema data.csv --file-type csv --nrows 1000

# Save to file
forklift generate-schema data.csv --file-type csv --output file --output-path schema.json

# Include sample data for development (explicit opt-in)
forklift generate-schema data.csv --file-type csv --include-sample

# Copy to clipboard
forklift generate-schema data.csv --file-type csv --output clipboard

# Excel files
forklift generate-schema data.xlsx --file-type excel --sheet "Sheet1"

# Parquet files
forklift generate-schema data.parquet --file-type parquet

# With primary key inference
forklift generate-schema data.csv --file-type csv --infer-primary-key
```
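
The two commands compose into a simple generate-then-ingest workflow. A sketch using only the flags shown above (file names are placeholders):

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Infer a schema from the raw file and save it.
forklift generate-schema data.csv --file-type csv \
  --output file --output-path schema.json

# 2. Review or hand-tune schema.json here if needed.

# 3. Ingest the file, validating against the saved schema.
forklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json
```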

## Core Components

- **Import Engine**: High-performance data processing with PyArrow
- **Schema Generator**: Intelligent schema inference and generation
- **Validation System**: Constraint validation and error handling
- **Processors**: Pluggable data transformation components
- **I/O Operations**: S3 and local file system support

## Documentation

For detailed documentation, see the [`docs/`](docs/) directory:

- **[Usage Guide](docs/USAGE.md)** - Comprehensive usage examples and workflows
- **[Schema Standards](docs/SCHEMA_STANDARDS.md)** - JSON Schema format and extensions
- **[API Reference](docs/API_REFERENCE.md)** - Complete API documentation
- **[Constraint Validation](docs/CONSTRAINT_VALIDATION_IMPLEMENTATION.md)** - Validation features
- **[S3 Integration](docs/S3_TESTING.md)** - S3 usage and testing

## Examples

See the [`examples/`](examples/) directory for comprehensive examples:

- **[getting_started.py](examples/getting_started.py)** - **Start here!** A complete introduction to CSV processing: basic usage, full schema validation, and passthrough mode for processing a subset of columns
- **calculated_columns_demo.py** - Calculated columns functionality
- **constraint_validation_demo.py** - Constraint validation examples
- **validation_demo.py** - Data validation with bad rows handling
- **datetime_features_example.py** - Date/time processing examples
- And more...

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

