| Name | forklift-etl |
| Version | 0.1.4 |
| download | https://files.pythonhosted.org/packages/25/f3/71fb75f8b128ee8bc0d47b9f1b4a4c99fc23926fa10a8184e5777f08a13e/forklift_etl-0.1.4.tar.gz |
| home_page | None |
| Summary | A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support |
| upload_time | 2025-10-19 14:38:11 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | MIT License |
| keywords | csv, data-processing, etl, excel, parquet, pyarrow, s3, schema-generation, validation |
| VCS | https://github.com/cornyhorse/forklift |
| bugtrack_url | None |
| requirements | pyarrow, charset-normalizer, jsonschema, openpyxl, xlrd, sqlalchemy, chardet, boto3, smart-open, python-dateutil, pytz, pytest, pytest-cov, ruff, mypy, build, twine, mattstash, python-dotenv |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Forklift
A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support.

## Overview
Forklift is a comprehensive data processing tool that provides:
- **High-performance data import** with PyArrow streaming for CSV, Excel, FWF, and SQL sources
- **Intelligent schema generation** that analyzes your data and creates standardized schema definitions
- **Robust validation** with configurable error handling and constraint validation
- **S3 streaming support** for both input and output operations
- **Multiple output formats** including Parquet, with comprehensive metadata and manifests
## Key Features
### 🚀 **Data Import & Processing**
- Stream large files efficiently with PyArrow
- Support for CSV, Excel, Fixed-Width Files (FWF), and SQL sources
- Configurable batch processing with memory optimization
- Comprehensive validation with detailed error reporting
- S3 integration for cloud-native workflows
### 🔍 **Schema Generation**
- **Intelligent schema inference** from data analysis
- **Privacy-first approach** - no sensitive sample data included by default
- **Multiple file format support** - CSV, Excel, Parquet
- **Flexible output options** - stdout, file, or clipboard
- **Standards-compliant schemas** following JSON Schema with Forklift extensions
### 🛡️ **Validation & Quality**
- JSON Schema validation with custom extensions (see the schema sketch after this list)
- Primary key inference and enforcement
- Constraint validation (unique, not-null, primary key)
- Data type validation and conversion
- Configurable error handling modes (fail-fast, fail-complete, bad-rows)
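To make these checks concrete, here is a minimal sketch of a `schema.json` that the import examples below could validate against. It sticks to standard JSON Schema keywords; the column names are hypothetical, and Forklift's own extension keys (covered in the Schema Standards doc) are omitted:

```python
import json

# Hypothetical minimal schema: standard JSON Schema keywords only.
# Forklift-specific extension keys (documented in docs/SCHEMA_STANDARDS.md)
# are omitted from this sketch.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "created_at": {"type": "string", "format": "date-time"},
    },
    "required": ["id", "name"],
}

with open("schema.json", "w", encoding="utf-8") as f:
    json.dump(schema, f, indent=2)
```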
## Installation
```bash
pip install forklift-etl
```
### Optional Dependencies
```bash
# For Excel support
pip install openpyxl
# For clipboard functionality
pip install pyperclip
```
## Quick Start
### Data Import
```python
from forklift import import_csv

# Import CSV to Parquet with validation
results = import_csv(
    source="data.csv",
    destination="./output/",
    schema_path="schema.json",
)

print("Import completed successfully!")
```
### Schema Generation
```python
import forklift

# Generate schema from CSV (analyzes entire file by default)
schema = forklift.generate_schema_from_csv("data.csv")

# Generate with limited row analysis
schema = forklift.generate_schema_from_csv("data.csv", nrows=1000)

# Save schema to file
forklift.generate_and_save_schema(
    input_path="data.csv",
    output_path="schema.json",
    file_type="csv",
)

# Generate with primary key inference
schema = forklift.generate_schema_from_csv(
    "data.csv",
    infer_primary_key_from_metadata=True,
)
```
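For a quick look at what was inferred, the result can be pretty-printed before saving; this assumes the returned schema is a plain JSON-serializable dict:

```python
import json

import forklift

# Inspect the inferred schema before committing it to a file.
# Assumes the return value is a JSON-serializable dict.
schema = forklift.generate_schema_from_csv("data.csv", nrows=1000)
print(json.dumps(schema, indent=2))
```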
### Reading Data for Analysis
```python
import forklift

# Read CSV into DataFrame for analysis
df = forklift.read_csv("data.csv")

# Read Excel with specific sheet
df = forklift.read_excel("data.xlsx", sheet_name="Sheet1")

# Read Fixed-Width File with schema
df = forklift.read_fwf("data.txt", schema_path="fwf_schema.json")
```
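Once loaded, the usual exploratory checks apply; the snippet below assumes the readers return a pandas-style DataFrame, which the `df` naming above suggests:

```python
import forklift

df = forklift.read_csv("data.csv")

# Quick profile before generating a schema or importing.
print(df.head())                    # first rows
print(df.dtypes)                    # inferred column types
print(df.describe(include="all"))   # summary statistics
```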
## CLI Usage
### Data Import
```bash
# Import CSV with schema validation
forklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json

# Import from S3
forklift ingest s3://bucket/data.csv --dest s3://bucket/output/ --input-kind csv

# Import Excel file
forklift ingest data.xlsx --dest ./output/ --input-kind excel --sheet "Sheet1"

# Import Fixed-Width File
forklift ingest data.txt --dest ./output/ --input-kind fwf --fwf-spec schema.json
```
### Schema Generation
```bash
# Generate schema from CSV (analyzes entire file by default)
forklift generate-schema data.csv --file-type csv

# Generate with limited row analysis
forklift generate-schema data.csv --file-type csv --nrows 1000

# Save to file
forklift generate-schema data.csv --file-type csv --output file --output-path schema.json

# Include sample data for development (explicit opt-in)
forklift generate-schema data.csv --file-type csv --include-sample

# Copy to clipboard
forklift generate-schema data.csv --file-type csv --output clipboard

# Excel files
forklift generate-schema data.xlsx --file-type excel --sheet "Sheet1"

# Parquet files
forklift generate-schema data.parquet --file-type parquet

# With primary key inference
forklift generate-schema data.csv --file-type csv --infer-primary-key
```
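The CLI mirrors the Python API, so a typical two-step workflow, inferring a schema and then validating an import against it, can be sketched by combining the calls shown earlier:

```python
import forklift
from forklift import import_csv

# Step 1: infer a schema from the raw CSV and persist it.
forklift.generate_and_save_schema(
    input_path="data.csv",
    output_path="schema.json",
    file_type="csv",
)

# Step 2: import the same CSV to Parquet, validating against that schema.
results = import_csv(
    source="data.csv",
    destination="./output/",
    schema_path="schema.json",
)
```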
## Core Components
- **Import Engine**: High-performance data processing with PyArrow
- **Schema Generator**: Intelligent schema inference and generation
- **Validation System**: Constraint validation and error handling
- **Processors**: Pluggable data transformation components
- **I/O Operations**: S3 and local file system support
## Documentation
For detailed documentation, see the [`docs/`](docs/) directory:
- **[Usage Guide](docs/USAGE.md)** - Comprehensive usage examples and workflows
- **[Schema Standards](docs/SCHEMA_STANDARDS.md)** - JSON Schema format and extensions
- **[API Reference](docs/API_REFERENCE.md)** - Complete API documentation
- **[Constraint Validation](docs/CONSTRAINT_VALIDATION_IMPLEMENTATION.md)** - Validation features
- **[S3 Integration](docs/S3_TESTING.md)** - S3 usage and testing
## Examples
See the [`examples/`](examples/) directory for comprehensive examples:
- **[getting_started.py](examples/getting_started.py)** - **Start here!** Complete introduction to CSV processing with schema validation, including basic usage, complete schema validation, and passthrough mode for processing subsets of columns
- **calculated_columns_demo.py** - Calculated columns functionality
- **constraint_validation_demo.py** - Constraint validation examples
- **validation_demo.py** - Data validation with bad rows handling
- **datetime_features_example.py** - Date/time processing examples
- And more...
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Raw data
{
"_id": null,
"home_page": null,
"name": "forklift-etl",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Matt <matt@mattkingsbury.com>",
"keywords": "csv, data-processing, etl, excel, parquet, pyarrow, s3, schema-generation, validation",
"author": null,
"author_email": "Matt <matt@mattkingsbury.com>",
"download_url": "https://files.pythonhosted.org/packages/25/f3/71fb75f8b128ee8bc0d47b9f1b4a4c99fc23926fa10a8184e5777f08a13e/forklift_etl-0.1.4.tar.gz",
"platform": null,
"description": "# Forklift\n\nA powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support.\n\n\n\n## Overview\n\nForklift is a comprehensive data processing tool that provides:\n\n- **High-performance data import** with PyArrow streaming for CSV, Excel, FWF, and SQL sources\n- **Intelligent schema generation** that analyzes your data and creates standardized schema definitions \n- **Robust validation** with configurable error handling and constraint validation\n- **S3 streaming support** for both input and output operations\n- **Multiple output formats** including Parquet, with comprehensive metadata and manifests\n\n## Key Features\n\n### \ud83d\ude80 **Data Import & Processing**\n- Stream large files efficiently with PyArrow\n- Support for CSV, Excel, Fixed-Width Files (FWF), and SQL sources\n- Configurable batch processing with memory optimization\n- Comprehensive validation with detailed error reporting\n- S3 integration for cloud-native workflows\n\n### \ud83d\udd0d **Schema Generation**\n- **Intelligent schema inference** from data analysis\n- **Privacy-first approach** - no sensitive sample data included by default\n- **Multiple file format support** - CSV, Excel, Parquet\n- **Flexible output options** - stdout, file, or clipboard\n- **Standards-compliant schemas** following JSON Schema with Forklift extensions\n\n### \ud83d\udee1\ufe0f **Validation & Quality**\n- JSON Schema validation with custom extensions\n- Primary key inference and enforcement\n- Constraint validation (unique, not-null, primary key)\n- Data type validation and conversion\n- Configurable error handling modes (fail-fast, fail-complete, bad-rows)\n\n## Installation\n\n```bash\npip install forklift\n```\n\n### Optional Dependencies\n\n```bash\n# For Excel support\npip install openpyxl\n\n# For clipboard functionality\npip install pyperclip\n```\n\n## Quick Start\n\n### Data Import\n\n```python\nimport forklift\n\n# Import CSV to Parquet with validation\nfrom forklift import import_csv\n\nresults = import_csv(\n source=\"data.csv\",\n destination=\"./output/\",\n schema_path=\"schema.json\"\n)\n\nprint(f\"Import completed successfully!\")\n```\n\n### Schema Generation\n\n```python\nimport forklift\n\n# Generate schema from CSV (analyzes entire file by default)\nschema = forklift.generate_schema_from_csv(\"data.csv\")\n\n# Generate with limited row analysis\nschema = forklift.generate_schema_from_csv(\"data.csv\", nrows=1000)\n\n# Save schema to file\nforklift.generate_and_save_schema(\n input_path=\"data.csv\",\n output_path=\"schema.json\",\n file_type=\"csv\"\n)\n\n# Generate with primary key inference\nschema = forklift.generate_schema_from_csv(\n \"data.csv\", \n infer_primary_key_from_metadata=True\n)\n```\n\n### Reading Data for Analysis\n\n```python\nimport forklift\n\n# Read CSV into DataFrame for analysis\ndf = forklift.read_csv(\"data.csv\")\n\n# Read Excel with specific sheet\ndf = forklift.read_excel(\"data.xlsx\", sheet_name=\"Sheet1\")\n\n# Read Fixed-Width File with schema\ndf = forklift.read_fwf(\"data.txt\", schema_path=\"fwf_schema.json\")\n```\n\n## CLI Usage\n\n### Data Import\n\n```bash\n# Import CSV with schema validation\nforklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json\n\n# Import from S3\nforklift ingest s3://bucket/data.csv --dest s3://bucket/output/ --input-kind csv\n\n# Import Excel file\nforklift ingest data.xlsx --dest ./output/ --input-kind excel --sheet \"Sheet1\"\n\n# Import Fixed-Width File\nforklift ingest 
data.txt --dest ./output/ --input-kind fwf --fwf-spec schema.json\n```\n\n### Schema Generation\n\n```bash\n# Generate schema from CSV (analyzes entire file by default)\nforklift generate-schema data.csv --file-type csv\n\n# Generate with limited row analysis\nforklift generate-schema data.csv --file-type csv --nrows 1000\n\n# Save to file\nforklift generate-schema data.csv --file-type csv --output file --output-path schema.json\n\n# Include sample data for development (explicit opt-in)\nforklift generate-schema data.csv --file-type csv --include-sample\n\n# Copy to clipboard\nforklift generate-schema data.csv --file-type csv --output clipboard\n\n# Excel files\nforklift generate-schema data.xlsx --file-type excel --sheet \"Sheet1\"\n\n# Parquet files\nforklift generate-schema data.parquet --file-type parquet\n\n# With primary key inference\nforklift generate-schema data.csv --file-type csv --infer-primary-key\n```\n\n## Core Components\n\n- **Import Engine**: High-performance data processing with PyArrow\n- **Schema Generator**: Intelligent schema inference and generation\n- **Validation System**: Constraint validation and error handling\n- **Processors**: Pluggable data transformation components\n- **I/O Operations**: S3 and local file system support\n\n## Documentation\n\nFor detailed documentation, see the [`docs/`](docs/) directory:\n\n- **[Usage Guide](docs/USAGE.md)** - Comprehensive usage examples and workflows\n- **[Schema Standards](docs/SCHEMA_STANDARDS.md)** - JSON Schema format and extensions\n- **[API Reference](docs/API_REFERENCE.md)** - Complete API documentation\n- **[Constraint Validation](docs/CONSTRAINT_VALIDATION_IMPLEMENTATION.md)** - Validation features\n- **[S3 Integration](docs/S3_TESTING.md)** - S3 usage and testing\n\n## Examples\n\nSee the [`examples/`](examples/) directory for comprehensive examples:\n\n- **[getting_started.py](examples/getting_started.py)** - **Start here!** Complete introduction to CSV processing with schema validation, including basic usage, complete schema validation, and passthrough mode for processing subsets of columns\n- **calculated_columns_demo.py** - Calculated columns functionality\n- **constraint_validation_demo.py** - Constraint validation examples\n- **validation_demo.py** - Data validation with bad rows handling\n- **datetime_features_example.py** - Date/time processing examples\n- And more...\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests for new functionality\n5. Run the test suite\n6. Submit a pull request\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n",
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2025 cornyhorse\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.",
"summary": "A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support",
"version": "0.1.4",
"project_urls": {
"Bug Tracker": "https://github.com/cornyhorse/forklift/issues",
"Changelog": "https://github.com/cornyhorse/forklift/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/cornyhorse/forklift/blob/main/docs/",
"Homepage": "https://github.com/cornyhorse/forklift",
"Repository": "https://github.com/cornyhorse/forklift"
},
"split_keywords": [
"csv",
" data-processing",
" etl",
" excel",
" parquet",
" pyarrow",
" s3",
" schema-generation",
" validation"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "fb660b90df3d9ceb6a7e1923e48dfcdb0eb059b2b316e6801ef786cd3fbeca69",
"md5": "cf11fd7376aa41408160549ab4f837e1",
"sha256": "837e9440b773b80272856f4ac08bfcf7124af51c0ee3f6347291cf1b28432763"
},
"downloads": -1,
"filename": "forklift_etl-0.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cf11fd7376aa41408160549ab4f837e1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 313798,
"upload_time": "2025-10-19T14:38:10",
"upload_time_iso_8601": "2025-10-19T14:38:10.101638Z",
"url": "https://files.pythonhosted.org/packages/fb/66/0b90df3d9ceb6a7e1923e48dfcdb0eb059b2b316e6801ef786cd3fbeca69/forklift_etl-0.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "25f371fb75f8b128ee8bc0d47b9f1b4a4c99fc23926fa10a8184e5777f08a13e",
"md5": "ddcdbed4f6e30180946780b2d3599b70",
"sha256": "528fd58277dbfff59a2388198390f7de56f3739a16835e7bd48bb511fa3bcd33"
},
"downloads": -1,
"filename": "forklift_etl-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "ddcdbed4f6e30180946780b2d3599b70",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 301044,
"upload_time": "2025-10-19T14:38:11",
"upload_time_iso_8601": "2025-10-19T14:38:11.418960Z",
"url": "https://files.pythonhosted.org/packages/25/f3/71fb75f8b128ee8bc0d47b9f1b4a4c99fc23926fa10a8184e5777f08a13e/forklift_etl-0.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-19 14:38:11",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "cornyhorse",
"github_project": "forklift",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pyarrow",
"specs": [
[
">=",
"15"
],
[
"<",
"18"
]
]
},
{
"name": "charset-normalizer",
"specs": [
[
">=",
"3.3"
]
]
},
{
"name": "jsonschema",
"specs": [
[
">=",
"4.22"
]
]
},
{
"name": "openpyxl",
"specs": [
[
">=",
"3.1"
]
]
},
{
"name": "xlrd",
"specs": [
[
">=",
"2.0"
]
]
},
{
"name": "sqlalchemy",
"specs": [
[
"==",
"2.0.43"
]
]
},
{
"name": "chardet",
"specs": []
},
{
"name": "boto3",
"specs": [
[
">=",
"1.34"
]
]
},
{
"name": "smart-open",
"specs": [
[
">=",
"7.0"
]
]
},
{
"name": "python-dateutil",
"specs": [
[
">=",
"2.9"
]
]
},
{
"name": "pytz",
"specs": [
[
">=",
"2024.1"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"8.0"
]
]
},
{
"name": "pytest-cov",
"specs": [
[
">=",
"5.0"
]
]
},
{
"name": "ruff",
"specs": [
[
">=",
"0.6"
]
]
},
{
"name": "mypy",
"specs": [
[
">=",
"1.10"
]
]
},
{
"name": "build",
"specs": [
[
">=",
"1.2"
]
]
},
{
"name": "twine",
"specs": [
[
">=",
"5.1"
]
]
},
{
"name": "mattstash",
"specs": []
},
{
"name": "python-dotenv",
"specs": []
}
],
"lcname": "forklift-etl"
}