# BulkFlow

- **Version**: 0.1.4 (uploaded 2024-11-12 05:54:22)
- **Homepage**: https://github.com/clwillingham/bulkflow
- **Author**: Chris
- **Requires Python**: >=3.7
- **Keywords**: postgresql, csv, data-loading, etl, database, bulk-import, data-processing

A high-performance Python package for efficiently loading large CSV datasets into PostgreSQL databases. Features chunked processing, automatic resume capability, and comprehensive error handling.

## Key Features

- 🚀 **High Performance**: Optimized chunk-based processing for handling large datasets efficiently
- 🔄 **Resume Capability**: Automatically resume interrupted imports from the last successful position
- 🛡️ **Error Resilience**: Comprehensive error handling with detailed logging and failed row tracking
- 🔍 **Data Validation**: Preview data before import and validate row structure
- 📊 **Progress Tracking**: Real-time progress updates with ETA and processing speed
- 🔄 **Duplicate Handling**: Smart handling of duplicate records
- 🔌 **Connection Pooling**: Efficient database connection management
- 📝 **Detailed Logging**: Comprehensive logging of all operations and errors

## Installation

```bash
pip install bulkflow
```

## Quick Start

1. Create a database configuration file (`db_config.json`):
```json
{
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}
```

2. Run the import:
```bash
bulkflow path/to/your/file.csv your_table_name
```

## Project Structure

```
bulkflow/
├── src/
│   ├── models/          # Data models
│   ├── processors/      # Core processing logic
│   ├── database/        # Database operations
│   └── utils/           # Utility functions
```

## Usage Examples

### Basic Usage

```python
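from bulkflow import process_file

# Connection settings; same keys as db_config.json
db_config = {
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}

# Placeholder values; replace with your CSV path and target table
file_path = "path/to/your/file.csv"
table_name = "your_table_name"

process_file(file_path, db_config, table_name)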
```
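
If the settings already live in a `db_config.json` file like the one in Quick Start, they can be loaded from disk instead of being repeated inline. The helper below is a minimal sketch and not part of BulkFlow: `load_db_config` and `REQUIRED_KEYS` are hypothetical names, and treating all five keys as required is an assumption based only on the example config.

```python
import json
from pathlib import Path

# Keys shown in the Quick Start example; treating all of them as
# required is an assumption, not documented BulkFlow behavior.
REQUIRED_KEYS = ("dbname", "user", "password", "host", "port")


def load_db_config(path):
    """Read a JSON config file and check that the expected keys exist."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing config keys: {', '.join(missing)}")
    return config
```

The returned dictionary can then be passed along, for example `process_file("data.csv", load_db_config("db_config.json"), "target_table")`.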

### CLI Usage

```bash
# Basic usage
bulkflow data.csv target_table

# Custom config file
bulkflow data.csv target_table --config my_config.json
```

## Error Handling

BulkFlow provides comprehensive error handling:

1. **Failed Rows File**: `failed_rows_YYYYMMDD_HHMMSS.csv`
   - Records individual row failures
   - Includes row number, content, error reason, and timestamp

2. **Import State File**: `import_state.json`
   - Tracks overall import progress
   - Enables resume capability
   - Records failed chunk information
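
To illustrate how a state file can enable resuming, here is a minimal sketch of saving and reloading progress. It is not BulkFlow's implementation: the function names and the `last_row` and `failed_chunks` fields are hypothetical, and the real layout of `import_state.json` may differ.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh one if no state file exists."""
    if not os.path.exists(path):
        # Hypothetical schema: last committed row plus failed chunk info
        return {"last_row": 0, "failed_chunks": []}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def save_state(path, state):
    """Write to a temporary file first, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    # Replace the old file in one step so readers never see a partial write
    os.replace(tmp_path, path)
```

In this sketch an importer would call `load_state` at startup, skip rows up to `last_row`, and call `save_state` after each chunk is committed.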

## Performance Optimization

BulkFlow automatically optimizes performance by:

- Calculating optimal chunk sizes based on available memory
- Using connection pooling for database operations
- Implementing efficient duplicate handling strategies
- Minimizing memory usage through streaming processing
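
The streaming idea can be sketched with the standard library alone. The generator below reads a CSV in fixed-size chunks so that only one chunk is held in memory at a time. It is an illustration under simplifying assumptions, not BulkFlow's code: `iter_chunks` is a hypothetical name, and the chunk size is passed in rather than derived from available memory.

```python
import csv
from itertools import islice


def iter_chunks(csv_path, chunk_size):
    """Yield (header, rows) pairs with at most chunk_size rows each."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return  # empty file: nothing to yield
        while True:
            # islice pulls the next chunk_size rows without reading ahead
            rows = list(islice(reader, chunk_size))
            if not rows:
                break
            yield header, rows
```

Each yielded chunk could then be inserted in its own transaction, which keeps a failure confined to that chunk.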

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Inspired by the need for robust, production-ready data import solutions
- Built with modern Python best practices
- Designed for real-world use cases and large-scale data processing

## Support

If you encounter any issues or have questions:

1. Check the [Issues](https://github.com/clwillingham/bulkflow/issues) page
2. Create a new issue if your problem isn't already listed
3. Provide as much context as possible in your issue description
4. Try to fix the issue yourself and submit a Pull Request if you can

## Author

Created and maintained by [Chris Willingham](https://github.com/clwillingham)

## AI Contribution

The majority of this project's code was generated using AI assistance, specifically:
- [Cline](https://github.com/cline/cline) - AI coding assistant
- Claude 3.5 Sonnet (new) - Large language model by Anthropic
- In fact, the entire project was generated by AI. I'm kinda freaking out right now!
- Even the name was generated by AI. I'm not sure if I count as the author. All hail our robot overlords!

            
