# BulkFlow
A high-performance Python package for efficiently loading large CSV datasets into PostgreSQL databases. Features chunked processing, automatic resume capability, and comprehensive error handling.
## Key Features
- 🚀 **High Performance**: Optimized chunk-based processing for handling large datasets efficiently
- 🔄 **Resume Capability**: Automatically resume interrupted imports from the last successful position
- 🛡️ **Error Resilience**: Comprehensive error handling with detailed logging and failed row tracking
- 🔍 **Data Validation**: Preview data before import and validate row structure
- 📊 **Progress Tracking**: Real-time progress updates with ETA and processing speed
- 🔄 **Duplicate Handling**: Smart handling of duplicate records
- 🔌 **Connection Pooling**: Efficient database connection management
- 📝 **Detailed Logging**: Comprehensive logging of all operations and errors
## Installation
```bash
pip install bulkflow
```
## Quick Start
1. Create a database configuration file (`db_config.json`):
```json
{
  "dbname": "your_database",
  "user": "your_username",
  "password": "your_password",
  "host": "localhost",
  "port": "5432"
}
```
2. Run the import:
```bash
bulkflow path/to/your/file.csv your_table_name
```
## Project Structure
```
bulkflow/
├── src/
│   ├── models/       # Data models
│   ├── processors/   # Core processing logic
│   ├── database/     # Database operations
│   └── utils/        # Utility functions
```
## Usage Examples
### Basic Usage
```python
from bulkflow import process_file
db_config = {
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}

file_path = "path/to/your/file.csv"   # CSV file to import
table_name = "your_table_name"        # Target PostgreSQL table

process_file(file_path, db_config, table_name)
```
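Rather than hardcoding credentials, you can load the same `db_config.json` created in the Quick Start. A minimal sketch using only the standard library:

```python
import json

from bulkflow import process_file

# Load the database configuration file created in the Quick Start step
with open("db_config.json") as f:
    db_config = json.load(f)

process_file("path/to/your/file.csv", db_config, "your_table_name")
```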
### CLI Usage
```bash
# Basic usage
bulkflow data.csv target_table
# Custom config file
bulkflow data.csv target_table --config my_config.json
```
## Error Handling
BulkFlow provides comprehensive error handling:
1. **Failed Rows File**: `failed_rows_YYYYMMDD_HHMMSS.csv`
- Records individual row failures
- Includes row number, content, error reason, and timestamp
2. **Import State File**: `import_state.json`
- Tracks overall import progress
- Enables resume capability
- Records failed chunk information
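If a run produces a failed-rows file, you can inspect it with the standard library's `csv` module. This is a small sketch; the column layout (row number, content, error reason, timestamp) is assumed from the description above and may differ from the actual header:

```python
import csv
import glob

# Pick the most recent failed_rows_*.csv produced by an import run, if any
matches = sorted(glob.glob("failed_rows_*.csv"))
if matches:
    with open(matches[-1], newline="") as f:
        for row in csv.DictReader(f):
            # Columns described above: row number, content, error reason, timestamp
            print(row)
```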
## Performance Optimization
BulkFlow automatically optimizes performance by:
- Calculating optimal chunk sizes based on available memory
- Using connection pooling for database operations
- Implementing efficient duplicate handling strategies
- Minimizing memory usage through streaming processing
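For illustration only, the sketch below shows the general chunk-and-stream insert pattern the points above describe, using `psycopg2.extras.execute_values`. It is not BulkFlow's internal implementation; the fixed chunk size and placeholder table name are assumptions.

```python
import csv

import psycopg2
from psycopg2.extras import execute_values

CHUNK_SIZE = 10_000  # BulkFlow sizes chunks from available memory; fixed here for simplicity


def load_in_chunks(csv_path, conn, table_name):
    """Stream a CSV file into PostgreSQL one chunk at a time."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = ", ".join(header)
        insert_sql = f"INSERT INTO {table_name} ({columns}) VALUES %s"

        chunk = []
        with conn.cursor() as cur:
            for row in reader:
                chunk.append(row)
                if len(chunk) >= CHUNK_SIZE:
                    execute_values(cur, insert_sql, chunk)
                    conn.commit()
                    chunk.clear()
            if chunk:  # flush the final partial chunk
                execute_values(cur, insert_sql, chunk)
                conn.commit()


# Example usage (placeholder connection parameters):
# conn = psycopg2.connect(dbname="your_database", user="your_username",
#                         password="your_password", host="localhost", port="5432")
# load_in_chunks("data.csv", conn, "target_table")
```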
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Inspired by the need for robust, production-ready data import solutions
- Built with modern Python best practices
- Designed for real-world use cases and large-scale data processing
## Support
If you encounter any issues or have questions:
1. Check the [Issues](https://github.com/clwillingham/bulkflow/issues) page
2. Create a new issue if your problem isn't already listed
3. Provide as much context as possible in your issue description
4. Try to fix the issue yourself and submit a Pull Request if you can
## Author
Created and maintained by [Chris Willingham](https://github.com/clwillingham)
## AI Contribution
The majority of this project's code was generated using AI assistance, specifically:
- [Cline](https://github.com/cline/cline) - AI coding assistant
- Claude 3.5 Sonnet (new) - Large language model by Anthropic
- In fact... the entire project was generated by AI. I'm kinda freaking out right now!
- Even the name was generated by AI... I'm not sure if I count as the author. All hail our robot overlords!