cloudbulkupload

Name: cloudbulkupload
Version: 2.0.0
Home page: https://github.com/dynamicdeploy/cloudbulkupload
Summary: Python package for fast and parallel transferring a bulk of files to S3, Azure Blob Storage, and Google Cloud Storage
Upload time: 2025-08-12 21:22:41
Author: Dynamic Deploy. Credit: Amir Masoud Sefidian
Requires Python: >=3.11.0
Keywords: Boto3, S3, Azure, Google Cloud, Blob Storage, Parallel, Multi-thread, Bulk, Boto, Bulk Boto, Bulk Boto3, Simple Storage Service, MinIO, Amazon AWS S3, Microsoft Azure, Google Cloud Storage, Python, Async
<!-- PROJECT LOGO -->
<br />
<div align="center">
  <a href="https://github.com/dynamicdeploy/cloudbulkupload">
    <img src="https://raw.githubusercontent.com/dynamicdeploy/cloudbulkupload/refs/heads/main/imgs/logo.png" alt="Logo">
  </a>
    
  <h3 align="center">Cloud Bulk Upload (cloudbulkupload)</h3>

  <p align="center">
    Python package for fast, parallel bulk file transfer to S3, Azure Blob Storage, and Google Cloud Storage!
    <br />
    <a href="https://pypi.org/project/cloudbulkupload/">See on PyPI</a>
    ·
    <a href="https://github.com/dynamicdeploy/cloudbulkupload/blob/main/examples.py">View Examples</a>
    ·
    <a href="https://github.com/dynamicdeploy/cloudbulkupload/issues">Report Bug/Request Feature</a>
    

![Python](https://img.shields.io/pypi/pyversions/cloudbulkupload.svg?style=flat)
![Version](https://img.shields.io/pypi/v/cloudbulkupload.svg?style=flat)
![License](https://img.shields.io/pypi/l/cloudbulkupload.svg?style=flat)
[![Downloads](https://img.shields.io/pypi/dm/cloudbulkupload.svg)](https://pypi.org/project/cloudbulkupload/)   

</p>
</div>

<!-- TABLE OF CONTENTS -->
<details>
  <summary>Table of Contents</summary>
  <ol>
    <li>
      <a href="#about-cloudbulkupload">About cloudbulkupload</a>
    </li>
    <li>
      <a href="#getting-started">Getting Started</a>
      <ul>
        <li><a href="#prerequisites">Prerequisites</a></li>
        <li><a href="#installation">Installation</a></li>
        <li><a href="#quick-start">Quick Start</a></li>
      </ul>
    </li>
    <li>
      <a href="#usage-by-provider">Usage by Provider</a>
      <ul>
        <li><a href="#aws-s3">AWS S3</a></li>
        <li><a href="#azure-blob-storage">Azure Blob Storage</a></li>
        <li><a href="#google-cloud-storage">Google Cloud Storage</a></li>
      </ul>
    </li>
    <li>
      <a href="#testing-and-performance">Testing and Performance</a>
      <ul>
        <li><a href="#running-tests">Running Tests</a></li>
        <li><a href="#performance-comparison">Performance Comparison</a></li>
        <li><a href="#test-results">Test Results</a></li>
      </ul>
    </li>
    <li>
      <a href="#documentation">Documentation</a>
    </li>
    <li>
      <a href="#contributing">Contributing</a>
    </li>
    <li>
      <a href="#contributors">Contributors</a>
    </li>
    <li>
      <a href="#license">License</a>
    </li>
  </ol>
</details>

## About cloudbulkupload

[Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html) is the official Python SDK
for accessing and managing all AWS resources such as Amazon Simple Storage Service (S3).
Boto3 works well for transferring a small number of files, but transferring a large number of
small files impedes performance. Although each file takes only a few milliseconds to transfer,
moving hundreds of thousands, or millions, of files sequentially can take hours.
Moreover, because Amazon S3 does not have true folders/directories, managing the hierarchy of directories and files
manually can be tedious, especially when many files are spread across different folders.

The `cloudbulkupload` package solves these issues. It speeds up the transfer of many small files to **Amazon AWS S3**, **Azure Blob Storage**, and **Google Cloud Storage** by
running multiple download/upload operations in parallel, using multi-threading and async/await patterns.
Depending on the number of cores of your machine, Cloud Bulk Upload can make cloud storage transfers up to **100X faster** than sequential
transfers with traditional Boto3! Furthermore, Cloud Bulk Upload preserves the original folder structure of files and
directories when transferring them.

### 🚀 Main Functionalities

- **🔄 Multi-Cloud Support**: AWS S3, Azure Blob Storage, and Google Cloud Storage
- **⚡ High Performance**: Multi-thread and async operations for maximum speed
- **📁 Directory Operations**: Upload/download entire directories with structure preservation
- **🎯 Bulk Operations**: Efficient handling of thousands of files
- **📊 Progress Tracking**: Built-in progress bars for long-running operations
- **🧪 Comprehensive Testing**: Full test suite with performance comparisons
- **🔧 Configurable**: Customizable concurrency, timeouts, and error handling
- **📈 Performance Monitoring**: Built-in metrics and comparison tools

### 🏆 Performance Benefits

- **100X faster** than sequential uploads
- **Async operations** for Azure and Google Cloud
- **Multi-threading** for AWS S3
- **Configurable concurrency** for optimal performance
- **Memory efficient** for large file sets

## Getting Started

### Prerequisites

* [Python 3.11+](https://www.python.org/)
* [pip](https://pip.pypa.io/en/stable/)
* API credentials for your chosen cloud provider(s)

**Note**: You can deploy a free S3-compatible server using [MinIO](https://min.io/) 
on your local machine for testing. See our [documentation](docs/TESTING.md) for setup instructions.
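If you don't have MinIO running yet, one common way to start a throwaway local server is with Docker (this command comes from MinIO's own documentation, not from this package; adjust ports and credentials as needed):

```bash
# Run a local MinIO server for testing.
# The S3-compatible API endpoint is http://localhost:9000; the web console is on port 9001.
docker run -p 9000:9000 -p 9001:9001 \
  -e "MINIO_ROOT_USER=minioadmin" \
  -e "MINIO_ROOT_PASSWORD=minioadmin" \
  quay.io/minio/minio server /data --console-address ":9001"
```

You can then point `endpoint_url` at `http://localhost:9000` in the examples below.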

### Installation

Use the package manager [pip](https://pypi.org/project/cloudbulkupload/) to install `cloudbulkupload`.

```bash
pip install cloudbulkupload
```

For development and testing:
```bash
pip install "cloudbulkupload[test]"
```

### Quick Start

```python
# AWS S3
from cloudbulkupload import BulkBoto3

aws_client = BulkBoto3(
    endpoint_url="your-endpoint",
    aws_access_key_id="your-key",
    aws_secret_access_key="your-secret",
    verbose=True
)

# Upload directory
aws_client.upload_dir_to_storage(
    bucket_name="my-bucket",
    local_dir="path/to/files",
    storage_dir="uploads",
    n_threads=50
)
```

```python
# Azure Blob Storage
import asyncio
from cloudbulkupload import BulkAzureBlob

async def azure_example():
    azure_client = BulkAzureBlob(
        connection_string="your-connection-string",
        verbose=True
    )
    
    await azure_client.upload_directory(
        container_name="my-container",
        local_dir="path/to/files",
        storage_dir="uploads"
    )

asyncio.run(azure_example())
```

```python
# Google Cloud Storage
import asyncio
from cloudbulkupload import BulkGoogleStorage

async def google_example():
    google_client = BulkGoogleStorage(
        project_id="your-project-id",
        verbose=True
    )
    
    await google_client.upload_directory(
        bucket_name="my-bucket",
        local_dir="path/to/files",
        storage_dir="uploads"
    )

asyncio.run(google_example())
```
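Because the Azure and Google clients are async, the Quick Start snippets can also be combined on a single event loop. The sketch below is illustrative only (placeholder credentials and paths) and reuses the same calls shown above to run both uploads concurrently:

```python
import asyncio

from cloudbulkupload import BulkAzureBlob, BulkGoogleStorage

async def upload_to_both():
    azure_client = BulkAzureBlob(connection_string="your-connection-string")
    google_client = BulkGoogleStorage(project_id="your-project-id")

    # Both upload_directory coroutines run concurrently on one event loop.
    await asyncio.gather(
        azure_client.upload_directory(
            container_name="my-container",
            local_dir="path/to/files",
            storage_dir="uploads",
        ),
        google_client.upload_directory(
            bucket_name="my-bucket",
            local_dir="path/to/files",
            storage_dir="uploads",
        ),
    )

asyncio.run(upload_to_both())
```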

## Usage by Provider

### AWS S3

AWS S3 support uses multi-threading for optimal performance on the AWS platform.

#### Basic Setup

```python
from cloudbulkupload import BulkBoto3

client = BulkBoto3(
    endpoint_url="https://s3.amazonaws.com",  # or your custom endpoint
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    max_pool_connections=300,
    verbose=True
)
```
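Hard-coding keys is usually undesirable. A variant of the same setup reads credentials from environment variables (the variable names below are just examples, not something the package requires):

```python
import os

from cloudbulkupload import BulkBoto3

client = BulkBoto3(
    # S3_ENDPOINT_URL is a hypothetical variable name used here for illustration.
    endpoint_url=os.environ.get("S3_ENDPOINT_URL", "https://s3.amazonaws.com"),
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    max_pool_connections=300,
    verbose=True,
)
```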

#### Directory Operations

```python
# Upload entire directory
client.upload_dir_to_storage(
    bucket_name="my-bucket",
    local_dir="path/to/local/directory",
    storage_dir="uploads/my-files",
    n_threads=50
)

# Download entire directory
client.download_dir_from_storage(
    bucket_name="my-bucket",
    storage_dir="uploads/my-files",
    local_dir="downloads",
    n_threads=50
)
```

#### Individual File Operations

```python
from cloudbulkupload import StorageTransferPath

# Upload specific files
upload_paths = [
    StorageTransferPath("file1.txt", "uploads/file1.txt"),
    StorageTransferPath("file2.txt", "uploads/file2.txt")
]

client.upload(bucket_name="my-bucket", upload_paths=upload_paths)

# Download specific files
download_paths = [
    StorageTransferPath("uploads/file1.txt", "local/file1.txt"),
    StorageTransferPath("uploads/file2.txt", "local/file2.txt")
]

client.download(bucket_name="my-bucket", download_paths=download_paths)
```

#### Bucket Management

```python
# Create bucket
client.create_new_bucket("new-bucket-name")

# List objects
objects = client.list_objects(bucket_name="my-bucket", storage_dir="uploads")

# Check if object exists
exists = client.check_object_exists(bucket_name="my-bucket", object_path="uploads/file.txt")

# Empty bucket
client.empty_bucket("my-bucket")
```

### Azure Blob Storage

Azure Blob Storage support uses async/await patterns for optimal performance.

#### Basic Setup

```python
import asyncio
from cloudbulkupload import BulkAzureBlob

async def main():
    client = BulkAzureBlob(
        connection_string="your-azure-connection-string",
        max_concurrent_operations=50,
        verbose=True
    )
    
    # Your operations here
    await client.upload_directory(
        container_name="my-container",
        local_dir="path/to/files",
        storage_dir="uploads"
    )

asyncio.run(main())
```

#### Directory Operations

```python
# Upload directory
await client.upload_directory(
    container_name="my-container",
    local_dir="path/to/local/directory",
    storage_dir="uploads/my-files"
)

# Download directory
await client.download_directory(
    container_name="my-container",
    storage_dir="uploads/my-files",
    local_dir="downloads"
)
```

#### Individual File Operations

```python
from cloudbulkupload import StorageTransferPath

# Upload specific files
upload_paths = [
    StorageTransferPath("file1.txt", "uploads/file1.txt"),
    StorageTransferPath("file2.txt", "uploads/file2.txt")
]

await client.upload_files("my-container", upload_paths)

# Download specific files
download_paths = [
    StorageTransferPath("uploads/file1.txt", "local/file1.txt"),
    StorageTransferPath("uploads/file2.txt", "local/file2.txt")
]

await client.download_files("my-container", download_paths)
```

#### Container Management

```python
# Create container
await client.create_container("new-container")

# List blobs
blobs = await client.list_blobs("my-container", prefix="uploads/")

# Check if blob exists
exists = await client.check_blob_exists("my-container", "uploads/file.txt")

# Empty container
await client.empty_container("my-container")
```

#### Convenience Functions

```python
from cloudbulkupload import bulk_upload_blobs, bulk_download_blobs

# Bulk upload
files = ["file1.txt", "file2.txt", "file3.txt"]
await bulk_upload_blobs(
    connection_string="your-connection-string",
    container_name="my-container",
    files_to_upload=files,
    max_concurrent=50,
    verbose=True
)

# Bulk download
await bulk_download_blobs(
    connection_string="your-connection-string",
    container_name="my-container",
    files_to_download=files,
    local_dir="downloads",
    max_concurrent=50,
    verbose=True
)
```

### Google Cloud Storage

Google Cloud Storage support uses async/await patterns and includes a hybrid approach with Google's Transfer Manager for maximum performance.

#### Basic Setup

```python
import asyncio
from cloudbulkupload import BulkGoogleStorage

async def main():
    client = BulkGoogleStorage(
        project_id="your-project-id",
        credentials_path="/path/to/service-account.json",  # Optional
        max_concurrent_operations=50,
        verbose=True
    )
    
    # Your operations here
    await client.upload_directory(
        bucket_name="my-bucket",
        local_dir="path/to/files",
        storage_dir="uploads"
    )

asyncio.run(main())
```

#### Authentication Options

```python
# Method 1: Service Account Key File
client = BulkGoogleStorage(
    project_id="your-project-id",
    credentials_path="/path/to/service-account.json"
)

# Method 2: Service Account JSON String (for cloud/container environments)
client = BulkGoogleStorage(
    project_id="your-project-id",
    credentials_json='{"type": "service_account", ...}'
)

# Method 3: Application Default Credentials
client = BulkGoogleStorage(project_id="your-project-id")
```
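For Method 3, Application Default Credentials are resolved by the Google client libraries themselves. One common way to supply them outside of a GCP environment is the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable:

```bash
# Point Application Default Credentials at a service-account key file
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```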

#### Directory Operations

```python
# Upload directory
await client.upload_directory(
    bucket_name="my-bucket",
    local_dir="path/to/local/directory",
    storage_dir="uploads/my-files"
)

# Download directory
await client.download_directory(
    bucket_name="my-bucket",
    storage_dir="uploads/my-files",
    local_dir="downloads"
)
```

#### Individual File Operations

```python
from cloudbulkupload import StorageTransferPath

# Upload specific files
upload_paths = [
    StorageTransferPath("file1.txt", "uploads/file1.txt"),
    StorageTransferPath("file2.txt", "uploads/file2.txt")
]

await client.upload_files("my-bucket", upload_paths)

# Download specific files
download_paths = [
    StorageTransferPath("uploads/file1.txt", "local/file1.txt"),
    StorageTransferPath("uploads/file2.txt", "local/file2.txt")
]

await client.download_files("my-bucket", download_paths)
```

#### Hybrid Approach: Standard vs Transfer Manager

```python
# Standard Mode (Consistent API across all providers)
await client.upload_files("my-bucket", upload_paths)

# Transfer Manager Mode (High Performance - Google Cloud only)
await client.upload_files("my-bucket", upload_paths, use_transfer_manager=True)
```
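To see what the Transfer Manager buys you on your own data, a rough timing sketch (illustrative only, run inside the same async context and reusing the `client` and `upload_paths` from the examples above) could look like this:

```python
import time

# Time the standard async mode
start = time.perf_counter()
await client.upload_files("my-bucket", upload_paths)
standard_s = time.perf_counter() - start

# Time the Transfer Manager mode (re-uploads the same files)
start = time.perf_counter()
await client.upload_files("my-bucket", upload_paths, use_transfer_manager=True)
transfer_manager_s = time.perf_counter() - start

print(f"standard: {standard_s:.2f}s, transfer manager: {transfer_manager_s:.2f}s")
```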

#### Bucket Management

```python
# Create bucket
await client.create_bucket("new-bucket-name")

# List blobs
blobs = await client.list_blobs("my-bucket", prefix="uploads/")

# Check if blob exists
exists = await client.check_blob_exists("my-bucket", "uploads/file.txt")

# Empty bucket
await client.empty_bucket("my-bucket")
```

#### Convenience Functions

```python
from cloudbulkupload import google_bulk_upload_blobs, google_bulk_download_blobs

# Bulk upload
files = ["file1.txt", "file2.txt", "file3.txt"]
await google_bulk_upload_blobs(
    project_id="your-project-id",
    bucket_name="my-bucket",
    files_to_upload=files,
    max_concurrent=50,
    verbose=True,
    use_transfer_manager=True  # Optional: Use Google's Transfer Manager
)

# Bulk download
await google_bulk_download_blobs(
    project_id="your-project-id",
    bucket_name="my-bucket",
    files_to_download=files,
    local_dir="downloads",
    max_concurrent=50,
    verbose=True
)
```

## Testing and Performance

### Running Tests

The package includes a comprehensive test suite covering all providers, along with performance comparison tools.

#### Install Test Dependencies

```bash
pip install "cloudbulkupload[test]"
```

#### Run Different Test Types

```bash
# Unit tests
python run_tests.py --type unit

# Performance tests
python run_tests.py --type performance

# AWS S3 tests
python run_tests.py --type aws

# Azure Blob Storage tests
python run_tests.py --type azure

# Google Cloud Storage tests
python run_tests.py --type google-cloud

# AWS vs Azure comparison
python run_tests.py --type azure-comparison

# Three-way comparison (AWS, Azure, Google)
python run_tests.py --type three-way-comparison

# All tests
python run_tests.py --type all
```

#### Individual Test Files

```bash
# Run specific test files
python tests/aws_s3_test.py
python tests/azure_blob_test.py
python tests/google_cloud_test.py
python tests/performance_comparison_three_way.py
```

### Performance Comparison

The package includes built-in performance comparison tools to test and compare different cloud providers.

#### Three-Way Performance Comparison

```bash
python tests/performance_comparison_three_way.py
```

This will:
- Test AWS S3, Azure Blob Storage, and Google Cloud Storage
- Compare upload/download speeds
- Generate performance reports
- Create CSV files with detailed metrics

#### Performance Metrics

The tests measure:
- **Upload Speed**: MB/s for different file sizes
- **Download Speed**: MB/s for different file sizes
- **Concurrency Impact**: Performance with different thread counts
- **File Size Impact**: Performance with different file sizes
- **Provider Comparison**: Direct comparison between AWS, Azure, and Google Cloud
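The MB/s figures in these reports follow the usual throughput definition; as a purely illustrative calculation (not part of the package API):

```python
def throughput_mb_per_s(total_bytes: int, elapsed_seconds: float) -> float:
    """Transfer speed in MB/s, with 1 MB = 1024 * 1024 bytes."""
    return (total_bytes / (1024 * 1024)) / elapsed_seconds

# Example: 500 files of 1 MiB transferred in 60 seconds -> ~8.3 MB/s
print(f"{throughput_mb_per_s(500 * 1024 * 1024, 60):.1f} MB/s")
```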

#### Expected Performance

Based on our testing:
- **AWS S3**: 5-8 MB/s with multi-threading
- **Azure Blob Storage**: 6-9 MB/s with async operations
- **Google Cloud Storage**: 6-9 MB/s with async operations
- **Google Transfer Manager**: 8-12 MB/s for large files

### Test Results

Test results are automatically generated and saved to:
- `performance_comparison_results.csv` - AWS vs Azure comparison
- `performance_comparison_three_way_results.csv` - Three-way comparison
- `test_results.csv` - General test results
- `google_cloud_test_results.json` - Google Cloud specific results
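To inspect the CSV output programmatically, a minimal sketch using pandas (pandas is not a dependency of `cloudbulkupload`; install it separately if you want this):

```python
import pandas as pd

# Load the three-way comparison results and show the first few rows
results = pd.read_csv("performance_comparison_three_way_results.csv")
print(results.head())
```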

For detailed test documentation, see [docs/TESTING.md](docs/TESTING.md).

## Documentation

Comprehensive documentation is available in the `docs/` directory:

### 📚 Implementation Guides
- [docs/AZURE_GUIDE.md](docs/AZURE_GUIDE.md) - Complete Azure Blob Storage guide
- [docs/GOOGLE_CLOUD_GUIDE.md](docs/GOOGLE_CLOUD_GUIDE.md) - Complete Google Cloud Storage guide

### 📋 Implementation Summaries
- [docs/AZURE_IMPLEMENTATION_SUMMARY.md](docs/AZURE_IMPLEMENTATION_SUMMARY.md) - Azure implementation details
- [docs/GOOGLE_CLOUD_IMPLEMENTATION_SUMMARY.md](docs/GOOGLE_CLOUD_IMPLEMENTATION_SUMMARY.md) - Google Cloud implementation details

### 🧪 Testing Documentation
- [docs/TESTING.md](docs/TESTING.md) - Complete testing guide
- [docs/TEST_RESULTS.md](docs/TEST_RESULTS.md) - Test results and analysis
- [docs/COMPREHENSIVE_TEST_SUMMARY.md](docs/COMPREHENSIVE_TEST_SUMMARY.md) - Comprehensive test summary

### 📦 PyPI Publishing
- [docs/PYPI_PUBLISHING_GUIDE.md](docs/PYPI_PUBLISHING_GUIDE.md) - How to publish to PyPI
- [docs/PYPI_QUICK_REFERENCE.md](docs/PYPI_QUICK_REFERENCE.md) - Quick PyPI reference

### 📖 Original Documentation
- [docs/ORIGINAL_README.md](docs/ORIGINAL_README.md) - Original README for reference

## Contributing

Any contributions you make are **greatly appreciated**. If you have a suggestion that would make this better, please fork the repo and create a pull request. 
You can also simply open an issue with the tag "enhancement". To contribute to `cloudbulkupload`, follow these steps:

1. Fork this repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Make your changes and commit them (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a pull request

Alternatively, see the GitHub documentation on [creating a pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request).

### Development Setup

```bash
# Clone the repository
git clone https://github.com/dynamicdeploy/cloudbulkupload.git
cd cloudbulkupload

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[test,dev]"

# Run tests
python run_tests.py --type all
```

## Contributors

Thanks to the following people who have contributed to this project:

* [Amir Masoud Sefidian](https://sefidian.com/) 📖 - Original creator of the bulk upload concept
* [Dynamic Deploy](https://github.com/dynamicdeploy) 🚀 - Multi-cloud expansion and maintenance

## License

Distributed under the [MIT](https://choosealicense.com/licenses/mit/) License. See `LICENSE` for more information.

---

## Credits

This project is based on the original work by **Amir Masoud Sefidian**, who created the bulk upload concept and initial implementation. The original repository can be found at: [https://github.com/iamirmasoud/bulkboto3](https://github.com/iamirmasoud/bulkboto3)

The project has been significantly expanded to support multiple cloud providers (AWS S3, Azure Blob Storage, and Google Cloud Storage) while maintaining the core performance benefits of the original implementation.




            
