gs-fastcopy

- **Name:** gs-fastcopy
- **Version:** 1.0a6
- **Summary:** Optimized file transfer and compression for large files on Google Cloud Storage
- **Upload time:** 2024-10-19 04:23:25
- **Requires Python:** >=3.9, <=3.12
- **Author email:** David Haley <dchaley@gmail.com>
- **License:** Boost Software License - Version 1.0
- **Keywords:** file, copy, cloud, storage, google, gcp
# gs_fastcopy (python)

Optimized file copying & compression for large files on Google Cloud Storage.

**TLDR:**

```python
import gs_fastcopy
import numpy as np

with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))

with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```



Provides file-like interfaces for:

- Parallel, [XML multipart](https://cloud.google.com/storage/docs/multipart-uploads) uploads to Cloud Storage.
- Parallel, sliced downloads from Cloud Storage using `gcloud storage`.
- Parallel (de)compression using [`pigz` and `unpigz`](https://github.com/madler/pigz) if available, with fallback to standard `gzip` and `gunzip` (see the sketch below).

Together, these made uploading a 1.2 GB file ~70% faster, and downloading it ~40% faster.

> [!NOTE]
>
> This benchmark is being tested more rigorously; stay tuned.
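
As a rough illustration of that compressor fallback, here's a minimal sketch of how the selection might work. This is not gs_fastcopy's actual code, and the helper name is ours:

```python
import shutil

def compressor_cmd() -> list[str]:
    """Pick a gzip-compatible compressor: pigz if on PATH, else gzip.

    A sketch of the fallback described above; gs_fastcopy's real
    selection logic may differ.
    """
    if shutil.which("pigz"):
        return ["pigz", "-c"]  # parallel compression across all cores
    return ["gzip", "-c"]      # single-threaded fallback

# Usage sketch: pipe a temp file through the compressor with subprocess, e.g.
# subprocess.run(compressor_cmd() + [tmp_path], stdout=out_file, check=True)
```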

## Examples

`gs_fastcopy` is easy to use for reading and writing files.

You can use it without compression:

```python
with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```

`gs_fastcopy` also handles gzip compression transparently, keyed on the `.gz` extension. Note that we don't use numpy's `savez_compressed`: the `.gz` name already triggers gs_fastcopy's own compression, so compressing again inside numpy would be redundant.

```python
with gs_fastcopy.write('gs://my-bucket/my-file.npz.gz') as f:
    np.savez(f, a=np.zeros(12), b=np.ones(23))
    
with gs_fastcopy.read('gs://my-bucket/my-file.npz.gz') as f:
    npz = np.load(f)
    a = npz['a']
    b = npz['b']
```

## Caveats & limitations

* **You need a `__main__` guard.**

  Subprocesses spawned during parallel processing re-import the main script. Without a guard, the script re-executes itself in each worker, which is especially bad if it then spawns subprocesses of its own… (See the minimal guard sketched after this list.)

  See also [gs-fastcopy-python#5](https://github.com/redwoodconsulting-io/gs-fastcopy-python/issues/5) with a further note on freezing scripts into executables.

* **You need a filesystem.**

  Because `gs_fastcopy` uses tools that work with files, it must be able to read/write a filesystem, in particular temporary files as set up by `tempfile.TemporaryDirectory()` [[python docs](https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory)].

  This is surprisingly versatile: even "very" serverless environments like Cloud Functions present an in-memory file system.

* **You need the `gcloud` SDK on your path.**

  Or, at least the `gcloud storage` component of the SDK.

  `gs_fastcopy` uses `gcloud` to download files (see the example after this list).

  Issue [#2](https://github.com/redwoodconsulting-io/gs-fastcopy-python/issues/2) considers falling back to Python API downloads if the specialized tools aren't available.

* **You need enough disk space for the compressed & uncompressed files, together.**

  Because `gs_fastcopy` writes the (un)compressed file to disk while (de)compressing it, the file system needs to accommodate both files while the operation is in progress.
  
  * Reads from Cloud: (1) fetch to temp file; (2) decompress if gzipped; (3) stream temp file to Python app via `read()`; (4) delete the temp file
  * Writes to Cloud: (1) app writes to temp file; (2) compress if gzipped; (3) upload temp file to Google Cloud; (4) delete the temp file
  
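As promised above, here is a minimal `__main__` guard, reusing the README's own example. The guard is the point; everything else is as before:

```python
import gs_fastcopy
import numpy as np

def main():
    with gs_fastcopy.write('gs://my-bucket/my-file.npz') as f:
        np.savez(f, a=np.zeros(12), b=np.ones(23))

if __name__ == '__main__':
    # Without this guard, worker processes spawned for parallel
    # work would re-execute the whole script on import.
    main()
```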

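And for the `gcloud` dependency, this is roughly the kind of invocation involved in a sliced download. A sketch only: the exact flags gs_fastcopy passes may differ, and the paths are illustrative.

```python
import subprocess

# `gcloud storage cp` slices large downloads across parallel workers on its own.
subprocess.run(
    ["gcloud", "storage", "cp",
     "gs://my-bucket/my-file.npz.gz", "/tmp/scratch/my-file.npz.gz"],
    check=True,
)
```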
## Why gs_fastcopy

APIs for Google Storage (GS) typically present `File`-like interfaces which read/write data sequentially. For example: open up a stream then write bytes to it until done. Data is streamed between cloud storage and memory. It's easy to use stream-based compression like `gzip` along the way.

Libraries like [`smart_open`](https://github.com/piskvorky/smart_open) add yet more convenience, providing a unified interface for reading/writing local files and several cloud providers, with transparent compression for `.gz` files. Quite delightful!
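
For instance, a `smart_open` write reads just like the builtin `open`, streaming bytes sequentially and compressing by file extension (paths here are illustrative):

```python
from smart_open import open  # intentionally shadows the builtin open

# Single-threaded streaming write; gzip applied transparently via the .gz suffix.
with open('gs://my-bucket/my-file.txt.gz', 'w') as f:
    f.write('hello from a single stream\n')
```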

Unfortunately, these approaches are single-threaded. We [noticed](https://github.com/dchaley/deepcell-imaging/issues/248) that transfer time for files sized in the many hundreds of megabytes was higher than expected. [@lynnlangit](https://github.com/lynnlangit) pointed me toward the composite upload feature in `gcloud storage cp`. A "few" hours later, `gs_fastcopy` came to be.

## Why both `gcloud` and XML multi-part

I'm glad you asked! I initially implemented this just with `gcloud`'s [composite uploads](https://cloud.google.com/storage/docs/parallel-composite-uploads). But the documentation gave a few warnings about composite uploads.

> [!WARNING]
>
> Parallel composite uploads involve deleting temporary objects shortly after upload. Keep in mind the following:
>
> * Because other storage classes are subject to [early deletion fees](https://cloud.google.com/storage/pricing#early-delete), you should always use [Standard storage](https://cloud.google.com/storage/docs/storage-classes#standard) for temporary objects. Once the final object is composed, you can change its storage class.
> * You should not use parallel composite uploads when uploading to a bucket that has a [retention policy](https://cloud.google.com/storage/docs/bucket-lock), because the temporary objects can't be deleted until they meet the retention period.
> * If the bucket you upload to has [default object holds](https://cloud.google.com/storage/docs/object-holds#default-holds) enabled, you must [release the hold](https://cloud.google.com/storage/docs/holding-objects#set-object-hold) from each temporary object before you can delete it.

Basically, composite uploads leverage independent API functions, whereas XML multi-part is a managed operation. The managed operation plays more nicely with other features like retention policies. On the other hand, because it's separate, the XML multi-part API needs additional permissions. (We may need to fall back to `gcloud` in that case!)

On top of being "weird" in these ways, composite uploads are actually slower. I found this wonderful benchmarking by Christopher Madden: [High throughput file transfers with Google Cloud Storage (GCS)](https://www.beginswithdata.com/2024/02/01/google-cloud-storage-max-throughput/). TLDR, `gcloud` sliced downloads outperform the Python API, but for writes the XML multi-part API is best. (By far, if many cores are available.)
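
For the curious: the official Python client also exposes the XML multi-part API through its `transfer_manager` module (google-cloud-storage 2.7+). A minimal sketch, with bucket and paths assumed, looks like this:

```python
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").blob("my-file.npz.gz")

# Chunked, parallel upload via the XML multi-part API.
transfer_manager.upload_chunks_concurrently(
    "/tmp/scratch/my-file.npz.gz",  # local source file
    blob,
    chunk_size=32 * 1024 * 1024,    # 32 MiB parts
    max_workers=8,                  # parallel worker processes
)
```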


            
