resource-backed-dask-array


Nameresource-backed-dask-array JSON
Version 0.1.0 PyPI version JSON
download
home_pagehttps://github.com/tlambert03/resource-backed-dask-array
Summaryexperimental Dask array that opens/closes a resource when computing
upload_time2022-02-18 02:10:06
maintainer
docs_urlNone
authorTalley Lambert
requires_python>=3.7
licenseBSD-3-Clause
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # resource-backed-dask-array

[![License](https://img.shields.io/pypi/l/resource-backed-dask-array.svg?color=green)](https://github.com/tlambert03/resource-backed-dask-array/raw/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/resource-backed-dask-array.svg?color=green)](https://pypi.org/project/resource-backed-dask-array)
[![Python Version](https://img.shields.io/pypi/pyversions/resource-backed-dask-array.svg?color=green)](https://python.org)
[![CI](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml/badge.svg)](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/tlambert03/resource-backed-dask-array/branch/main/graph/badge.svg)](https://codecov.io/gh/tlambert03/resource-backed-dask-array)

`ResourceBackedDaskArray` is an experimental Dask array subclass
that opens/closes a resource when computing (but only once per compute call).

## installation

```sh
pip install resource-backed-dask-array
```

## motivation for this package

Consider the following class that simulates a file reader capable of returning a
dask array (using
[dask.array.map_blocks](https://docs.dask.org/en/latest/generated/dask.array.map_blocks.html))
The file handle must be in an *open* state in order to read a chunk, otherwise
it segfaults (or otherwise errors)

```python
import dask.array as da
import numpy as np


class FileReader:

    def __init__(self):
        self._closed = False

    def close(self):
        """close the imaginary file"""
        self._closed = True

    @property
    def closed(self):
        return self._closed

    def __enter__(self):
        if self.closed:
            self._closed = False  # open
        return self

    def __exit__(self, *_):
        self.close()

    def to_dask(self) -> da.Array:
        """Method that returns a dask array for this file."""
        return da.map_blocks(
            self._dask_block,
            chunks=((1,) * 4, 4, 4),
            dtype=float,
        )

    def _dask_block(self):
        """simulate getting a single chunk of the file."""
        if self.closed:
            raise RuntimeError("Segfault!")
        return np.random.rand(1, 4, 4)
```

As long as the file stays open, everything works fine:

```python
>>> fr = FileReader()
>>> dsk_ary = fr.to_dask()
>>> dsk_ary.compute().shape
(4, 4, 4)
```

However, if one closes the file, the dask array returned
from `to_dask` will now fail:

```python
>>> fr.close()
>>> dsk_ary.compute()  # RuntimeError: Segfault!
```

A "quick-and-dirty" solution here might be to force the `_dask_block` method to
temporarily reopen the file if it finds the file in the closed state, but if the
file-open process takes any amount of time, this could incur significant
overhead as it opens-and-closes for *every* chunk in the array.

## usage

`ResourceBackedDaskArray.from_array`

This library attempts to provide a solution to the above problem with a
`ResourceBackedDaskArray` object.  This manages the opening/closing of
an underlying resource whenever [`.compute()`](https://docs.dask.org/en/stable/generated/dask.array.Array.compute.html#dask.array.Array.compute) is called – and does so only once for all chunks in a single compute task graph.

```python
>>> from resource_backed_dask_array import resource_backed_dask_array
>>> safe_dsk_ary = resource_backed_dask_array(dsk_ary, fr)
>>> safe_dsk_ary.compute().shape
(4, 4, 4)

>>> fr.closed  # leave it as we found it
True
```

The second argument passed to `from_array` must be a [resuable context manager](https://docs.python.org/3/library/contextlib.html#reusable-context-managers)
that additionally provides a `closed` attribute (like [io.IOBase](https://docs.python.org/3/library/io.html#io.IOBase.closed)).  In other words,
it must implement the following protocol:

1. it must have an [`__enter__` method](https://docs.python.org/3/reference/datamodel.html#object.__enter__) that opens the underlying resource
2. it must have an [`__exit__` method](https://docs.python.org/3/reference/datamodel.html#object.__exit__) that closes the resource and optionally handles exceptions
3. it must have a `closed` attribute that reports whether or not the resource is closed.

In the example above, the `FileReader` class itself implemented this protocol, and so was suitable as the second argument to `ResourceBackedDaskArray.from_array` above.

## Important Caveats

This was created for single-process (and maybe just single-threaded?)
use cases where dask's out-of-core lazy loading is still very desireable.  Usage
with `dask.distributed` is untested and may very well fail.  Using stateful objects (such as the reusable context manager used here) in multi-threaded/processed tasks is error prone.



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/tlambert03/resource-backed-dask-array",
    "name": "resource-backed-dask-array",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "",
    "author": "Talley Lambert",
    "author_email": "talley.lambert@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/62/80/b8952048ae1772d33b95dbf7d7107cf364c037cc229a2690fc8fa9ee8e48/resource_backed_dask_array-0.1.0.tar.gz",
    "platform": "",
    "description": "# resource-backed-dask-array\n\n[![License](https://img.shields.io/pypi/l/resource-backed-dask-array.svg?color=green)](https://github.com/tlambert03/resource-backed-dask-array/raw/main/LICENSE)\n[![PyPI](https://img.shields.io/pypi/v/resource-backed-dask-array.svg?color=green)](https://pypi.org/project/resource-backed-dask-array)\n[![Python Version](https://img.shields.io/pypi/pyversions/resource-backed-dask-array.svg?color=green)](https://python.org)\n[![CI](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml/badge.svg)](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/tlambert03/resource-backed-dask-array/branch/main/graph/badge.svg)](https://codecov.io/gh/tlambert03/resource-backed-dask-array)\n\n`ResourceBackedDaskArray` is an experimental Dask array subclass\nthat opens/closes a resource when computing (but only once per compute call).\n\n## installation\n\n```sh\npip install resource-backed-dask-array\n```\n\n## motivation for this package\n\nConsider the following class that simulates a file reader capable of returning a\ndask array (using\n[dask.array.map_blocks](https://docs.dask.org/en/latest/generated/dask.array.map_blocks.html))\nThe file handle must be in an *open* state in order to read a chunk, otherwise\nit segfaults (or otherwise errors)\n\n```python\nimport dask.array as da\nimport numpy as np\n\n\nclass FileReader:\n\n    def __init__(self):\n        self._closed = False\n\n    def close(self):\n        \"\"\"close the imaginary file\"\"\"\n        self._closed = True\n\n    @property\n    def closed(self):\n        return self._closed\n\n    def __enter__(self):\n        if self.closed:\n            self._closed = False  # open\n        return self\n\n    def __exit__(self, *_):\n        self.close()\n\n    def to_dask(self) -> da.Array:\n        \"\"\"Method that returns a dask array for this file.\"\"\"\n        return da.map_blocks(\n            self._dask_block,\n            chunks=((1,) * 4, 4, 4),\n            dtype=float,\n        )\n\n    def _dask_block(self):\n        \"\"\"simulate getting a single chunk of the file.\"\"\"\n        if self.closed:\n            raise RuntimeError(\"Segfault!\")\n        return np.random.rand(1, 4, 4)\n```\n\nAs long as the file stays open, everything works fine:\n\n```python\n>>> fr = FileReader()\n>>> dsk_ary = fr.to_dask()\n>>> dsk_ary.compute().shape\n(4, 4, 4)\n```\n\nHowever, if one closes the file, the dask array returned\nfrom `to_dask` will now fail:\n\n```python\n>>> fr.close()\n>>> dsk_ary.compute()  # RuntimeError: Segfault!\n```\n\nA \"quick-and-dirty\" solution here might be to force the `_dask_block` method to\ntemporarily reopen the file if it finds the file in the closed state, but if the\nfile-open process takes any amount of time, this could incur significant\noverhead as it opens-and-closes for *every* chunk in the array.\n\n## usage\n\n`ResourceBackedDaskArray.from_array`\n\nThis library attempts to provide a solution to the above problem with a\n`ResourceBackedDaskArray` object.  This manages the opening/closing of\nan underlying resource whenever [`.compute()`](https://docs.dask.org/en/stable/generated/dask.array.Array.compute.html#dask.array.Array.compute) is called \u2013 and does so only once for all chunks in a single compute task graph.\n\n```python\n>>> from resource_backed_dask_array import resource_backed_dask_array\n>>> safe_dsk_ary = resource_backed_dask_array(dsk_ary, fr)\n>>> safe_dsk_ary.compute().shape\n(4, 4, 4)\n\n>>> fr.closed  # leave it as we found it\nTrue\n```\n\nThe second argument passed to `from_array` must be a [resuable context manager](https://docs.python.org/3/library/contextlib.html#reusable-context-managers)\nthat additionally provides a `closed` attribute (like [io.IOBase](https://docs.python.org/3/library/io.html#io.IOBase.closed)).  In other words,\nit must implement the following protocol:\n\n1. it must have an [`__enter__` method](https://docs.python.org/3/reference/datamodel.html#object.__enter__) that opens the underlying resource\n2. it must have an [`__exit__` method](https://docs.python.org/3/reference/datamodel.html#object.__exit__) that closes the resource and optionally handles exceptions\n3. it must have a `closed` attribute that reports whether or not the resource is closed.\n\nIn the example above, the `FileReader` class itself implemented this protocol, and so was suitable as the second argument to `ResourceBackedDaskArray.from_array` above.\n\n## Important Caveats\n\nThis was created for single-process (and maybe just single-threaded?)\nuse cases where dask's out-of-core lazy loading is still very desireable.  Usage\nwith `dask.distributed` is untested and may very well fail.  Using stateful objects (such as the reusable context manager used here) in multi-threaded/processed tasks is error prone.\n\n\n",
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "experimental Dask array that opens/closes a resource when computing",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/tlambert03/resource-backed-dask-array",
        "Source Code": "https://github.com/tlambert03/resource-backed-dask-array"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0db5852f619e53fa7fb70d8915fcae66632df3958cac7e926c4ac38458958674",
                "md5": "3da22ac0ac7f3d70d1d87639aa399a46",
                "sha256": "ec457fa72d81f0340a67ea6557a5a5919323a11cccc978a950df29fa69fe5679"
            },
            "downloads": -1,
            "filename": "resource_backed_dask_array-0.1.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3da22ac0ac7f3d70d1d87639aa399a46",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.7",
            "size": 8044,
            "upload_time": "2022-02-18T02:10:05",
            "upload_time_iso_8601": "2022-02-18T02:10:05.559297Z",
            "url": "https://files.pythonhosted.org/packages/0d/b5/852f619e53fa7fb70d8915fcae66632df3958cac7e926c4ac38458958674/resource_backed_dask_array-0.1.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6280b8952048ae1772d33b95dbf7d7107cf364c037cc229a2690fc8fa9ee8e48",
                "md5": "dea76680ec59e13b0bc9b3df93bbf65c",
                "sha256": "8fabcccf5c7e29059b5badd6786dd7675a258a203c58babf10077d9c90ada54f"
            },
            "downloads": -1,
            "filename": "resource_backed_dask_array-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "dea76680ec59e13b0bc9b3df93bbf65c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 10300,
            "upload_time": "2022-02-18T02:10:06",
            "upload_time_iso_8601": "2022-02-18T02:10:06.981807Z",
            "url": "https://files.pythonhosted.org/packages/62/80/b8952048ae1772d33b95dbf7d7107cf364c037cc229a2690fc8fa9ee8e48/resource_backed_dask_array-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-02-18 02:10:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "tlambert03",
    "github_project": "resource-backed-dask-array",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "resource-backed-dask-array"
}
        
Elapsed time: 0.12178s