# resource-backed-dask-array
[![License](https://img.shields.io/pypi/l/resource-backed-dask-array.svg?color=green)](https://github.com/tlambert03/resource-backed-dask-array/raw/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/resource-backed-dask-array.svg?color=green)](https://pypi.org/project/resource-backed-dask-array)
[![Python Version](https://img.shields.io/pypi/pyversions/resource-backed-dask-array.svg?color=green)](https://python.org)
[![CI](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml/badge.svg)](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/tlambert03/resource-backed-dask-array/branch/main/graph/badge.svg)](https://codecov.io/gh/tlambert03/resource-backed-dask-array)
`ResourceBackedDaskArray` is an experimental Dask array subclass
that opens/closes a resource when computing (but only once per compute call).
## installation
```sh
pip install resource-backed-dask-array
```
## motivation for this package
Consider the following class that simulates a file reader capable of returning a
dask array (using
[dask.array.map_blocks](https://docs.dask.org/en/latest/generated/dask.array.map_blocks.html))
The file handle must be in an *open* state in order to read a chunk, otherwise
it segfaults (or otherwise errors)
```python
import dask.array as da
import numpy as np
class FileReader:
def __init__(self):
self._closed = False
def close(self):
"""close the imaginary file"""
self._closed = True
@property
def closed(self):
return self._closed
def __enter__(self):
if self.closed:
self._closed = False # open
return self
def __exit__(self, *_):
self.close()
def to_dask(self) -> da.Array:
"""Method that returns a dask array for this file."""
return da.map_blocks(
self._dask_block,
chunks=((1,) * 4, 4, 4),
dtype=float,
)
def _dask_block(self):
"""simulate getting a single chunk of the file."""
if self.closed:
raise RuntimeError("Segfault!")
return np.random.rand(1, 4, 4)
```
As long as the file stays open, everything works fine:
```python
>>> fr = FileReader()
>>> dsk_ary = fr.to_dask()
>>> dsk_ary.compute().shape
(4, 4, 4)
```
However, if one closes the file, the dask array returned
from `to_dask` will now fail:
```python
>>> fr.close()
>>> dsk_ary.compute() # RuntimeError: Segfault!
```
A "quick-and-dirty" solution here might be to force the `_dask_block` method to
temporarily reopen the file if it finds the file in the closed state, but if the
file-open process takes any amount of time, this could incur significant
overhead as it opens-and-closes for *every* chunk in the array.
## usage
`ResourceBackedDaskArray.from_array`
This library attempts to provide a solution to the above problem with a
`ResourceBackedDaskArray` object. This manages the opening/closing of
an underlying resource whenever [`.compute()`](https://docs.dask.org/en/stable/generated/dask.array.Array.compute.html#dask.array.Array.compute) is called – and does so only once for all chunks in a single compute task graph.
```python
>>> from resource_backed_dask_array import resource_backed_dask_array
>>> safe_dsk_ary = resource_backed_dask_array(dsk_ary, fr)
>>> safe_dsk_ary.compute().shape
(4, 4, 4)
>>> fr.closed # leave it as we found it
True
```
The second argument passed to `from_array` must be a [resuable context manager](https://docs.python.org/3/library/contextlib.html#reusable-context-managers)
that additionally provides a `closed` attribute (like [io.IOBase](https://docs.python.org/3/library/io.html#io.IOBase.closed)). In other words,
it must implement the following protocol:
1. it must have an [`__enter__` method](https://docs.python.org/3/reference/datamodel.html#object.__enter__) that opens the underlying resource
2. it must have an [`__exit__` method](https://docs.python.org/3/reference/datamodel.html#object.__exit__) that closes the resource and optionally handles exceptions
3. it must have a `closed` attribute that reports whether or not the resource is closed.
In the example above, the `FileReader` class itself implemented this protocol, and so was suitable as the second argument to `ResourceBackedDaskArray.from_array` above.
## Important Caveats
This was created for single-process (and maybe just single-threaded?)
use cases where dask's out-of-core lazy loading is still very desireable. Usage
with `dask.distributed` is untested and may very well fail. Using stateful objects (such as the reusable context manager used here) in multi-threaded/processed tasks is error prone.
Raw data
{
"_id": null,
"home_page": "https://github.com/tlambert03/resource-backed-dask-array",
"name": "resource-backed-dask-array",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "",
"author": "Talley Lambert",
"author_email": "talley.lambert@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/62/80/b8952048ae1772d33b95dbf7d7107cf364c037cc229a2690fc8fa9ee8e48/resource_backed_dask_array-0.1.0.tar.gz",
"platform": "",
"description": "# resource-backed-dask-array\n\n[![License](https://img.shields.io/pypi/l/resource-backed-dask-array.svg?color=green)](https://github.com/tlambert03/resource-backed-dask-array/raw/main/LICENSE)\n[![PyPI](https://img.shields.io/pypi/v/resource-backed-dask-array.svg?color=green)](https://pypi.org/project/resource-backed-dask-array)\n[![Python Version](https://img.shields.io/pypi/pyversions/resource-backed-dask-array.svg?color=green)](https://python.org)\n[![CI](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml/badge.svg)](https://github.com/tlambert03/resource-backed-dask-array/actions/workflows/ci.yml)\n[![codecov](https://codecov.io/gh/tlambert03/resource-backed-dask-array/branch/main/graph/badge.svg)](https://codecov.io/gh/tlambert03/resource-backed-dask-array)\n\n`ResourceBackedDaskArray` is an experimental Dask array subclass\nthat opens/closes a resource when computing (but only once per compute call).\n\n## installation\n\n```sh\npip install resource-backed-dask-array\n```\n\n## motivation for this package\n\nConsider the following class that simulates a file reader capable of returning a\ndask array (using\n[dask.array.map_blocks](https://docs.dask.org/en/latest/generated/dask.array.map_blocks.html))\nThe file handle must be in an *open* state in order to read a chunk, otherwise\nit segfaults (or otherwise errors)\n\n```python\nimport dask.array as da\nimport numpy as np\n\n\nclass FileReader:\n\n def __init__(self):\n self._closed = False\n\n def close(self):\n \"\"\"close the imaginary file\"\"\"\n self._closed = True\n\n @property\n def closed(self):\n return self._closed\n\n def __enter__(self):\n if self.closed:\n self._closed = False # open\n return self\n\n def __exit__(self, *_):\n self.close()\n\n def to_dask(self) -> da.Array:\n \"\"\"Method that returns a dask array for this file.\"\"\"\n return da.map_blocks(\n self._dask_block,\n chunks=((1,) * 4, 4, 4),\n dtype=float,\n )\n\n def _dask_block(self):\n \"\"\"simulate getting a single chunk of the file.\"\"\"\n if self.closed:\n raise RuntimeError(\"Segfault!\")\n return np.random.rand(1, 4, 4)\n```\n\nAs long as the file stays open, everything works fine:\n\n```python\n>>> fr = FileReader()\n>>> dsk_ary = fr.to_dask()\n>>> dsk_ary.compute().shape\n(4, 4, 4)\n```\n\nHowever, if one closes the file, the dask array returned\nfrom `to_dask` will now fail:\n\n```python\n>>> fr.close()\n>>> dsk_ary.compute() # RuntimeError: Segfault!\n```\n\nA \"quick-and-dirty\" solution here might be to force the `_dask_block` method to\ntemporarily reopen the file if it finds the file in the closed state, but if the\nfile-open process takes any amount of time, this could incur significant\noverhead as it opens-and-closes for *every* chunk in the array.\n\n## usage\n\n`ResourceBackedDaskArray.from_array`\n\nThis library attempts to provide a solution to the above problem with a\n`ResourceBackedDaskArray` object. This manages the opening/closing of\nan underlying resource whenever [`.compute()`](https://docs.dask.org/en/stable/generated/dask.array.Array.compute.html#dask.array.Array.compute) is called \u2013 and does so only once for all chunks in a single compute task graph.\n\n```python\n>>> from resource_backed_dask_array import resource_backed_dask_array\n>>> safe_dsk_ary = resource_backed_dask_array(dsk_ary, fr)\n>>> safe_dsk_ary.compute().shape\n(4, 4, 4)\n\n>>> fr.closed # leave it as we found it\nTrue\n```\n\nThe second argument passed to `from_array` must be a [resuable context manager](https://docs.python.org/3/library/contextlib.html#reusable-context-managers)\nthat additionally provides a `closed` attribute (like [io.IOBase](https://docs.python.org/3/library/io.html#io.IOBase.closed)). In other words,\nit must implement the following protocol:\n\n1. it must have an [`__enter__` method](https://docs.python.org/3/reference/datamodel.html#object.__enter__) that opens the underlying resource\n2. it must have an [`__exit__` method](https://docs.python.org/3/reference/datamodel.html#object.__exit__) that closes the resource and optionally handles exceptions\n3. it must have a `closed` attribute that reports whether or not the resource is closed.\n\nIn the example above, the `FileReader` class itself implemented this protocol, and so was suitable as the second argument to `ResourceBackedDaskArray.from_array` above.\n\n## Important Caveats\n\nThis was created for single-process (and maybe just single-threaded?)\nuse cases where dask's out-of-core lazy loading is still very desireable. Usage\nwith `dask.distributed` is untested and may very well fail. Using stateful objects (such as the reusable context manager used here) in multi-threaded/processed tasks is error prone.\n\n\n",
"bugtrack_url": null,
"license": "BSD-3-Clause",
"summary": "experimental Dask array that opens/closes a resource when computing",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/tlambert03/resource-backed-dask-array",
"Source Code": "https://github.com/tlambert03/resource-backed-dask-array"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0db5852f619e53fa7fb70d8915fcae66632df3958cac7e926c4ac38458958674",
"md5": "3da22ac0ac7f3d70d1d87639aa399a46",
"sha256": "ec457fa72d81f0340a67ea6557a5a5919323a11cccc978a950df29fa69fe5679"
},
"downloads": -1,
"filename": "resource_backed_dask_array-0.1.0-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "3da22ac0ac7f3d70d1d87639aa399a46",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.7",
"size": 8044,
"upload_time": "2022-02-18T02:10:05",
"upload_time_iso_8601": "2022-02-18T02:10:05.559297Z",
"url": "https://files.pythonhosted.org/packages/0d/b5/852f619e53fa7fb70d8915fcae66632df3958cac7e926c4ac38458958674/resource_backed_dask_array-0.1.0-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6280b8952048ae1772d33b95dbf7d7107cf364c037cc229a2690fc8fa9ee8e48",
"md5": "dea76680ec59e13b0bc9b3df93bbf65c",
"sha256": "8fabcccf5c7e29059b5badd6786dd7675a258a203c58babf10077d9c90ada54f"
},
"downloads": -1,
"filename": "resource_backed_dask_array-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "dea76680ec59e13b0bc9b3df93bbf65c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 10300,
"upload_time": "2022-02-18T02:10:06",
"upload_time_iso_8601": "2022-02-18T02:10:06.981807Z",
"url": "https://files.pythonhosted.org/packages/62/80/b8952048ae1772d33b95dbf7d7107cf364c037cc229a2690fc8fa9ee8e48/resource_backed_dask_array-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-02-18 02:10:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tlambert03",
"github_project": "resource-backed-dask-array",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "resource-backed-dask-array"
}