# h5mapper

``h5mapper`` is a pythonic ORM-like tool for reading and writing HDF5 data.

It is built on top of `h5py` and lets you define types of **.h5 files as python classes** which you can then easily 
**create from raw sources** (e.g. files, urls...), **serve** (use as ``Dataset`` for a ``Dataloader``), 
or dynamically populate (logs, checkpoints of an experiment).

## Content
- [Installation](#installation)
- [Quickstart](#quickstart)
    - [TypedFile](#typedfile)
    - [Feature](#feature)
- [Examples](#examples)
- [Development](#development)
- [License](#license)

## Installation

### ``pip``

``h5mapper`` is on PyPI; to install it, simply run

```bash
pip install h5mapper
```

### developer install

For playing around with the internals of the package, a good approach is to first

```bash
git clone https://github.com/ktonal/h5mapper.git
```
and then 

```bash
pip install -e h5mapper/
```
which installs the repo in editable mode.

## Quickstart

### TypedFile

``h5m`` assumes that you want to store collections of contiguous arrays in single datasets and that you want several such concatenated datasets in a file.

Thus, ``TypedFile`` allows you to create and read files that maintain a 2-d reference system, where contiguous arrays are stored within features and indexed by their source's id.

Such a file might then look like 
```bash
<Experiment "experiment.h5">
----------------------------------------------------> sources' ids axis
|                   "planes/01.jpeg"  |     "train"
|                                     |
|   data/                             |
|        images/        (32, 32)      |       None
|        labels/        (1, )         |       None
|   logs/                             |
|        loss/           None         |       (10000,)
|        ...
V
features axis
``` 
where the entries correspond to the shapes of arrays or their absence (`None`).

> Note that this is a different approach from storing each file or image in a separate dataset.
> In that case, there would be an `h5py.Dataset` located at `data/images/planes/01.jpeg`, whereas in our
> example, the only dataset is at `data/images/` and one of its regions is indexed by the id `"planes/01.jpeg"`.
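To make this concrete, here is a minimal sketch of what that layout looks like when opened with plain `h5py` (assuming the `experiment.h5` file sketched above exists on disk):

```python
import h5py

# a sketch, assuming the "experiment.h5" layout above:
# all images live in ONE dataset, and ids map to regions inside it
with h5py.File("experiment.h5", "r") as f:
    print(f["data/images"])  # the single concatenated dataset
    # there is no per-source dataset such as
    # f["data/images/planes/01.jpeg"]; the id only selects a region
```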

For interacting with files that follow this particular structure, simply define a class

```python
import h5mapper as h5m

class Experiment(h5m.TypedFile):

    data = h5m.Group(
            # your custom h5m.Feature classes:
            images=Image(),
            labels=DirLabels()
            )
    logs = h5m.Group(
            loss=h5m.Array()
            )
```
#### ``create``, ``add``

Now, create an instance, load data from files through parallel jobs, and add data on the fly:

```python
# create instance from raw sources
exp = Experiment.create("experiment.h5",
        # those are then used as ids :
        sources=["planes/01.jpeg", "planes/02.jpeg"],
        n_workers=8)
...
# add id <-> data on the fly :
exp.logs.add("train", dict(loss=losses_array))
``` 
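Once created, a file can be reopened later by instantiating the class directly, just like ``TypedFile("train.h5", keep_open=True)`` in the ``serve`` example below:

```python
# reopen the file created above; keep_open=True keeps the underlying
# file handle open across reads
exp = Experiment("experiment.h5", keep_open=True)
```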

#### ``get``, ``refs`` and ``__getitem__`` 

There are 3 main options for reading data from a ``TypedFile`` or one of its ``Proxy`` objects:

1/ By their id

```python
>> exp.logs.get("train")
Out: {"loss": np.array([...])}
# which, in this case, is equivalent to 
>> exp.logs["train"]
Out: {"loss": np.array([...])}
# because `exp.logs` is a Group and Groups only support id-based indexing
```

2/ By the index of their ids through their ``refs`` attribute:

```python
>> exp.data.images[exp.data.images.refs[0]].shape
Out: (32, 32)
```
This works because `exp.data.images` is a `Dataset`, and only `Dataset`s have `refs`.

3/ With any ``item`` supported by ``h5py.Dataset``:
```python
>> exp.data.labels[:32]
Out: np.array([0, 0, ....])
```
This only works for `Dataset`s, not for `Group`s.

> Note that, in this last case, you are indexing into the **concatenation of all sub-arrays along their first axis**.

> The same interface is also implemented for ``set(source, data)`` and ``__setitem__``, as sketched below.
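A minimal sketch of the write side, assuming the ``exp`` instance from the examples above (the arrays are hypothetical, just for illustration):

```python
import numpy as np

# hypothetical replacement data
new_losses = np.random.rand(10000)

# write by id, mirroring get(source)
exp.logs.set("train", dict(loss=new_losses))

# write with an h5py-style item, mirroring __getitem__
# (this indexes into the concatenation of all sub-arrays)
exp.data.labels[:32] = np.zeros(32, dtype=exp.data.labels[:32].dtype)
```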

### Feature

``h5m`` exposes a class that helps you configure the behaviour of your ``TypedFile`` classes and the properties of the .h5 they create.

The ``Feature`` class helps you define:
- how sources' ids are loaded into arrays (``feature.load(source)``)
- which types of files are supported
- how the data is stored by ``h5py`` (compression, chunks)
- which extraction parameters need to be stored with the data (e.g. sample rate of audio files)
- custom methods relevant to this kind of data

Once you have defined a `Feature` class, attach it to the class dict of a ``TypedFile``, and that's it!

For example:

```python
import h5mapper as h5m


class MyFeature(h5m.Feature):

    # only sources matching this pattern will be passed to load(...)
    # (note the escaped dot: this matches sources ending in ".special")
    __re__ = r"\.special$"

    # args for the h5py.Dataset
    __ds_kwargs__ = dict(compression='lzf', chunks=(1, 350))

    def __init__(self, my_extraction_param=0):
        self.my_extraction_param = my_extraction_param

    @property
    def attrs(self):
        # those are then written in the h5py.Group.attrs
        return {"p": self.my_extraction_param}

    def load(self, source):
        """your method to get an np.ndarray or a dict thereof
        from a path, an url, whatever sources you have..."""
        data = ...  # produce the array(s) from `source` here
        return data

    def plot(self, data):
        """custom plotting method for this kind of data"""
        # ...

# attach it
class Data(h5m.TypedFile):
    feat = MyFeature(47)

# load sources...
f = Data.create(....)

# read your data through __getitem__ 
batch = f.feat[4:8]

# access your method 
f.feat.plot(batch)

# modify the file through __setitem__
f.feat[4:8] = batch ** 2 
```

For more examples, check out `h5mapper/h5mapper/features.py`.

#### ``serve``

Primarily designed with `pytorch` users in mind, `h5m` plays very nicely with the `Dataset` class:

```python
import torch
import h5mapper as h5m

class MyDS(h5m.TypedFile, torch.utils.data.Dataset):

    x = MyInputFeature(42)
    # (assuming a `labels` Feature is defined here as well)

    def __getitem__(self, item):
        return self.x[item], self.labels[item]

    def __len__(self):
        return len(self.x)

ds = MyDS.create("train.h5", sources, keep_open=True)

dl = torch.utils.data.DataLoader(ds, batch_size=16, num_workers=8, pin_memory=True)
```

`TypedFile` even has a method that takes the Dataloader kwargs and a batch object filled with `BatchItems`, and returns
a Dataloader that will yield such batch objects.

Example:

```python
f = TypedFile("train.h5", keep_open=True)
loader = f.serve(
    # batch object :
    dict(
        x=h5m.Input(key='data/image', getter=h5m.GetId()),
        labels=h5m.Target(key='data/labels', getter=h5m.GetId())
    ),
    # Dataloader kwargs :
    num_workers=8, pin_memory=True, batch_size=32, shuffle=True
)
```  
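The returned loader can then be consumed like any `Dataloader`; a hypothetical sketch, assuming each yielded batch mirrors the dict passed to ``serve``:

```python
# iterate over batches of 32 shuffled samples
for batch in loader:
    x, labels = batch["x"], batch["labels"]
    # ... forward pass, loss, etc.
```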

## Examples

In ``h5mapper/examples`` you'll find, for now:
- a train script with data, checkpoints and logs in `dataset_and_logs.py`
- a script for benchmarking batch-loading times of different options

## Development

`h5mapper` is just getting started and you're welcome to contribute!

You'll find some tests you can run from the root of the repo with a simple
```bash
pytest
```

If you'd like to get involved, just drop us an email: ktonalberlin@gmail.com


## License

`h5mapper` is distributed under the terms of the MIT License. 


            
