hcai-datasets

Name	hcai-datasets
Version	0.1.14
home_page	https://github.com/hcmlab/nova-server
Summary	!Alpha Version! - This repository contains the backend server for the nova annotation ui (https://github.com/hcmlab/nova)
upload_time	2023-09-12 07:01:32
maintainer
docs_url	None
author	Dominik Schiller
requires_python	>=3.6, <4
license
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

## This repository is deprecated as of version 0.1.14 and will not be developed further. Bugfixes might be released if necessary.

## Description
This repository contains code to make datasets stored on the corpora network drive of the chair accessible from Python.
You can use this project to create or reuse a data loader that is compatible with plain Python code as well as with TensorFlow and PyTorch.
The code can also be used to dynamically create a data loader for a NOVA database, so that you can work with NOVA datasets directly in Python.

Compatible with the [tensorflow dataset api](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).
Pytorch Dataset [is also supported](https://pytorch.org/vision/stable/datasets.html).
 
## Installation Information

For efficient data loading we rely on the [decord](https://github.com/dmlc/decord) library.
Decord is not available as a prebuilt binary for non-x86 architectures.
If you want to install the project on another architecture, you will need to compile decord yourself.
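
Before installing, it can help to check which architecture the interpreter reports. The following is a minimal sketch that uses only the standard library; the set of machine names is an illustrative assumption and not an exhaustive list, and the helper `is_x86` is hypothetical and not part of this package.

```python
import platform

# Machine names commonly reported for x86 CPUs (illustrative, not exhaustive).
X86_MACHINES = {"x86_64", "amd64", "i386", "i686", "x86"}


def is_x86(machine: str) -> bool:
    """Return True if the given machine string names an x86 architecture."""
    return machine.lower() in X86_MACHINES


if __name__ == "__main__":
    machine = platform.machine()
    if is_x86(machine):
        print(f"{machine}: a prebuilt decord binary is likely available.")
    else:
        print(f"{machine}: decord will probably have to be compiled from source.")
```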

## Currently available Datasets

| Dataset       | Status        | Url  |
| :------------- |:-------------:| :-----|
| ckplus        | ✅             | http://www.iainm.com/publications/Lucey2010-The-Extended/paper.pdf |
| affectnet     | ✅             | http://mohammadmahoor.com/affectnet/ |
| faces         | ✅             |    https://faces.mpdl.mpg.de/imeji/ |
| nova_dynamic  | ✅             |    https://github.com/hcmlab/nova |
| audioset      | ❌             | https://research.google.com/audioset/ |
| is2021_ess    | ❌             |    -|
| librispeech   | ❌             |    https://www.openslr.org/12 |


## Architecture

![uml diagram](image/architecture.png)

Dataset implementations are split into two parts.

Data access is handled by a generic Python iterable that implements the `DatasetIterable` interface.
This access class is then extended by an API class, which implements `tfds.core.GeneratorBasedBuilder`.
As a result, the dataset is available through the TensorFlow Datasets API, which enables features
such as local caching.

The iterables themselves can also be used as-is, either with a PyTorch-native `DataLoader` by wrapping them in
the utility class `BridgePyTorch`, or as TensorFlow-native Datasets by passing them to `BridgeTensorflow`.

The benefit of this setup is that a PyTorch application can be served without installing or loading
TensorFlow, and vice versa, since the stack up to the adapters does not involve TensorFlow or PyTorch.
Also, when using TensorFlow, caching can be enabled by going through tfds or skipped by using the plain
TensorFlow Dataset provided by the bridge.
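
The split between a framework-agnostic iterable and a thin adapter can be illustrated in plain Python. This is only a sketch of the pattern: `ToyIterable` and `ToyBatchBridge` are hypothetical names invented for this example and are not the `DatasetIterable`, `BridgePyTorch` or `BridgeTensorflow` classes of this package.

```python
from typing import Any, Dict, Iterable, Iterator, List


class ToyIterable:
    """Hypothetical data access class: plain Python, no tf or torch imports."""

    def __init__(self, n_samples: int) -> None:
        self.n_samples = n_samples

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Each sample is a plain dict, so any framework can consume it.
        for i in range(self.n_samples):
            yield {"index": i, "label": i % 2}


class ToyBatchBridge:
    """Hypothetical adapter: wraps any iterable and groups samples into batches."""

    def __init__(self, iterable: Iterable[Dict[str, Any]], batch_size: int) -> None:
        self.iterable = iterable
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[Dict[str, Any]]]:
        batch: List[Dict[str, Any]] = []
        for sample in self.iterable:
            batch.append(sample)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the last, possibly smaller, batch
            yield batch


if __name__ == "__main__":
    for batch in ToyBatchBridge(ToyIterable(5), batch_size=2):
        print(batch)
```

Because the iterable itself never imports a deep learning framework, only the adapter that is actually used has to be installed.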


### Dynamic Dataset usage with Nova Example 

To use the hcai_datasets repository with NOVA, use the `HcaiNovaDynamicIterable` class from the `hcai_datasets.hcai_nova_dynamic.hcai_nova_dynamic_iterable` module to create an iterator for a specific data configuration.
This readme assumes that you are already familiar with the terminology and the general concept of the NOVA annotation tool / database.
The constructor of the class takes the following arguments as input: 

`db_config_path`: `string` path to a config file with the NOVA database configuration. The config file looks like this:

```
[DB]
ip = 127.0.0.1
port = 37317
user = my_user
password = my_password
```

`db_config_dict`: `dict` dictionary with the NOVA database configuration. Can be used instead of `db_config_path`. If both are specified, `db_config_dict` is used.

`dataset`: `string` the name of the dataset. Same as the entry in the NOVA database.

`nova_data_dir`: `string` the directory to look for data. Same as the directory specified in the NOVA GUI.

`sessions`: `list` list of sessions that should be loaded. Must match the session names in NOVA.

`annotator`: `string` the name of the annotator that labeled the session. Must match the annotator names in NOVA.

`schemes`: `list` list of the annotation schemes to fetch.

`roles`: `list` list of roles for which the annotation should be loaded.

`data_streams`: `list` list of data streams that should be loaded. Must match the stream names in NOVA.

`start`: `string | int | float` start time. Use if only a specific chunk of a session should be retrieved. Can be passed as a string (e.g. '1s' or '1ms'), an int (interpreted as milliseconds) or a float (interpreted as seconds).

`end`: `string | int | float` optional end time. Use if only a specific chunk of a session should be retrieved. Can be passed as a string (e.g. '1s' or '1ms'), an int (interpreted as milliseconds) or a float (interpreted as seconds).

`left_context`: `string | int | float` additional data to pass to the classifier on the left side of the frame. Can be passed as a string (e.g. '1s' or '1ms'), an int (interpreted as milliseconds) or a float (interpreted as seconds).

`right_context`: `string | int | float` additional data to pass to the classifier on the right side of the frame. Can be passed as a string (e.g. '1s' or '1ms'), an int (interpreted as milliseconds) or a float (interpreted as seconds).

`frame_size`: `string | int | float` the frame size to look at. The matching annotation is calculated as a majority vote over all annotations that overlap with the time frame. Can be passed as a string (e.g. '1s' or '1ms'), an int (interpreted as milliseconds) or a float (interpreted as seconds).

`stride`: `string | int | float` how far a frame is moved to calculate the next sample. Equals `frame_size` by default. Can be passed as a string (e.g. '1s' or '1ms'), an int (interpreted as milliseconds) or a float (interpreted as seconds).

`flatten_samples`: `bool` if set to `True`, samples with the same annotation scheme but from different roles will be treated as separate samples. Only `<scheme>` is used for the keys.

`add_rest_class`: `bool` if set to `True`, an additional rest class will be added to the end of the label list.
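
To make the time conventions above concrete, here is a minimal sketch of how string, int and float values could be normalised to milliseconds, how frames could be laid out from `frame_size` and `stride`, and how a majority vote could pick one label per frame. The helper names (`to_milliseconds`, `frame_windows`, `majority_label`) are hypothetical and this is not the package's implementation. In particular, weighting the vote by overlap duration and returning `None` when nothing overlaps are assumptions made for this example.

```python
from collections import Counter
from typing import List, Optional, Tuple, Union

TimeValue = Union[str, int, float]
Annotation = Tuple[int, int, str]  # (from_ms, to_ms, label)


def to_milliseconds(value: TimeValue) -> int:
    """Normalise a time value: str with unit, int = ms, float = seconds."""
    if isinstance(value, bool):
        raise TypeError("bool is not a valid time value")
    if isinstance(value, int):
        return value
    if isinstance(value, float):
        return int(round(value * 1000))
    text = value.strip().lower()
    if text.endswith("ms"):  # check "ms" before "s"
        return int(round(float(text[:-2])))
    if text.endswith("s"):
        return int(round(float(text[:-1]) * 1000))
    raise ValueError(f"unsupported time string: {value!r}")


def frame_windows(start: TimeValue, end: TimeValue, frame_size: TimeValue,
                  stride: Optional[TimeValue] = None) -> List[Tuple[int, int]]:
    """Return (from_ms, to_ms) frames; stride defaults to the frame size."""
    start_ms, end_ms = to_milliseconds(start), to_milliseconds(end)
    size_ms = to_milliseconds(frame_size)
    stride_ms = size_ms if stride is None else to_milliseconds(stride)
    if size_ms <= 0 or stride_ms <= 0:
        raise ValueError("frame_size and stride must be positive")
    windows = []
    t = start_ms
    while t + size_ms <= end_ms:
        windows.append((t, t + size_ms))
        t += stride_ms
    return windows


def majority_label(window: Tuple[int, int],
                   annotations: List[Annotation]) -> Optional[str]:
    """Pick the label with the largest total overlap (assumed weighting)."""
    votes: Counter = Counter()
    w_start, w_end = window
    for a_start, a_end, label in annotations:
        overlap = min(w_end, a_end) - max(w_start, a_start)
        if overlap > 0:
            votes[label] += overlap
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    annotations = [(0, 1200, "happy"), (1200, 3000, "sad")]
    for window in frame_windows("0s", "3000ms", 1.0):
        print(window, majority_label(window, annotations))
```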


```python
from pathlib import Path

from hcai_datasets.hcai_nova_dynamic.hcai_nova_dynamic_iterable import HcaiNovaDynamicIterable

ds_iter = HcaiNovaDynamicIterable(
    db_config_path="./nova_db.cfg",  # used because db_config_dict is None
    db_config_dict=None,
    dataset="affect-net",
    nova_data_dir=Path("./nova/data"),
    sessions=[f"{i}_man_eval" for i in range(8)],  # "0_man_eval" ... "7_man_eval"
    roles=["session"],
    schemes=["emotion_categorical"],
    annotator="gold",
    data_streams=["video"],
    frame_size=0.04,  # float, so interpreted as seconds (40 ms)
    left_context=0,
    right_context=0,
    start="0s",
    end="3000ms",
    flatten_samples=False,
)

# Iterate over the samples of the configured sessions
for sample in ds_iter:
    print(sample)
```
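
If you prefer `db_config_dict` over `db_config_path`, the config file format shown earlier can be read with the standard library `configparser`. This is a sketch under assumptions: the helper `read_db_config` is hypothetical, and the assumption that the dictionary uses the same key names as the `[DB]` section is not confirmed by this readme.

```python
import configparser

# Same content as the example config file above.
EXAMPLE_CFG = """
[DB]
ip = 127.0.0.1
port = 37317
user = my_user
password = my_password
"""


def read_db_config(text: str) -> dict:
    """Parse the [DB] section into a plain dict (key names are an assumption)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    db = parser["DB"]
    return {
        "ip": db["ip"],
        "port": db.getint("port"),  # convert the port to an int
        "user": db["user"],
        "password": db["password"],
    }


if __name__ == "__main__":
    print(read_db_config(EXAMPLE_CFG))
```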

## Pytorch Example

The `BridgePyTorch` module can be used to create a PyTorch `DataLoader` directly from the dataset iterable.

```python
from torch.utils.data import DataLoader
from hcai_dataset_utils.bridge_pytorch import BridgePyTorch
from hcai_datasets.hcai_affectnet.hcai_affectnet_iterable import HcaiAffectnetIterable

ds_iter = HcaiAffectnetIterable(
    dataset_dir="path/to/data_sets/AffectNet",
    split="test"
)
dataloader = DataLoader(BridgePyTorch(ds_iter))

for sample in dataloader:
    print(sample)
```


## Tensorflow Example

The `BridgeTensorflow` module can be used to create a TensorFlow Dataset directly from the dataset iterable.

```python
from hcai_dataset_utils.bridge_tf import BridgeTensorflow
from hcai_datasets.hcai_affectnet.hcai_affectnet_iterable import HcaiAffectnetIterable

ds_iter = HcaiAffectnetIterable(
    dataset_dir="path/to/data_sets/AffectNet",
    split="test"
)

tf_dataset = BridgeTensorflow.make(ds_iter)

for sample in tf_dataset:
    print(sample)
```


## Tensorflow Dataset API (DEPRECATED)

### Example Usage

```python
import os

import tensorflow as tf
import tensorflow_datasets as tfds
from matplotlib import pyplot as plt

# Not referenced below; presumably imported so that the hcai builders are known to tfds
import hcai_datasets


# Preprocessing function: convert the image tensor to a numpy array
def preprocess(x, y):
    img = x.numpy()
    return img, y


# Creating a dataset
ds, ds_info = tfds.load(
    'hcai_example_dataset',
    split='train',
    with_info=True,
    as_supervised=True,
    builder_kwargs={'dataset_dir': os.path.join('path', 'to', 'directory')}
)

# Input output mapping
ds = ds.map(lambda x, y: (tf.py_function(func=preprocess, inp=[x, y], Tout=[tf.float32, tf.int64])))

# Manually iterate over dataset
img, label = next(ds.as_numpy_iterator())

# Visualize (scale pixel values to the range [0, 1])
plt.imshow(img / 255.)
plt.show()
```
