torchdatasetutil

Name: torchdatasetutil
Version: 0.1.30
Home page: https://github.com/bhlarson/torchdatasetutil
Summary: Utilities to load and use PyTorch datasets stored in MinIO S3
Upload time: 2023-01-15 01:23:15
Author: Brad Larson
Keywords: python, machine learning, utilities
Requirements: none recorded
# Torch Dataset Utilities

The Python library [torchdatasetutil](https://pypi.org/project/torchdatasetutil/) provides torch [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) classes and utility functions for several imaging datasets, currently including sets of images and annotations from [CVAT](https://github.com/openvinotoolkit/cvat) and the [COCO dataset](https://cocodataset.org/).  torchdatasetutil holds dataset data in S3 object storage, so training and testing can be performed on nodes other than the one storing the dataset, using application-defined credentials.  It uses PyTorch worker threads to prefetch data for efficient GPU or CPU training and inference.

"torchdatasetutils" takes as an input the [pymlutil](https://pypi.org/project/pymlutil/).s3 object to access the object storage.

Two JSON or YAML dictionaries are loaded from the object storage to identify and process the dataset: the dataset description and the class dictionary.  The dataset description is unique to each type of dataset.  The class dictionary is common to all datasets and describes data transformation and data augmentation.
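
Both dictionaries are plain JSON or YAML objects in the bucket.  A minimal sketch of loading one with the `s3.GetDict` call used in the training examples below (the object path here is illustrative, not a library default):

```python
# Load the class dictionary from the dataset bucket defined in s3def.
# 'coco/class_dict.yaml' is a hypothetical object path.
class_dictionary = s3.GetDict(s3def['sets']['dataset']['bucket'], 'coco/class_dict.yaml')
```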

## Library structure
- pymlutil.s3: access to object storage
- [torchdatasetutil](https://pypi.org/project/torchdatasetutil/)
    - [getcoco.getcoco](https://github.com/bhlarson/torchdatasetutil/blob/main/torchdatasetutil/getcoco.py#L25): function to load the [COCO dataset](https://cocodataset.org/) from internet archives into object storage
    - [cocostore](https://github.com/bhlarson/torchdatasetutil/blob/main/torchdatasetutil/cocostore.py)
        - [CocoStore](https://github.com/bhlarson/torchdatasetutil/blob/main/torchdatasetutil/cocostore.py#L17): class providing a python iterator over the coco dataset in object storage
        - [CocoDataset](https://github.com/bhlarson/torchdatasetutil/blob/main/torchdatasetutil/cocostore.py): class implementing the pytorch [Dataset class](https://pytorch.org/docs/stable/data.html#dataset-types) for the CocoStore iterator
    - [imstore](https://github.com/bhlarson/torchdatasetutil/blob/main/torchdatasetutil/imstore.py)

See [torchdatasetutil.ipynb](https://github.com/bhlarson/torchdatasetutil/blob/main/torchdatasetutil.ipynb) for the library interface and usage.

## Class Dictionary
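
As noted above, the class dictionary is common to all datasets and describes the data transformation and data augmentation applied during loading.  The YAML below is a purely illustrative sketch of the kind of information such a dictionary might carry; the actual schema is defined by the library, so treat every key here as hypothetical.

```yaml
# Purely illustrative; these key names are hypothetical, not the library's schema.
classes:
  - name: background
    trainId: 0
  - name: person
    trainId: 1
augment:
  flip: true
```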

## COCO Dataset
To load the COCO dataset, you must have a credentials YAML file identifying the destination S3 location and credentials for the dataset, with the following keys:

```yaml
s3:
- name: store
  type: trainer
  address: <address>:<port>
  access key: <access key>
  secret key: <secret key>
  tls: false
  cert verify: false
  cert path: null
  sets:
    dataset: {"bucket":"imgml","prefix":"data", "dataset_filter":"" }
    trainingset: {"bucket":"imgml","prefix":"training", "dataset_filter":"" }
    model: {"bucket":"imgml","prefix":"model", "dataset_filter":"" }
    test: {"bucket":"imgml","prefix":"test", "dataset_filter":"" }
```

Call torchdatasetutil.getcoco to retrieve the COCO dataset and stage it into object storage:
```cmd
python3 -m torchdatasetutil.getcoco
```

To train with the COCO dataset, first create the dataset loaders:
```python
from tqdm import tqdm
from torchdatasetutil.cocostore import CreateCocoLoaders

# Create dataset loaders (s3, s3def, and args are assumed to be defined by
# the application, e.g. through the pymlutil connection sketched above)
dataset_bucket = s3def['sets']['dataset']['bucket']
if args.dataset == 'coco':
    class_dictionary = s3.GetDict(s3def['sets']['dataset']['bucket'], args.coco_class_dict)
    loaders = CreateCocoLoaders(s3, dataset_bucket,
        class_dict=args.coco_class_dict,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        cuda=args.cuda,
        height=args.height,
        width=args.width,
    )

# Identify training and test loaders
trainloader = next(filter(lambda d: d.get('set') == 'train', loaders), None)
testloader = next(filter(lambda d: d.get('set') == 'test' or d.get('set') == 'val', loaders), None)

# Iterate through the dataset
for i, data in tqdm(enumerate(trainloader['dataloader']),
                    bar_format='{desc:<8.5}{percentage:3.0f}%|{bar:50}{r_bar}',
                    total=trainloader['batches'], desc="Train batches", disable=args.job):

    # Extract dataset data
    inputs, labels, mean, stdev = data

    # Remaining steps (see the training-step sketch below)
```
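
The remaining steps are application specific.  As a hypothetical sketch, a conventional PyTorch training step inside the loop above might look like the following; model, criterion, and optimizer are assumed to be defined by the application and are not provided by torchdatasetutil.

```python
    # Hypothetical loop body: a standard PyTorch training step.
    outputs = model(inputs)            # forward pass
    loss = criterion(outputs, labels)  # compute the training loss
    optimizer.zero_grad()              # clear gradients from the previous step
    loss.backward()                    # backpropagate
    optimizer.step()                   # update model parameters
```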

## Cityscapes Dataset
To download Cityscapes, your Cityscapes credentials must be included in your credentials YAML file with the following structure:

```yaml
cityscapes:
  username: <username>
  password: <password>
```
Call torchdatasetutil.getcityscapes to retrieve the Cityscapes dataset and stage it into object storage:
```cmd
python3 -m torchdatasetutil.getcityscapes
```
To train with the Cityscapes dataset, create the dataset loaders:
```python
# CreateCityscapesLoaders is provided by torchdatasetutil; the import path
# below is an assumption, since the original README does not show it.
from torchdatasetutil.cityscapesstore import CreateCityscapesLoaders

if args.dataset == 'cityscapes':
    class_dictionary = s3.GetDict(s3def['sets']['dataset']['bucket'], args.cityscapes_class_dict)
    loaders = CreateCityscapesLoaders(s3, s3def,
        src=args.cityscapes_data,
        dest=args.dataset_path + '/cityscapes',
        class_dictionary=class_dictionary,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        height=args.height,
        width=args.width,
    )
```
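
The returned loaders presumably share the structure of the COCO loaders above, so the same filtering pattern should apply:

```python
# Assumes Cityscapes loader entries carry the same 'set' key as the COCO loaders.
trainloader = next(filter(lambda d: d.get('set') == 'train', loaders), None)
```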

## ImageNet Dataset
1. Download the data from Kaggle: https://www.kaggle.com/competitions/imagenet-object-localization-challenge/data?select=LOC_sample_submission.csv
1. Extract the archive and move the validation folder data as described in https://discuss.pytorch.org/t/issues-with-dataloader-for-imagenet-should-i-use-datasets-imagefolder-or-datasets-imagenet/115742/7 (a loading sketch follows this list)
1. Archive ILSVRC/Data/CLS-LOC/ as ILSVRC2012_devkit_t12.tar.gz:
    ```cmd
    tar -czvf ILSVRC2012_devkit_t12.tar.gz ILSVRC/Data/CLS-LOC
    ```
1. ImageNet class [directories](https://github.com/HoldenCaulfieldRye/caffe/blob/master/data/ilsvrc12/synset_words.txt) (synset words)
1. ImageNet class [indexes](https://deeplearning.cms.waikato.ac.nz/user-guide/class-maps/IMAGENET/)
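
As a hedged sketch outside torchdatasetutil itself: once the archive is extracted and the validation images are rearranged per step 2, the training folder from the Kaggle archive can be read with torchvision's generic ImageFolder dataset.

```python
# Hedged sketch using torchvision rather than torchdatasetutil; the path is
# the training folder from the Kaggle archive referenced above.
from torchvision import datasets, transforms

train_set = datasets.ImageFolder(
    'ILSVRC/Data/CLS-LOC/train',
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),  # standard ImageNet training crop
        transforms.ToTensor(),              # convert PIL image to a float tensor
    ]))
```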
