# Deep Learning I/O (DLIO) Benchmark
![test status](https://github.com/argonne-lcf/dlio_benchmark/actions/workflows/ci.yml/badge.svg)

This README provides abbreviated documentation of the DLIO code. Please refer to https://dlio-benchmark.readthedocs.io for the full user documentation.

## Overview

DLIO is an I/O benchmark for deep learning, aimed at emulating the I/O behavior of various deep learning applications. The benchmark is delivered as an executable that can be configured for various I/O patterns. Its modular design makes it straightforward to incorporate additional data loaders, data formats, datasets, and configuration parameters. It emulates modern deep learning applications using the Benchmark Runner, Data Generator, Format Handler, and I/O Profiler modules.

## Installation and running DLIO
### Bare metal installation 

```bash
git clone https://github.com/argonne-lcf/dlio_benchmark
cd dlio_benchmark/
pip install .
dlio_benchmark ++workload.workflow.generate_data=True
```

### Bare metal installation with profiler

```bash
git clone https://github.com/argonne-lcf/dlio_benchmark
cd dlio_benchmark/
pip install ".[pydftracer]"
```

## Container
```bash
git clone https://github.com/argonne-lcf/dlio_benchmark
cd dlio_benchmark/
docker build -t dlio .
docker run -t dlio dlio_benchmark ++workload.workflow.generate_data=True
``` 

You can also pull a prebuilt container from Docker Hub (it might not reflect the most recent changes to the code):
```bash
docker pull docker.io/zhenghh04/dlio:latest
docker run -t docker.io/zhenghh04/dlio:latest python ./dlio_benchmark/main.py ++workload.workflow.generate_data=True
```
If you are running on a different architecture, refer to the Dockerfile to build the dlio_benchmark container from scratch.
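For instance, a cross-platform image build might look like the following (a sketch; it assumes Docker with the buildx plugin is available, and the target platform here is illustrative):
```bash
# Build the image for an explicit target platform from the repo's Dockerfile
docker buildx build --platform linux/arm64 -t dlio:arm64 .
```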

One can also run interactively inside the container (note the `-it` flags needed for an interactive shell):
```bash
docker run -it docker.io/zhenghh04/dlio:latest /bin/bash
root@30358dd47935:/workspace/dlio$ python ./dlio_benchmark/main.py ++workload.workflow.generate_data=True
```

## PowerPC
PowerPC requires installation through Anaconda.
```bash
# Set up the required channels
conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

# Create and activate the environment
conda env create --prefix ./dlio_env_ppc --file environment-ppc.yaml --force
conda activate ./dlio_env_ppc
# Install the remaining dependencies
python -m pip install .
```

## Lassen, LLNL
For specific instructions on how to install and run the benchmark on Lassen please refer to: [Install Lassen](https://dlio-benchmark.readthedocs.io/en/latest/instruction_lassen.html)

## Running the benchmark

A DLIO run is split into three phases:
- Generate the synthetic data that DLIO will use
- Run the benchmark using the previously generated data
- Post-process the results to generate a report

The configuration of a workload can be specified through a YAML file. Example YAML files can be found in [dlio_benchmark/configs/workload/](./dlio_benchmark/configs/workload).

One can specify the workload through the ```workload=``` option on the command line. Specific configuration fields can then be overridden following the ```hydra``` framework convention (e.g., ```++workload.framework=tensorflow```).
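For example, the following invocation (illustrative; the override values are arbitrary) selects the unet3d workload and overrides two of its fields from the command line:
```bash
# Select a workload and override individual configuration fields via hydra
dlio_benchmark workload=unet3d ++workload.framework=tensorflow ++workload.train.epochs=1
```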

First, generate the data:
  ```bash
  mpirun -np 8 dlio_benchmark workload=unet3d ++workload.workflow.generate_data=True ++workload.workflow.train=False
  ```
If possible, flush the filesystem caches in order to properly capture device I/O:
  ```bash
  sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
  ```
Then, run the benchmark:
  ```bash
  mpirun -np 8 dlio_benchmark workload=unet3d
  ```
Optionally, run the benchmark with DFTracer enabled:
  ```bash
  export DFTRACER_ENABLE=1
  export DFTRACER_INC_METADATA=1
  mpirun -np 8 dlio_benchmark workload=unet3d
  ```

All the outputs will be stored in the ```hydra_log/unet3d/$DATE-$TIME``` folder. To post-process the data, run:
```bash 
dlio_postprocessor --output-folder hydra_log/unet3d/$DATE-$TIME
```
This will generate a ```DLIO_$model_report.txt``` file in the output folder.
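Because the output folder name embeds the run's date and time, a small shell helper can select the most recent run for post-processing (a sketch; it assumes the default ```hydra_log``` layout shown above):
```bash
# Post-process the most recent unet3d run
latest=$(ls -td hydra_log/unet3d/*/ | head -n 1)
dlio_postprocessor --output-folder "$latest"
```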

## Workload YAML configuration file 
Workload characteristics are specified by a YAML configuration file. Below is an example YAML file for the UNet3D workload, which is used for 3D image segmentation.

```yaml
# contents of unet3d.yaml
model: unet3d

framework: pytorch

workflow:
  generate_data: False
  train: True
  checkpoint: True

dataset: 
  data_folder: data/unet3d/
  format: npz
  num_files_train: 168
  num_samples_per_file: 1
  record_length: 146600628
  record_length_stdev: 68341808
  record_length_resize: 2097152
  
reader: 
  data_loader: pytorch
  batch_size: 4
  read_threads: 4
  file_shuffle: seed
  sample_shuffle: seed

train:
  epochs: 5
  computation_time: 1.3604

checkpoint:
  checkpoint_folder: checkpoints/unet3d
  checkpoint_after_epoch: 5
  epochs_between_checkpoints: 2
  model_size: 499153191
```
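As a rough sanity check on storage requirements, the ```dataset``` section above implies roughly 23 GiB of generated data on average (a back-of-the-envelope sketch; actual sizes vary since ```record_length_stdev``` is nonzero):
```bash
# 168 files x 1 sample/file x ~146600628 bytes/sample (mean record length)
echo $(( 168 * 146600628 )) bytes          # 24628905504 bytes (~24.6 GB)
echo $(( 168 * 146600628 / 1024**3 )) GiB  # prints 22 (GiB, rounded down)
```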

The full list of configuration options can be found at https://argonne-lcf.github.io/dlio_benchmark/config.html.

The YAML file is loaded through hydra (https://hydra.cc/). The default settings are overridden by the configurations loaded from the YAML file. One can further override the configuration on the command line (https://hydra.cc/docs/advanced/override_grammar/basic/).

## Current Limitations and Future Work

* DLIO currently assumes samples are always 2D images, even though one can set the size of each sample through ```--record_length```. We expect the shape of a sample to have minimal impact on the I/O itself, but this has yet to be validated case by case. We plan to add an option for specifying the shape of a sample.

* We assume the data/label pairs are stored in the same file. Storing data and labels in separate files will be supported in the future.

* File format support: we currently support the tfrecord, hdf5, npz, csv, jpg, and jpeg formats. Support for other data formats can be added by extension.

* Data loader support: we support reading datasets using the TensorFlow tf.data data loader, the PyTorch DataLoader, and a set of custom data readers implemented in ./reader. For the TensorFlow tf.data data loader and the PyTorch DataLoader:
  - We have complete support for the tfrecord format in the TensorFlow data loader.
  - For npz, jpg, jpeg, and hdf5, we currently only support the one-sample-per-file case; in other words, each sample is stored in an independent file. Multiple samples per file will be supported in the future.

## How to contribute 
We welcome contributions from the community to the benchmark code, in particular in the following areas:

* Support for new workloads: if you think your workload(s) would be of interest to the public and you would like to provide a YAML file to be included in the repo, please submit an issue.
* Support for new data loaders, such as the DALI loader, the MXNet loader, etc.
* Support for new frameworks, such as MXNet.
* Support for novel file systems or storage, such as AWS S3.
* Support for loading new data formats.

If you would like to contribute, please submit an issue at https://github.com/argonne-lcf/dlio_benchmark/issues and contact the ALCF DLIO team: Huihuo Zheng, huihuo.zheng@anl.gov.

## Citation and Reference
The original CCGrid'21 paper describes the design and implementation of the DLIO code. Please cite this paper if you use DLIO in your research.

```
@inproceedings{devarajan2021dlio,
  title={DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications},
  author={H. Devarajan and H. Zheng and A. Kougkas and X.-H. Sun and V. Vishwanath},
  booktitle={IEEE/ACM International Symposium in Cluster, Cloud, and Internet Computing (CCGrid'21)},
  year={2021},
  pages={81--91},
  publisher={IEEE/ACM}
}
```



We also encourage people to take a look at related work from the MLPerf Storage working group.
```
@article{balmau2022mlperfstorage,
  title={Characterizing I/O in Machine Learning with MLPerf Storage},
  author={O. Balmau},
  journal={ACM SIGMOD Record},
  note={DBrainstorming column},
  year={2022},
  volume={51},
  number={3},
  publisher={ACM}
}
```

## Acknowledgments

This work used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357, and is supported in part by the National Science Foundation under grants OCI-1835764 and CSR-1814872.

## License

Apache 2.0 [LICENSE](./LICENSE)

---------------------------------------
Copyright (c) 2022, UChicago Argonne, LLC
All Rights Reserved

If you have questions about your rights to use or distribute this software, please contact the Argonne Intellectual Property Office at partners@anl.gov.

NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.

            
