zeus-ml

- Name: zeus-ml
- Version: 0.8.1
- Summary: A framework for deep learning energy measurement and optimization.
- Uploaded: 2024-02-25 22:03:34
- Requires Python: >=3.8
- License: Apache 2.0
- Keywords: deep-learning, power, energy, sustainability, mlsys
- Author: Jae-Won Chung <jwnchung@umich.edu>
- Documentation / Homepage: https://ml.energy/zeus
- Repository: https://github.com/ml-energy/zeus

<div align="center">
<picture>
  <source media="(prefers-color-scheme: dark)" srcset="docs/assets/img/logo_dark.svg">
  <source media="(prefers-color-scheme: light)" srcset="docs/assets/img/logo_light.svg">
  <img alt="Zeus logo" width="55%" src="docs/assets/img/logo_dark.svg">
</picture>
<h1>Deep Learning Energy Measurement and Optimization</h1>
</div>

[![NSDI23 paper](https://custom-icon-badges.herokuapp.com/badge/NSDI'23-paper-b31b1b.svg)](https://www.usenix.org/conference/nsdi23/presentation/you)
[![Docker Hub](https://badgen.net/docker/pulls/symbioticlab/zeus?icon=docker&label=Docker%20pulls)](https://hub.docker.com/r/mlenergy/zeus)
[![Slack workspace](https://badgen.net/badge/icon/Join%20workspace/611f69?icon=slack&label=Slack)](https://join.slack.com/t/zeus-ml/shared_invite/zt-1najba5mb-WExy7zoNTyaZZfTlUWoLLg)
[![Homepage build](https://github.com/ml-energy/zeus/actions/workflows/deploy_homepage.yaml/badge.svg)](https://github.com/ml-energy/zeus/actions/workflows/deploy_homepage.yaml)
[![Apache-2.0 License](https://custom-icon-badges.herokuapp.com/github/license/ml-energy/zeus?logo=law)](/LICENSE)

---
**Project News** ⚡ 

- \[2023/12\] The preprint of the Perseus paper is out [here](https://arxiv.org/abs/2312.06902)!
- \[2023/10\] We released Perseus, an energy optimizer for large model training. Get started [here](https://ml.energy/zeus/perseus/)!
- \[2023/09\] We moved to [`ml-energy`](https://github.com/ml-energy)! Stay tuned for exciting new projects!
- \[2023/07\] [`ZeusMonitor`](https://ml.energy/zeus/reference/monitor/#zeus.monitor.ZeusMonitor) was used to profile GPU time and energy consumption for the [ML.ENERGY leaderboard & Colosseum](https://ml.energy/leaderboard).
- \[2023/03\] [Chase](https://symbioticlab.org/publications/files/chase:ccai23/chase-ccai23.pdf), an automatic carbon optimization framework for DNN training, will appear at an ICLR'23 workshop.
- \[2022/11\] [Carbon-Aware Zeus](https://taikai.network/gsf/hackathons/carbonhack22/projects/cl95qxjpa70555701uhg96r0ek6/idea) won the **second overall best solution award** at Carbon Hack 22.
---

Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training.

### Measuring GPU energy

```python
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0,1,2,3])

monitor.begin_window("heavy computation")
# Four GPUs consuming energy like crazy!
measurement = monitor.end_window("heavy computation")

print(f"Energy: {measurement.total_energy} J")
print(f"Time  : {measurement.time} s")
```
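If you measure many code blocks, pairing every `begin_window` with its matching `end_window` by hand is easy to get wrong, especially when the measured code can raise. A minimal sketch, assuming you rely only on the `begin_window(name)` / `end_window(name)` calls shown above, wraps them in a context manager so the window is always closed. The helper `energy_window` and the `WindowResult` holder are hypothetical names, not part of Zeus.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class WindowResult:
    """Hypothetical holder, filled in when the window closes."""
    measurement: Optional[Any] = None


@contextmanager
def energy_window(monitor: Any, name: str) -> Iterator[WindowResult]:
    """Open a measurement window on `monitor` and always close it.

    `monitor` is any object exposing `begin_window(name)` and
    `end_window(name)`, such as the `ZeusMonitor` above (assumption:
    no other methods are needed).
    """
    result = WindowResult()
    monitor.begin_window(name)
    try:
        yield result
    finally:
        # Runs even if the wrapped code raises, so the window never leaks.
        result.measurement = monitor.end_window(name)


# Usage with the monitor from the example above:
# with energy_window(monitor, "heavy computation") as window:
#     ...  # GPU work goes here
# print(f"Energy: {window.measurement.total_energy} J")
```

The design choice here is `try`/`finally`: an exception in the measured block still ends the window before propagating.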

### Finding the optimal GPU power limit

Zeus silently profiles different power limits during training and converges to the optimal one.

```python
from zeus.monitor import ZeusMonitor
from zeus.optimizer import GlobalPowerLimitOptimizer

monitor = ZeusMonitor(gpu_indices=[0,1,2,3])
plo = GlobalPowerLimitOptimizer(monitor)

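# `num_epochs` and `train_dataloader` come from your own training script.
for epoch in range(num_epochs):
    plo.on_epoch_begin()

    for x, y in train_dataloader:
        plo.on_step_begin()
        # Learn from x and y!
        plo.on_step_end()

    plo.on_epoch_end()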
```
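To make "the optimal one" concrete: the Zeus NSDI'23 paper weighs energy against time with a knob η ∈ [0, 1], scoring a configuration with a cost of the form η · Energy + (1 − η) · MAXPOWER · Time. The sketch below applies a per-sample version of that cost to power limits that have already been profiled. It is a simplified illustration, not the `GlobalPowerLimitOptimizer` implementation; the `Profile` record, its fields, and the per-sample normalization are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Profile:
    """Hypothetical profiling result for one GPU power limit."""
    power_limit: float  # W, the limit that was applied
    avg_power: float    # W, average draw measured while training
    throughput: float   # training samples per second


def zeus_style_cost(p: Profile, eta: float, max_power: float) -> float:
    """Cost per sample: eta * energy + (1 - eta) * max_power * time."""
    energy_per_sample = p.avg_power / p.throughput  # J per sample
    time_per_sample = 1.0 / p.throughput            # s per sample
    return eta * energy_per_sample + (1.0 - eta) * max_power * time_per_sample


def pick_power_limit(profiles: Sequence[Profile], eta: float, max_power: float) -> float:
    """Return the profiled power limit with the lowest cost."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must be in [0, 1]")
    best = min(profiles, key=lambda p: zeus_style_cost(p, eta, max_power))
    return best.power_limit
```

With η = 1 the choice minimizes energy alone; with η = 0 it minimizes time alone, since MAXPOWER is a constant scale factor.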

### CLI power and energy monitor

```console
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```
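Energy is the integral of power over time, so a power trace like the one printed above can be turned into an energy estimate by numerical integration. The sketch below applies the trapezoidal rule to `(timestamp_s, power_w)` samples. It illustrates the technique only and is not necessarily how Zeus computes the totals shown; the function name `integrate_energy` is hypothetical.

```python
from typing import Sequence, Tuple


def integrate_energy(samples: Sequence[Tuple[float, float]]) -> float:
    """Estimate energy in joules from (timestamp_s, power_w) samples.

    Trapezoidal rule: each interval contributes
    (t1 - t0) * (p0 + p1) / 2 joules.
    """
    if len(samples) < 2:
        return 0.0
    ordered = sorted(samples)  # tolerate out-of-order samples
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        energy += (t1 - t0) * (p0 + p1) / 2.0
    return energy
```

Sampling roughly once per second, as in the trace above, is coarse; short power spikes between samples are missed, so a higher sampling rate gives a tighter estimate.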

```console
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
```

Please refer to our NSDI’23 [paper](https://www.usenix.org/conference/nsdi23/presentation/you) and [slides](https://www.usenix.org/system/files/nsdi23_slides_chung.pdf) for details.
Check out [Overview](https://ml.energy/zeus/overview/) for a summary.

Zeus is part of [The ML.ENERGY Initiative](https://ml.energy).

## Repository Organization

```
.
├── zeus/                # ⚡ Zeus Python package
│   ├── optimizer/       #    - GPU energy and time optimizers
│   ├── run/             #    - Tools for running Zeus on real training jobs
│   ├── policy/          #    - Optimization policies and extension interfaces
│   ├── util/            #    - Utility functions and classes
│   ├── monitor.py       #    - `ZeusMonitor`: Measure GPU time and energy of any code block
│   ├── controller.py    #    - Tools for controlling the flow of training
│   ├── callback.py      #    - Base class for Hugging Face-like training callbacks.
│   ├── simulate.py      #    - Tools for trace-driven simulation
│   ├── analyze.py       #    - Analysis functions for power logs
│   └── job.py           #    - Class for job specification
│
├── zeus_monitor/        # 🔌 GPU power monitor
│   ├── zemo/            #    -  A header-only library for querying NVML
│   └── main.cpp         #    -  Source code of the power monitor
│
├── examples/            # 🛠️ Examples of integrating Zeus
│
├── capriccio/           # 🌊 A drifting sentiment analysis dataset
│
└── trace/               # 🗃️ Train and power traces for various GPUs and DNNs
```

## Getting Started

Refer to [Getting started](https://ml.energy/zeus/getting_started) for complete instructions on environment setup, installation, and integration.

### Docker image

We provide a Docker image fully equipped with all dependencies and environments.
The only command you need is:

```sh
docker run -it \
    --gpus all                  `# Mount all GPUs` \
    --cap-add SYS_ADMIN         `# Needed to change the power limit of the GPU` \
    --ipc host                  `# PyTorch DataLoader workers need enough shm` \
    mlenergy/zeus:latest \
    bash
```

Refer to [Environment setup](https://ml.energy/zeus/getting_started/environment/) for details.

### Examples

We provide working examples for integrating and running Zeus in the `examples/` directory.


## Extending Zeus

You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.

Refer to [Extending Zeus](https://ml.energy/zeus/extend/) for details.
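As an illustration of the kind of decision a custom batch size policy makes, the sketch below implements Gaussian Thompson sampling over candidate batch sizes, a multi-armed bandit approach in the spirit of the batch size search described in the Zeus paper. The class name, its methods, and the known-noise-variance Gaussian model are assumptions; it does not implement Zeus's actual policy interface, which is documented at the link above.

```python
import math
import random
from typing import Dict, List, Optional, Tuple


class GaussianThompsonBatchSizer:
    """Hypothetical bandit that picks the batch size with the lowest sampled cost.

    Each arm's mean cost has a Gaussian prior N(prior_mean, prior_var);
    observed costs are assumed Gaussian with known variance `noise_var`.
    """

    def __init__(self, batch_sizes: List[int], prior_mean: float = 0.0,
                 prior_var: float = 1e6, noise_var: float = 1.0,
                 seed: Optional[int] = None) -> None:
        self.prior_mean = prior_mean
        self.prior_var = prior_var
        self.noise_var = noise_var
        self.rng = random.Random(seed)
        self.costs: Dict[int, List[float]] = {b: [] for b in batch_sizes}

    def _posterior(self, b: int) -> Tuple[float, float]:
        """Conjugate Gaussian update: return (posterior mean, posterior variance)."""
        obs = self.costs[b]
        precision = 1.0 / self.prior_var + len(obs) / self.noise_var
        mean = (self.prior_mean / self.prior_var + sum(obs) / self.noise_var) / precision
        return mean, 1.0 / precision

    def next_batch_size(self) -> int:
        """Draw one mean cost per arm from its posterior; return the cheapest arm."""
        def draw(b: int) -> float:
            mean, var = self._posterior(b)
            return self.rng.gauss(mean, math.sqrt(var))
        return min(self.costs, key=draw)

    def observe(self, batch_size: int, cost: float) -> None:
        """Record the measured cost (e.g. an energy-time cost) of one run."""
        self.costs[batch_size].append(cost)
```

Untried arms keep a wide prior, so they occasionally draw a low cost and get explored; well-measured expensive arms are then rarely chosen again.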


## Carbon-Aware Zeus

Training DNNs on GPUs consumes a large amount of energy and produces significant carbon emissions. Building on top of Zeus, we introduce *Chase*, a carbon-aware solution. *Chase* dynamically controls the energy consumption of GPUs and adapts to shifts in carbon intensity during DNN training, reducing the carbon footprint with minimal compromise on training performance. To adapt proactively to shifting carbon intensity, Chase uses a lightweight machine learning algorithm to forecast the carbon intensity of the upcoming time frame. For more details on Chase, please refer to our [paper](https://symbioticlab.org/publications/files/chase:ccai23/chase-ccai23.pdf) and the [chase branch](https://github.com/ml-energy/zeus/tree/chase).
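The link between energy and carbon is a unit conversion: emissions (gCO2eq) = energy (kWh) × carbon intensity (gCO2eq/kWh), with 1 kWh = 3.6 × 10^6 J. The sketch below applies it to per-interval GPU energy and grid carbon intensity, which shows why the same energy spent while the grid is cleaner emits less carbon. The function name and the `(energy_j, intensity)` input format are hypothetical and are not part of Chase or Zeus.

```python
from typing import Sequence, Tuple

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def carbon_emissions_g(intervals: Sequence[Tuple[float, float]]) -> float:
    """Sum emissions in gCO2eq over (energy_j, intensity_g_per_kwh) intervals."""
    return sum(energy_j / JOULES_PER_KWH * intensity
               for energy_j, intensity in intervals)
```

For example, 3.6 MJ of GPU energy costs 400 g at 400 gCO2eq/kWh, but splitting it evenly across a 400 and a 100 gCO2eq/kWh interval costs 250 g, which is the kind of saving carbon-aware scheduling targets.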


## Citation

```bibtex
@inproceedings{zeus-nsdi23,
    title     = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
    author    = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
    booktitle = {USENIX NSDI},
    year      = {2023}
}
```

## Contact
Jae-Won Chung (jwnchung@umich.edu)

            
