Name | safe-gpu |
Version | 2.0.0 |
home_page | None |
Summary | A process-safe acquisition of exclusive GPUs |
upload_time | 2024-11-11 18:49:01 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.6 |
license | MIT License Copyright (c) 2020 BUT Speech@fit Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | gpu, resource allocation |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# safe-gpu
A module for safe acquisition of GPUs in exclusive mode.
Relevant mainly in clusters with a purely declarative GPU resource, such as many versions of SGE.
Features:
* toolkit independence (PyTorch/TensorFlow/pycuda/...): it simply sets `CUDA_VISIBLE_DEVICES` properly
* included support for PyTorch and TensorFlow2 backends, open to others
* acquisition of multiple GPUs
* workaround for machines with a single GPU used for display and computation alike
* open to implementation in different languages
Downsides:
* in order to really prevent the race condition, everyone on your cluster has to use this
## Installation
In addition to manual installation, `safe-gpu` is available on PyPI, so you can simply:
```
pip install safe-gpu
```
Note that `safe-gpu` does not formally depend on any backend, giving you, the user, the freedom to pick one of your liking.
## Usage
Prior to initializing CUDA (which typically happens lazily when you first place something on a GPU), call `claim_gpus`.
```
from safe_gpu import safe_gpu
safe_gpu.claim_gpus()
```
If you want multiple GPUs, pass the desired number to `claim_gpus`:
```
safe_gpu.claim_gpus(nb_gpus)
```
Internally, `claim_gpus()` constructs a `GPUOwner` and stores it in `safe_gpu.gpu_owner`.
If preferred, user code can construct `GPUOwner` itself, but care should be taken to keep it alive until actual data is placed on said GPUs.
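As an illustration, a minimal sketch of direct construction; the exact constructor signature is an assumption based on the `claim_gpus` description above:
```
from safe_gpu import safe_gpu

# Keep a reference alive for as long as the GPUs are needed; if the owner
# gets garbage-collected, the placeholder memory backing the claim is freed.
gpu_owner = safe_gpu.GPUOwner(nb_gpus=2)
```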
### Usage with Horovod
Typical Horovod usage includes starting your script in several processes, one per GPU.
Therefore, only ask for one GPU in each process:
```
safe_gpu.claim_gpus() # 1 GPU is the default, can be omitted
hvd.init()
```
### Common errors
In order to properly set up GPUs for your process, `claim_gpus` really needs to be called before CUDA is initialized.
When CUDA does get initialized, it fixes your logical devices (e.g. PyTorch's `cuda:1`) to actual GPUs in your system.
If `CUDA_VISIBLE_DEVICES` is not set at that moment, CUDA will happily offer your process all of the visible GPUs, including those already occupied.
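To illustrate the mapping, a standalone sketch (independent of `safe-gpu`, assuming a machine where physical GPU 2 is free):
```
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2'  # must happen before CUDA initializes

import torch
x = torch.zeros(1, device='cuda:0')  # logical cuda:0 now refers to physical GPU 2
```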
Most commonly, this issue occurs for users who try to play it safe and check CUDA availability beforehand:
```
if torch.cuda.is_available():  # This already initializes CUDA
    safe_gpu.claim_gpus(nb_gpus)  # So this can fail easily
```
If your workflow mandates on-the-fly checking of GPU availability, instead use:
```
try:
    safe_gpu.claim_gpus(nb_gpus)
except safe_gpu.NvidiasmiError:
    ...
```
Horovod users can be at risk, too:
```
hvd.init()
torch.cuda.set_device(hvd.local_rank()) # This initializes CUDA, too
safe_gpu.claim_gpus() # Thus this is likely to fail
```
See above for the proper solution.
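For completeness, a sketch of the corrected ordering, following the Horovod section above:
```
safe_gpu.claim_gpus()  # claim first, before anything initializes CUDA
hvd.init()
# torch.cuda.set_device(hvd.local_rank()) is no longer needed:
# CUDA_VISIBLE_DEVICES already narrows this process down to its single
# claimed GPU, which is visible as cuda:0
```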
### Other backends
The default implementation uses a PyTorch tensor to claim a GPU.
Additionally, a TensorFlow2 placeholder is provided as `safe_gpu.tensorflow_placeholder`.
If you don't want to or can't use those, provide your own GPU memory allocating function as the `placeholder_fn` parameter of `claim_gpus`.
It has to accept one parameter `device_no`, occupy a (preferably negligible) piece of memory on that device, and return a reference to it.
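For example, a minimal sketch of such a function, mirroring the default PyTorch backend (the name `my_placeholder` is hypothetical; only `placeholder_fn` and `device_no` come from the contract above):
```
import torch

def my_placeholder(device_no):
    # Occupy a tiny piece of memory on the given device and return a
    # reference to it, so the claim persists as long as the owner holds it.
    return torch.zeros(1, device=f'cuda:{device_no}')

safe_gpu.claim_gpus(placeholder_fn=my_placeholder)
```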
Pull requests for other backends are welcome.
### Checking that it works
Together with this package, a small testing script is provided.
It exaggerates the time needed to acquire the GPU after polling nvidia-smi, making the race condition virtually certain to happen.
To run the following example, get to a machine with 3 free GPUs and run two instances of the script in parallel as shown.
You should see in the output that one of them really waited for the faster one to fully acquire the GPU.
This script is not distributed in the pip package, so please download it separately if needed.
```
$ python3 gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 2
GPUOwner1 2020-11-30 14:29:33,315 [INFO] acquiring lock
GPUOwner1 2020-11-30 14:29:33,315 [INFO] lock acquired
GPUOwner2 2020-11-30 14:29:33,361 [INFO] acquiring lock
GPUOwner1 2020-11-30 14:29:34,855 [INFO] Set CUDA_VISIBLE_DEVICES=2
GPUOwner2 2020-11-30 14:29:45,447 [INFO] lock acquired
GPUOwner1 2020-11-30 14:29:45,447 [INFO] lock released
GPUOwner2 2020-11-30 14:29:48,926 [INFO] Set CUDA_VISIBLE_DEVICES=4,5
GPUOwner1 2020-11-30 14:29:54,492 [INFO] Finished
GPUOwner2 2020-11-30 14:30:00,525 [INFO] lock released
GPUOwner2 2020-11-30 14:30:09,571 [INFO] Finished
```
Raw data
{
    "_id": null,
    "home_page": null,
    "name": "safe-gpu",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "GPU, Resource allocation",
    "author": null,
    "author_email": "Karel Bene\u0161 <ibenes@fit.vutbr.cz>",
    "download_url": "https://files.pythonhosted.org/packages/f5/a7/1c63efb2a5592e210df14cac2113a4a4bd0dfb4f4e266fc01def1d1732dd/safe_gpu-2.0.0.tar.gz",
    "platform": null,
    "description": "# safe-gpu\n\nA module for safe acquisition of GPUs in exclusive mode.\nRelevant mainly in clusters with a purely declarative gpu resource, such as many versions of SGE.\n\nFeatures:\n* toolkit independence (PyTorch/TensorFlow/pycuda/...), this just sets `CUDA_VISIBLE_DEVICES` properly\n* included support for PyTorch and TensorFlow2 backends, open to others\n* multiple GPUs acquisition\n* workaround for machines with a single GPU used for display and computation alike\n* open to implementation in different languages\n\nDownsides:\n* in order to really prevent the race condition, everyone on your cluster has to use this\n\n## Instalation\n\nIn addition to manual installation, `safe-gpu` is on PyPi, so you can simply:\n\n```\npip install safe-gpu\n```\n\nNote that `safe-gpu` does not formally depend on any backend, giving you, the user, the freedom to pick one of your liking.\n\n## Usage\nPrior to initializing CUDA (typically happens in lazy fashion when you place something on GPU), call `claim_gpus`.\n\n```\nfrom safe_gpu import safe_gpu\n\nsafe_gpu.claim_gpus()\n```\n\nIf you want multiple GPUs, pass the desired number to `claim_gpus`:\n\n```\nsafe_gpu.claim_gpus(nb_gpus)\n```\n\nInternally, `claim_gpus()` constructs a `GPUOwner` and stores it in `safe_gpu.gpu_owner`.\nIf preferred, user code can construct `GPUOwner` itself, but care should be taken to keep it alive until actual data is placed on said GPUs.\n\n### Usage with Horovod\nTypical Horovod usage includes starting your script in several processes, one per GPU.\nTherefore, only ask for one GPU in each process:\n\n```\nsafe_gpu.claim_gpus() # 1 GPU is the default, can be ommited\nhvd.init()\n```\n\n\n### Common errors\nIn order to properly setup GPUs for your process, `claim_gpus` really needs be called before CUDA is initialized.\nWhen CUDA does get initialized, it fixes your logical devices (e.g. PyTorch `cuda:1` etc.) to actual GPUs in your system.\nIf `CUDA_VISIBLE_DEVICES` are not set at that moment, CUDA will happily offer your process all of the visible GPUs, including those already occupied.\n\nMost commonly, this issue occurs for users who try to play it safe and check CUDA availability beforehand:\n```\nif torch.cuda.is_available(): # This already initializes CUDA\n safe_gpu.claim_gpus(nb_gpus) # So this can fail easily\n```\n\nIf your workflow mandates on-the-fly checking of GPU availability, instead use:\n```\ntry:\n safe_gpu.claim_gpus(nb_gpus)\nexcept safe_gpu.NvidiasmiError:\n ...\n```\n\nAlso, horovod users can be at risk:\n```\nhvd.init()\ntorch.cuda.set_device(hvd.local_rank()) # This initializes CUDA, too\nsafe_gpu.claim_gpus() # Thus this is likely to fail\n```\n\nSee above for proper solution.\n\n\n### Other backends\nThe default implementation uses a PyTorch tensor to claim a GPU.\nAdditionally, a TensorFlow2 placeholder is provided as `safe_gpu.tensorflow_placeholder`.\n\nIf you don't want to / can't use that, provide your own GPU memory allocating function as `claim_gpus`'s parameter `placeholder_fn`.\nIt has to accept one parameter `device_no`, occupy a (preferably negligible) piece of memory on that device, and return a pointer to it.\n\nPull requests for other backends are welcome.\n\n### Checking that it works\nTogether with this package, a small testing script is provided.\nIt exaggerates the time needed to acquire the GPU after polling nvidia-smi, making the race condition technically sure to happen.\n\nTo run the following example, get to a machine with 3 free GPUs and run two instances of the script in parallel as shown.\nYou should see in the output that one of them really waited for the faster one to fully acquire the GPU.\n\nThis script is not distributed along in the pip package, so please download it separately if needed.\n\n```\n$ python3 gpu-acquisitor.py --backend pytorch --id 1 --nb-gpus 1 & python3 gpu-acquisitor.py --backend pytorch --id 2 --nb-gpus 2\nGPUOwner1 2020-11-30 14:29:33,315 [INFO] acquiring lock\nGPUOwner1 2020-11-30 14:29:33,315 [INFO] lock acquired\nGPUOwner2 2020-11-30 14:29:33,361 [INFO] acquiring lock\nGPUOwner1 2020-11-30 14:29:34,855 [INFO] Set CUDA_VISIBLE_DEVICES=2\nGPUOwner2 2020-11-30 14:29:45,447 [INFO] lock acquired\nGPUOwner1 2020-11-30 14:29:45,447 [INFO] lock released\nGPUOwner2 2020-11-30 14:29:48,926 [INFO] Set CUDA_VISIBLE_DEVICES=4,5\nGPUOwner1 2020-11-30 14:29:54,492 [INFO] Finished\nGPUOwner2 2020-11-30 14:30:00,525 [INFO] lock released\nGPUOwner2 2020-11-30 14:30:09,571 [INFO] Finished\n\n```\n",
    "bugtrack_url": null,
    "license": "MIT License Copyright (c) 2020 BUT Speech@fit Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "A process-safe acquisition of exclusive GPUs",
    "version": "2.0.0",
    "project_urls": {
        "repository": "https://github.com/BUTSpeechFIT/safe_gpu"
    },
    "split_keywords": [
        "gpu",
        " resource allocation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7d3986fc27c3c6bf9a713c8e032c952633271f4e6a0b0b6e2a4d73a25b2a8df5",
                "md5": "4e85f91b3a2fea81aa8e3aa672d22a56",
                "sha256": "ac132e23e15a28419f25d127d93525fa8d3331a8838f2a6bfbbdc805807f5ddd"
            },
            "downloads": -1,
            "filename": "safe_gpu-2.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4e85f91b3a2fea81aa8e3aa672d22a56",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 7589,
            "upload_time": "2024-11-11T18:49:00",
            "upload_time_iso_8601": "2024-11-11T18:49:00.084832Z",
            "url": "https://files.pythonhosted.org/packages/7d/39/86fc27c3c6bf9a713c8e032c952633271f4e6a0b0b6e2a4d73a25b2a8df5/safe_gpu-2.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f5a71c63efb2a5592e210df14cac2113a4a4bd0dfb4f4e266fc01def1d1732dd",
                "md5": "09274eda848dbc62a01803b8e8ad0dec",
                "sha256": "8f9676697bcae9ff4491f16526ecc005eb1fb5c2a0eb0cd0416898c04c719942"
            },
            "downloads": -1,
            "filename": "safe_gpu-2.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "09274eda848dbc62a01803b8e8ad0dec",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 6599,
            "upload_time": "2024-11-11T18:49:01",
            "upload_time_iso_8601": "2024-11-11T18:49:01.127084Z",
            "url": "https://files.pythonhosted.org/packages/f5/a7/1c63efb2a5592e210df14cac2113a4a4bd0dfb4f4e266fc01def1d1732dd/safe_gpu-2.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-11 18:49:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "BUTSpeechFIT",
    "github_project": "safe_gpu",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "safe-gpu"
}