# NVIDIA Resiliency Extension
The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.
## Core Components and Capabilities
- **Fault Tolerance**
  - Detection of hung ranks.
  - Restarting training in-job, without the need to reallocate SLURM nodes.
- **In-Process Restarting**
  - Detecting failures and enabling quick recovery.
- **Async Checkpointing**
  - Providing an efficient framework for asynchronous checkpointing.
- **Local Checkpointing**
  - Providing an efficient framework for local checkpointing.
- **Straggler Detection**
  - Monitoring GPU and CPU performance of ranks.
  - Identifying slower ranks that may impede overall training efficiency.
- **PyTorch Lightning Callbacks**
  - Facilitating seamless NVRx integration with PyTorch Lightning.
## Installation
### From sources
- `git clone https://github.com/NVIDIA/nvidia-resiliency-ext`
- `cd nvidia-resiliency-ext`
- `pip install .`
### From PyPI wheel
- `pip install nvidia-resiliency-ext`
### Platform Support
| Category | Supported Versions / Requirements |
|---------------------|----------------------------------------------|
| Architecture | x86_64 |
| Operating System | Ubuntu 22.04 |
| Python Version | >= 3.10, < 3.13 |
| PyTorch Version | 2.3+ |
| CUDA & CUDA Toolkit | 12.5+ |
| NVML Driver | 550 or later |
| NCCL Version | 2.21.5+ |
**Note**: The package is designed to support Python >= 3.10, CUDA >= 11.8, PyTorch >= 2.0, and Ubuntu 20.04. The recommended and tested environment for production, however, is Python >= 3.10, < 3.13, CUDA 12.5+, and Ubuntu 22.04.
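
Before installing, it can help to compare the local environment against the table above. The following is a minimal sketch that uses only the Python standard library. The helper names (`parse_version`, `python_supported`, `installed_version`, `environment_report`) are hypothetical and not part of NVRx, and the script checks only the architecture, Python, and PyTorch rows. It assumes PyTorch is installed under the distribution name `torch`, and it does not check the CUDA, driver, or NCCL versions.

```python
import platform
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    numbers = []
    # Drop any local build tag such as "+cu121" before splitting on dots.
    for piece in text.split("+")[0].split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        if not leading:
            break
        numbers.append(int(leading))
    return tuple(numbers)


def python_supported(version_info):
    """True if the interpreter falls in the recommended range (>= 3.10, < 3.13)."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report():
    """Compare the local environment with part of the Platform Support table."""
    torch_version = installed_version("torch")
    return {
        "arch_ok": platform.machine() == "x86_64",
        "python_ok": python_supported(sys.version_info),
        "torch_ok": torch_version is not None
        and parse_version(torch_version) >= (2, 3),
        "nvrx_installed": installed_version("nvidia-resiliency-ext") is not None,
    }


if __name__ == "__main__":
    for name, ok in environment_report().items():
        print(f"{name}: {'ok' if ok else 'check'}")
```

A `check` result does not necessarily mean the package will fail to run, since the note above lists a wider range of designed-for versions. It only flags a difference from the recommended and tested environment.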
## Usage
For detailed documentation and usage information about each component, please refer to the documentation at https://nvidia.github.io/nvidia-resiliency-ext/.
Raw data
{
"_id": null,
"home_page": null,
"name": "nvidia-resiliency-ext",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": "NVIDIA Corporation",
"author_email": null,
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "NVIDIA Resiliency Package",
"version": "0.2.1",
"project_urls": {
"Repository": "https://github.com/NVIDIA/nvidia-resiliency-ext"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b07f7031954bd994e27827b68dd93ecbd6fc548dea2560d7e28130d00ec1f712",
"md5": "b532e36ce4ab91db709331a5870ba947",
"sha256": "4abbfcb0bf37dd9ceb4cfd8d6ff806f1ace4c8f3a061c0e721e9b25cb7a6de57"
},
"downloads": -1,
"filename": "nvidia_resiliency_ext-0.2.1-cp310-cp310-manylinux_2_31_x86_64.whl",
"has_sig": false,
"md5_digest": "b532e36ce4ab91db709331a5870ba947",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 3431105,
"upload_time": "2025-02-22T00:08:59",
"upload_time_iso_8601": "2025-02-22T00:08:59.577455Z",
"url": "https://files.pythonhosted.org/packages/b0/7f/7031954bd994e27827b68dd93ecbd6fc548dea2560d7e28130d00ec1f712/nvidia_resiliency_ext-0.2.1-cp310-cp310-manylinux_2_31_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "172a53e45e8dd8cce626a1780b22a4dec03c74e5e6b3f7696171a69b96edc4e8",
"md5": "421943b5cb84043f1d43136c7ceea8d1",
"sha256": "62d787b384c983269caa610d8dfb8dcc72f9cbd2220b6345e6b640cd4eee5c1f"
},
"downloads": -1,
"filename": "nvidia_resiliency_ext-0.2.1-cp311-cp311-manylinux_2_31_x86_64.whl",
"has_sig": false,
"md5_digest": "421943b5cb84043f1d43136c7ceea8d1",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.10",
"size": 3431634,
"upload_time": "2025-02-22T00:08:21",
"upload_time_iso_8601": "2025-02-22T00:08:21.756069Z",
"url": "https://files.pythonhosted.org/packages/17/2a/53e45e8dd8cce626a1780b22a4dec03c74e5e6b3f7696171a69b96edc4e8/nvidia_resiliency_ext-0.2.1-cp311-cp311-manylinux_2_31_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a014c6d452278d0eb20a4d272182a053d33afcdf94aedf46baf732411aa36c09",
"md5": "b59738701eeba39fed41f3748660c836",
"sha256": "e03225270522911ab503350cf594117d942f1d048234e3ca9440682522b8dd99"
},
"downloads": -1,
"filename": "nvidia_resiliency_ext-0.2.1-cp312-cp312-manylinux_2_31_x86_64.whl",
"has_sig": false,
"md5_digest": "b59738701eeba39fed41f3748660c836",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.10",
"size": 3430726,
"upload_time": "2025-02-22T00:08:40",
"upload_time_iso_8601": "2025-02-22T00:08:40.493080Z",
"url": "https://files.pythonhosted.org/packages/a0/14/c6d452278d0eb20a4d272182a053d33afcdf94aedf46baf732411aa36c09/nvidia_resiliency_ext-0.2.1-cp312-cp312-manylinux_2_31_x86_64.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-22 00:08:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "NVIDIA",
"github_project": "nvidia-resiliency-ext",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "nvidia-resiliency-ext"
}