nvidia-resiliency-ext


| Field           | Value                     |
|-----------------|---------------------------|
| Name            | nvidia-resiliency-ext     |
| Version         | 0.4.1                     |
| Summary         | NVIDIA Resiliency Package |
| Upload time     | 2025-07-17 03:53:38       |
| Author          | NVIDIA Corporation        |
| Requires Python | >=3.10                    |
| License         | Apache 2.0                |

# NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads. Users can integrate individual NVRx capabilities into their own infrastructure to maximize AI training productivity at scale. NVRx maximizes goodput by enabling system-wide health checks, detecting faults quickly at runtime, and resuming training automatically. It minimizes lost work by enabling fast, frequent checkpointing.

For detailed documentation and usage information about each component, please refer to https://nvidia.github.io/nvidia-resiliency-ext/.

> ⚠️ NOTE: This project is still experimental and under active development. The code, features, and documentation are evolving rapidly, so expect frequent updates and breaking changes. Contributions are welcome, and we encourage you to watch the repository for updates.

<img src="/docs/source/media/nvrx_core_features.png" alt="Figure highlighting core NVRx features including automatic restart, hierarchical checkpointing, fault detection and health checks" width="950" height="350">


## Core Components and Capabilities

- **[Fault Tolerance](https://github.com/NVIDIA/nvidia-resiliency-ext/blob/main/docs/source/fault_tolerance/index.rst)**
  - Detecting hung ranks.
  - Restarting training in-job, without the need to reallocate SLURM nodes.

- **[In-Process Restarting](https://github.com/NVIDIA/nvidia-resiliency-ext/blob/main/docs/source/inprocess/index.rst)**
  - Detecting failures and enabling quick recovery.

- **[Async Checkpointing](https://github.com/NVIDIA/nvidia-resiliency-ext/blob/main/docs/source/checkpointing/async/index.rst)**
  - Providing an efficient framework for asynchronous checkpointing.

- **[Local Checkpointing](https://github.com/NVIDIA/nvidia-resiliency-ext/blob/main/docs/source/checkpointing/local/index.rst)**
  - Providing an efficient framework for local checkpointing.

- **[Straggler Detection](https://github.com/NVIDIA/nvidia-resiliency-ext/blob/main/docs/source/straggler_det/index.rst)**
  - Monitoring GPU and CPU performance of ranks.  
  - Identifying slower ranks that may impede overall training efficiency.

- **[PyTorch Lightning Callbacks](https://github.com/NVIDIA/nvidia-resiliency-ext/blob/main/docs/source/fault_tolerance/integration/ptl.rst)**
  - Facilitating seamless NVRx integration with PyTorch Lightning.
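
To make the straggler detection idea above concrete, the following is a minimal conceptual sketch and not the NVRx API. It assumes each rank reports one step time in seconds and scores each rank against the median. The function names and the `0.75` threshold are hypothetical and chosen only for illustration; refer to the linked Straggler Detection documentation for the actual interface.

```python
from statistics import median
from typing import Dict, List


def relative_scores(step_times: Dict[int, float]) -> Dict[int, float]:
    """Score each rank as median_time / rank_time.

    A score near 1.0 is typical; a lower score means a slower rank.
    """
    if any(t <= 0 for t in step_times.values()):
        raise ValueError("step times must be positive")
    typical = median(step_times.values())
    return {rank: typical / t for rank, t in step_times.items()}


def find_stragglers(step_times: Dict[int, float], threshold: float = 0.75) -> List[int]:
    """Return the ranks whose relative score falls below the (hypothetical) threshold."""
    scores = relative_scores(step_times)
    return sorted(rank for rank, score in scores.items() if score < threshold)


if __name__ == "__main__":
    # Hypothetical per-rank step times in seconds; rank 3 is noticeably slower.
    timings = {0: 1.00, 1: 1.02, 2: 0.98, 3: 1.60}
    print(find_stragglers(timings))
```

Comparing against the median rather than the mean keeps a single very slow rank from shifting the baseline that it is measured against.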

## Installation

### From sources
- `git clone https://github.com/NVIDIA/nvidia-resiliency-ext`
- `cd nvidia-resiliency-ext`
- `pip install .`


### From PyPI wheel
- `pip install nvidia-resiliency-ext`
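
After either installation route, one way to confirm that the package is visible to the current interpreter is to query the installed distribution metadata. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("nvidia-resiliency-ext")
    if version is None:
        print("nvidia-resiliency-ext is not installed in this environment")
    else:
        print(f"nvidia-resiliency-ext {version} is installed")
```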

### Platform Support

| Category             | Supported Versions / Requirements                                          |
|----------------------|----------------------------------------------------------------------------|
| Architecture         | x86_64, arm64                                                              |
| Operating System     | Ubuntu 22.04, 24.04                                                        |
| Python Version       | >= 3.10, < 3.13                                                            |
| PyTorch Version      | >= 2.3.1 (in-job restart & checkpointing), >= 2.5.1 (in-process restart)   |
| CUDA & CUDA Toolkit  | >= 12.5 (12.8 required for GPU health check)                               |
| NVML Driver          | >= 535 (570 required for GPU health check)                                 |
| NCCL Version         | >= 2.21.5 (in-job restart & checkpointing), >= 2.26.2 (in-process restart) |
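
The Python, architecture, and PyTorch rows of the table can be checked programmatically before installing. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes that `aarch64` (as reported by Linux and used in the wheel filenames) corresponds to the `arm64` entry in the table. It does not check CUDA, driver, or NCCL versions.

```python
import platform
import sys
from importlib import metadata
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.5.1+cu124' into (2, 5, 1), ignoring local and pre-release suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so '2.26.2' is newer than '2.3.1'."""
    return parse_version(installed) >= parse_version(minimum)


def python_supported(version_info=sys.version_info) -> bool:
    """Python must be >= 3.10 and < 3.13 according to the table."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


def architecture_supported(machine: str) -> bool:
    """Assumption: 'aarch64' and 'arm64' name the same architecture."""
    return machine.lower() in {"x86_64", "aarch64", "arm64"}


if __name__ == "__main__":
    print("Python OK:", python_supported())
    print("Architecture OK:", architecture_supported(platform.machine()))
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        print("PyTorch: not installed")
    else:
        print("PyTorch >= 2.3.1:", meets_minimum(torch_version, "2.3.1"))
        print("PyTorch >= 2.5.1:", meets_minimum(torch_version, "2.5.1"))
```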



            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "nvidia-resiliency-ext",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "NVIDIA Corporation",
    "author_email": null,
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "NVIDIA Resiliency Package",
    "version": "0.4.1",
    "project_urls": {
        "Repository": "https://github.com/NVIDIA/nvidia-resiliency-ext"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a78c6547d9fdea9730d4f69a19ca492ccbe221768f8473b82502a78a824acc3d",
                "md5": "5faf10b3e659ee386d58bd82a253d52f",
                "sha256": "cf80599411018ebbf03da64769527dee6b37746b72b8606f919b7999633770b8"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.4.1-cp310-cp310-manylinux_2_31_aarch64.whl",
            "has_sig": false,
            "md5_digest": "5faf10b3e659ee386d58bd82a253d52f",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 442891,
            "upload_time": "2025-07-17T03:53:38",
            "upload_time_iso_8601": "2025-07-17T03:53:38.878402Z",
            "url": "https://files.pythonhosted.org/packages/a7/8c/6547d9fdea9730d4f69a19ca492ccbe221768f8473b82502a78a824acc3d/nvidia_resiliency_ext-0.4.1-cp310-cp310-manylinux_2_31_aarch64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "340d520cab980949ad11bd5291784fea309bcd6654a9c97943a3a87644c1d111",
                "md5": "568badb77069b6ef846f9ccc59c8fcc8",
                "sha256": "0c23e621d598ba436549db83deeb3569c19df0194b89fe6169d62b6ead711be3"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.4.1-cp310-cp310-manylinux_2_31_x86_64.whl",
            "has_sig": false,
            "md5_digest": "568badb77069b6ef846f9ccc59c8fcc8",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 448044,
            "upload_time": "2025-07-17T03:48:30",
            "upload_time_iso_8601": "2025-07-17T03:48:30.851831Z",
            "url": "https://files.pythonhosted.org/packages/34/0d/520cab980949ad11bd5291784fea309bcd6654a9c97943a3a87644c1d111/nvidia_resiliency_ext-0.4.1-cp310-cp310-manylinux_2_31_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "46778cda264b262e2868a4e6ebcddaea112200b1e34b8d5a35a2fe3b4978d137",
                "md5": "cae91b776715d92c36e1c4f5a6f22bd8",
                "sha256": "d8ca454a8b8abef72e0ff0e33914686c263414e8891471c02a9f6af9d2d6b925"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.4.1-cp311-cp311-manylinux_2_31_aarch64.whl",
            "has_sig": false,
            "md5_digest": "cae91b776715d92c36e1c4f5a6f22bd8",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.10",
            "size": 443649,
            "upload_time": "2025-07-17T03:49:16",
            "upload_time_iso_8601": "2025-07-17T03:49:16.183567Z",
            "url": "https://files.pythonhosted.org/packages/46/77/8cda264b262e2868a4e6ebcddaea112200b1e34b8d5a35a2fe3b4978d137/nvidia_resiliency_ext-0.4.1-cp311-cp311-manylinux_2_31_aarch64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3a53029cc7493b5833cb8dfa201f15a1e422e2e1cc6308d34c5b0a90028a73fd",
                "md5": "ef22a5d4fbb5226b7e7ebb2e34efa627",
                "sha256": "dde6034f29350ac6326cdd861ceec641bdd93be0eddbf034739f4cd9452a4dd9"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.4.1-cp311-cp311-manylinux_2_31_x86_64.whl",
            "has_sig": false,
            "md5_digest": "ef22a5d4fbb5226b7e7ebb2e34efa627",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.10",
            "size": 449189,
            "upload_time": "2025-07-17T03:52:15",
            "upload_time_iso_8601": "2025-07-17T03:52:15.240847Z",
            "url": "https://files.pythonhosted.org/packages/3a/53/029cc7493b5833cb8dfa201f15a1e422e2e1cc6308d34c5b0a90028a73fd/nvidia_resiliency_ext-0.4.1-cp311-cp311-manylinux_2_31_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "700538d491962273c7905708762279f440520eb79f3c00b67a023497215ad023",
                "md5": "3e592861c9b7b249d4966a7b7394d743",
                "sha256": "b3bd5f01535574b16d0f38bca6e39afe3806c4a2896eee1b321cd944e00025a7"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.4.1-cp312-cp312-manylinux_2_31_aarch64.whl",
            "has_sig": false,
            "md5_digest": "3e592861c9b7b249d4966a7b7394d743",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 444570,
            "upload_time": "2025-07-17T03:50:58",
            "upload_time_iso_8601": "2025-07-17T03:50:58.877282Z",
            "url": "https://files.pythonhosted.org/packages/70/05/38d491962273c7905708762279f440520eb79f3c00b67a023497215ad023/nvidia_resiliency_ext-0.4.1-cp312-cp312-manylinux_2_31_aarch64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "188b4cb8aa2bbdf3705d3034c3f3dacdadb03b3b7dd3dc7f5200e64663fb477f",
                "md5": "e4f20c7b486c4f5286f62cc276f887b0",
                "sha256": "ca9f8de465af345952bedbea53c90c0e2323d88cfd830ded0e806fad91845c0e"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.4.1-cp312-cp312-manylinux_2_31_x86_64.whl",
            "has_sig": false,
            "md5_digest": "e4f20c7b486c4f5286f62cc276f887b0",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 450280,
            "upload_time": "2025-07-17T03:49:55",
            "upload_time_iso_8601": "2025-07-17T03:49:55.327454Z",
            "url": "https://files.pythonhosted.org/packages/18/8b/4cb8aa2bbdf3705d3034c3f3dacdadb03b3b7dd3dc7f5200e64663fb477f/nvidia_resiliency_ext-0.4.1-cp312-cp312-manylinux_2_31_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-17 03:53:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "NVIDIA",
    "github_project": "nvidia-resiliency-ext",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "nvidia-resiliency-ext"
}
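
The `sha256` digests recorded under `urls` can be used to check a downloaded wheel before installing it. The following is a minimal sketch that assumes the JSON above has been saved to a local file; the function names and file paths are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def expected_digests(raw_metadata: str) -> Dict[str, str]:
    """Map each wheel filename in the metadata to its recorded SHA-256 digest."""
    data = json.loads(raw_metadata)
    return {entry["filename"]: entry["digests"]["sha256"] for entry in data["urls"]}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large wheels are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path: Path, digests: Dict[str, str]) -> bool:
    """Return True only if the file is listed and its SHA-256 matches the record."""
    expected = digests.get(path.name)
    return expected is not None and sha256_of(path) == expected


if __name__ == "__main__":
    # Hypothetical local paths; adjust to wherever the files were saved.
    metadata_path = Path("nvidia-resiliency-ext.json")
    wheel_path = Path("nvidia_resiliency_ext-0.4.1-cp312-cp312-manylinux_2_31_x86_64.whl")
    if metadata_path.exists() and wheel_path.exists():
        digests = expected_digests(metadata_path.read_text())
        print("digest matches:", verify_wheel(wheel_path, digests))
```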
        