nvidia-resiliency-ext

- Name: nvidia-resiliency-ext
- Version: 0.2.0
- Home page: https://github.com/NVIDIA/nvidia-resiliency-ext
- Summary: NVIDIA Resiliency Package
- Upload time: 2024-12-17 01:08:23
- Author: NVIDIA Corporation
- Requires Python: >=3.10
- License: Apache 2.0

# NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.

## Core Components and Capabilities

- **Fault Tolerance**
  - Detection of hung ranks.  
  - Restarting training in-job, without the need to reallocate SLURM nodes.

- **In-Process Restarting**
  - Detecting failures and enabling quick recovery.

- **Async Checkpointing**
  - Providing an efficient framework for asynchronous checkpointing.

- **Local Checkpointing**
  - Providing an efficient framework for local checkpointing.

- **Straggler Detection**
  - Monitoring GPU and CPU performance of ranks.  
  - Identifying slower ranks that may impede overall training efficiency.

- **PyTorch Lightning Callbacks**
  - Facilitating seamless NVRx integration with PyTorch Lightning.
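
To make the straggler-detection idea above concrete, here is a minimal conceptual sketch. It is not the NVRx API: the function name `find_stragglers`, the `threshold` parameter, and the input layout are assumptions made for illustration. It flags ranks whose average step time is well above the median across ranks.

```python
from statistics import median


def find_stragglers(step_times, threshold=1.2):
    """Return ranks whose mean step time exceeds `threshold` x the median rank.

    step_times: dict mapping rank -> list of per-step durations in seconds.
    Hypothetical helper for illustration only, not part of NVRx.
    """
    # Average each rank's recorded step durations.
    means = {rank: sum(times) / len(times) for rank, times in step_times.items()}
    # Use the median across ranks as the baseline, so that a single slow
    # rank does not shift the reference value.
    reference = median(means.values())
    return sorted(rank for rank, mean in means.items() if mean > threshold * reference)


if __name__ == "__main__":
    timings = {
        0: [0.50, 0.52, 0.51],
        1: [0.49, 0.50, 0.51],
        2: [0.80, 0.82, 0.79],  # noticeably slower rank
        3: [0.51, 0.50, 0.52],
    }
    print(find_stragglers(timings))
```

A relative threshold against the median is only one possible scoring choice. A real detector would also need to gather timings across ranks and separate GPU from CPU measurements.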

## Installation

### From sources
- `git clone https://github.com/NVIDIA/nvidia-resiliency-ext`
- `cd nvidia-resiliency-ext`
- `pip install .`


### From PyPI wheel
- `pip install nvidia-resiliency-ext`
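
After either installation route, one way to confirm that the distribution is visible to the current interpreter is to query the package metadata. The sketch below uses only the standard library. The helper name `installed_version` is an assumption for illustration. It looks up the distribution name used on PyPI rather than importing any NVRx module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="nvidia-resiliency-ext"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "nvidia-resiliency-ext is not installed")
```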

### Platform Support

| Category            | Supported Versions / Requirements            |
|---------------------|----------------------------------------------|
| Architecture         | x86_64                                      |
| Operating System     | Ubuntu 22.04                                |
| Python Version       | >= 3.10, < 3.13                             |
| PyTorch Version      | 2.3+                                        |
| CUDA & CUDA Toolkit  | 12.5+                                       |
| NVML Driver          | 550 or later                                |
| NCCL Version         | 2.21.5+                                     |

**Note**: The package is designed to support Python >= 3.10, CUDA >= 11.8, PyTorch >= 2.0, and Ubuntu 20.04. The recommended and tested environment for production is Python >= 3.10, < 3.13, CUDA 12.5+, and Ubuntu 22.04.
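
The requirements above can be checked programmatically before launching a job. The following is a rough sketch, not an official NVRx tool. The helper names (`check_python`, `version_tuple`, `meets_minimum`, `environment_report`) are assumptions for illustration, and the minimums are copied from the Platform Support table. The PyTorch and CUDA checks run only if `torch` is importable.

```python
import re
import sys


def check_python(version_info=sys.version_info):
    """True if the interpreter is in the tested range >= 3.10, < 3.13."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


def version_tuple(text):
    """Parse leading numeric fields, e.g. '2.3.1+cu121' -> (2, 3, 1)."""
    parts = []
    for field in text.split("+")[0].split("."):
        match = re.match(r"\d+", field)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare two version strings after zero-padding to equal length."""
    a, b = version_tuple(found), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def environment_report():
    """Return a dict of pass/fail flags for the checks that can be run."""
    report = {"python": check_python()}
    try:
        import torch
    except ImportError:
        return report  # PyTorch checks are skipped when torch is absent
    report["torch"] = meets_minimum(torch.__version__, "2.3")
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda is not None and meets_minimum(cuda, "12.5")
    return report


if __name__ == "__main__":
    print(environment_report())
```

The sketch does not cover the driver, NCCL, operating system, or architecture rows of the table.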

## Usage

For detailed documentation and usage information about each component, see the `docs` directory of the repository.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/NVIDIA/nvidia-resiliency-ext",
    "name": "nvidia-resiliency-ext",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "NVIDIA Corporation",
    "author_email": null,
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "NVIDIA Resiliency Package",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/NVIDIA/nvidia-resiliency-ext",
        "Repository": "https://github.com/NVIDIA/nvidia-resiliency-ext"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "032a2052295c3689bfcde90a41a12da8524b19d579001e367049da5ed7fe32c5",
                "md5": "ac2b6fd436075b3f316c6d22fc33604c",
                "sha256": "de24af492451a7ef48aa97ba2275434cbc2449dd6278b6d475e32906190eb697"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.2.0-cp310-cp310-manylinux_2_31_x86_64.whl",
            "has_sig": false,
            "md5_digest": "ac2b6fd436075b3f316c6d22fc33604c",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 6591221,
            "upload_time": "2024-12-17T01:08:23",
            "upload_time_iso_8601": "2024-12-17T01:08:23.893986Z",
            "url": "https://files.pythonhosted.org/packages/03/2a/2052295c3689bfcde90a41a12da8524b19d579001e367049da5ed7fe32c5/nvidia_resiliency_ext-0.2.0-cp310-cp310-manylinux_2_31_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b6f8393503d674c3fbdb17027a332d8586c385850b5ba860563794928bcd6c19",
                "md5": "fb5bcdbd7b5e15abf36dfb673d95d84a",
                "sha256": "63084c18e0e17ac4a4a47caa99f79d48e1bedbdf215328075a5766d2bfaa0cea"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.2.0-cp311-cp311-manylinux_2_31_x86_64.whl",
            "has_sig": false,
            "md5_digest": "fb5bcdbd7b5e15abf36dfb673d95d84a",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.10",
            "size": 9753649,
            "upload_time": "2024-12-17T01:08:47",
            "upload_time_iso_8601": "2024-12-17T01:08:47.387427Z",
            "url": "https://files.pythonhosted.org/packages/b6/f8/393503d674c3fbdb17027a332d8586c385850b5ba860563794928bcd6c19/nvidia_resiliency_ext-0.2.0-cp311-cp311-manylinux_2_31_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c113e6fe2f4e20106eec36d1d39e7032122815a4e5c4ed7e1d6d367e60d590f5",
                "md5": "0292f2d47c4eabbace751c8ff88768ca",
                "sha256": "c2ee0e5bc8f3dc5c5ce96888db23c9036eece416b9fb21f00f669eed77324f66"
            },
            "downloads": -1,
            "filename": "nvidia_resiliency_ext-0.2.0-cp312-cp312-manylinux_2_31_x86_64.whl",
            "has_sig": false,
            "md5_digest": "0292f2d47c4eabbace751c8ff88768ca",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 6591222,
            "upload_time": "2024-12-17T01:09:10",
            "upload_time_iso_8601": "2024-12-17T01:09:10.204890Z",
            "url": "https://files.pythonhosted.org/packages/c1/13/e6fe2f4e20106eec36d1d39e7032122815a4e5c4ed7e1d6d367e60d590f5/nvidia_resiliency_ext-0.2.0-cp312-cp312-manylinux_2_31_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-17 01:08:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "NVIDIA",
    "github_project": "nvidia-resiliency-ext",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "nvidia-resiliency-ext"
}
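
The `sha256` values recorded for each wheel above can be used to check a downloaded file. The sketch below is a minimal standard-library example. The helper names `sha256_of` and `matches_recorded_digest` are assumptions for illustration, and the script assumes the cp310 wheel has been saved in the current directory under its listed filename.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_digest(path, expected_sha256):
    """True if the file's SHA-256 digest equals the recorded hex string."""
    return sha256_of(path) == expected_sha256.lower()


if __name__ == "__main__":
    # Filename and digest copied from the cp310 entry in the metadata above.
    wheel = "nvidia_resiliency_ext-0.2.0-cp310-cp310-manylinux_2_31_x86_64.whl"
    expected = "de24af492451a7ef48aa97ba2275434cbc2449dd6278b6d475e32906190eb697"
    if os.path.exists(wheel):
        print(matches_recorded_digest(wheel, expected))
    else:
        print("wheel not found in the current directory")
```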
        