nvidia-resiliency-ext


Name: nvidia-resiliency-ext
Version: 0.1.3
Home page: https://github.com/NVIDIA/nvidia-resiliency-ext
Summary: NVIDIA Resiliency Package
Author: NVIDIA Corporation
Requires Python: >=3.10
License: Apache 2.0
Upload time: 2024-10-15 16:48:30
# Nvidia Resiliency Extension

This project combines multiple resiliency-related solutions.
- Fault Tolerance package
- Straggler Detection package
- PyTorch Lightning callbacks


## Installation:

### From sources
- `git clone --recursive <this repo URL>`
- `cd <repo>`
- `pip install .`

Requirements:
- Python >= 3.10
- gcc >= 8.0
- CUDA >= 11.8
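
### From PyPI

Prebuilt Linux x86-64 wheels (Python 3.10-3.12) are published on PyPI, so installing with `pip` should also work; building from source as above remains the reference path.
- `pip install nvidia-resiliency-ext`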

## Fault Tolerance integration guide

This section describes how to integrate the Fault Tolerance callback with a PTL-based workload (e.g., NeMo).

Let's define some terms used in this section:
- `PTL` is PyTorch Lightning
- `Fault Tolerance`, `FT` is the `fault_tolerance` package, included in `nvidia_resiliency_ext`. 
- `FT callback`, `FaultToleranceCallback` is a PTL callback defined in `ptl_resiliency` package, included in `nvidia_resiliency_ext`.
- `ft_launcher` is a launcher tool included in the FT package; it is based on `torchrun`.
- `heartbeat` is a lightweight message sent from a rank to its rank monitor that indicates that a rank is alive.
- `rank monitor` is a special side-process started by `ft_launcher` that monitors heartbeats from its rank.
- `timeouts` are time intervals used by a rank monitor to detect that a rank is not alive. 
    There are two separate timeouts: one for the initial heartbeat and one for subsequent heartbeats.
- `launcher script` is a bash script that invokes `ft_launcher`.

### 0. Use `ft_launcher` to start the workload

`ft_launcher` is similar to `torchrun`, but it additionally starts a rank monitor for each launched rank.  
`ft_launcher` takes the FT configuration from a YAML file (`--fault-tol-cfg-path`) or via CLI args (`--ft-param-...`).  
FT configuration items are described in the `FaultToleranceConfig` docstring.
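
For illustration, a minimal launch could look like the sketch below. The YAML keys, the `train.py` entry point, and the assumption that torchrun-style arguments carry over are illustrative only; the authoritative configuration items are in the `FaultToleranceConfig` docstring.

```
# Sketch only: the YAML keys below are illustrative assumptions,
# see the FaultToleranceConfig docstring for the actual configuration items.
cat > ft_cfg.yaml <<'EOF'
fault_tolerance:
  initial_rank_heartbeat_timeout: 600.0   # assumed field name
  rank_heartbeat_timeout: 180.0           # assumed field name
EOF

# ft_launcher is based on torchrun, so torchrun-style arguments are assumed to work;
# train.py stands in for the workload's entry point.
ft_launcher --nproc-per-node=8 --fault-tol-cfg-path=ft_cfg.yaml train.py
```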

### 1. Add FT callback to the trainer

Add the FT callback to the PTL trainer's callbacks.

```
# Import paths are assumptions, based on the package layout described above.
import pytorch_lightning as pl
from nvidia_resiliency_ext.ptl_resiliency import FaultToleranceCallback

fault_tol_cb = FaultToleranceCallback(
    autoresume=True,
    calculate_timeouts=True,
    logger_name="test_logger",
    exp_dir=tmp_path,
)

trainer = pl.Trainer(
    ...
    callbacks=[..., fault_tol_cb],
)
```


Core FT callback functionality is:
- Establishing a connection with a rank monitor
- Sending heartbeats during training and evaluation steps
- Disconnecting from a rank monitor

Optionally, it can also:
- Compute timeouts that will be used instead of timeouts defined in the FT config
- Create a flag file when the training is completed

FT callback initialization params:
```
def __init__(
    self,
    autoresume: bool,
    calculate_timeouts: bool,
    simulated_fault_params: Optional[Any] = None,
    exp_dir: Union[str, pathlib.Path, None] = None,
    logger_name: Optional[str] = "nemo_logger.FaultToleranceCallback",
):
    """
    Initialize callback instance.

    This is a lightweight initialization. Most of the initialization is conducted in the 'setup' hook.

    Args:
        autoresume (bool): Set to `True` if the FT auto-resume feature is used (e.g., there are multiple training jobs to be run).
        calculate_timeouts (bool): Set to `True` if FT timeouts should be calculated based on observed heartbeat intervals.
            Calculated timeouts overwrite the timeouts from the FT config.
            Timeouts are computed at the end of a training job, if there was checkpoint loading and saving.
            For example, for training started from scratch, the timeouts are computed at the end of the second job.
        simulated_fault_params (Optional[Any], optional): Simulated fault spec. It's for debugging only. Defaults to None.
        exp_dir (Union[str, pathlib.Path, None], optional): Directory where the FT state should be saved.
            Must be available for all training jobs. NOTE: Beware that PTL/NeMo can move files written directly to `trainer.log_dir`.
            Defaults to None, in which case it defaults to `trainer.log_dir/ft_state/`.
        logger_name (Optional[str], optional): Logger name to be used.
            Defaults to "nemo_logger.FaultToleranceCallback".
    """
```             

### 2. Implementing auto-resume

Auto-resume is a feature that simplifies running training that consists of multiple subsequent training jobs.

NOTE: Auto-resume is not a part of the FT package. It is entirely implemented in a launcher script and the `FaultToleranceCallback`. 

`FaultToleranceCallback` exposes an "interface" that allows implementing an auto-resume launcher script.  
Specifically, if `autoresume=True`, the FT callback creates a special marker file when training is completed.  
The marker file location is expected to be set in the `FAULT_TOL_FINISHED_FLAG_FILE` environment variable.

The following mechanism can be used to implement an auto-resuming launcher script (a minimal sketch follows the list):
- The launcher script starts the ranks with `ft_launcher`.
- The `FAULT_TOL_FINISHED_FLAG_FILE` environment variable should be passed to the rank processes.
- When `ft_launcher` exits, the launcher script checks whether the `FAULT_TOL_FINISHED_FLAG_FILE` file was created.
    - If `FAULT_TOL_FINISHED_FLAG_FILE` exists, the auto-resume loop can be broken, as the training is completed.
    - If `FAULT_TOL_FINISHED_FLAG_FILE` does not exist, a continuation job can be issued
        (other conditions can also be checked, e.g. whether the maximum number of failures has been reached).
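
In the sketch below, the flag file path, the restart limit, and the training command are assumptions; scheduler-specific resubmission (e.g. SLURM) is omitted.

```
#!/bin/bash
# Sketch only: paths, restart limit, and the training command are assumptions.
export FAULT_TOL_FINISHED_FLAG_FILE=/shared/exp_dir/finished.flag
MAX_JOBS=3

for i in $(seq 1 "${MAX_JOBS}"); do
    # FAULT_TOL_FINISHED_FLAG_FILE is exported, so the rank processes inherit it.
    ft_launcher --fault-tol-cfg-path=ft_cfg.yaml train.py

    if [ -f "${FAULT_TOL_FINISHED_FLAG_FILE}" ]; then
        echo "Finished flag found, training is completed."
        break
    fi
    echo "Finished flag not found after job ${i}, issuing a continuation job."
done
```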

## Straggler Detection integration guide

### Include `ptl_resiliency.StragglerDetectionCallback` in the PTL trainer's callbacks

```
# Import path is an assumption, based on the package layout described above.
from nvidia_resiliency_ext.ptl_resiliency import StragglerDetectionCallback

straggler_cb_args = dict(
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_log=3,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    logger_name="test_logger",
)

straggler_det_cb = StragglerDetectionCallback(**straggler_cb_args)

trainer = pl.Trainer(
    ...
    callbacks=[..., straggler_det_cb],
)
```

`StragglerDetectionCallback` initialization params:

```
def __init__(
    self,
    report_time_interval: float,
    calc_relative_gpu_perf: bool,
    calc_individual_gpu_perf: bool,
    num_gpu_perf_scores_to_log: int,
    gpu_relative_perf_threshold: float,
    gpu_individual_perf_threshold: float,
    stop_if_detected: bool,
    logger_name: Optional[str] = "nemo_logger.StragglerDetectionCallback",
):
    """
    Initialize straggler detection callback instance.

    Args:
        report_time_interval (float): Interval [seconds] of the straggler check
        calc_relative_gpu_perf (bool): Calculate relative GPU performance
        calc_individual_gpu_perf (bool): Calculate individual GPU performance
        num_gpu_perf_scores_to_log (int): How many best and worst scores to log (0 means do not log periodically; log only if stragglers are detected)
        gpu_relative_perf_threshold (float): Threshold for relative GPU performance scores
        gpu_individual_perf_threshold (float): Threshold for individual GPU performance scores
        stop_if_detected (bool): Set to True to terminate the workload if stragglers are detected
        logger_name (Optional[str], optional): Defaults to "nemo_logger.StragglerDetectionCallback".

    Raises:
        ValueError: If invalid config was provided.
    """
```

More info on straggler detection can be found in the straggler package's README.


            
