torchft-nightly

Name: torchft-nightly
Version: 2025.2.4
Upload time: 2025-02-04 11:24:32
Requires Python: >=3.8
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://github.com/user-attachments/assets/ab57f551-7a66-4e5f-a3e6-c4d033a5863d">
    <img width="55%" src="https://github.com/user-attachments/assets/9cd7fef9-cfff-409f-a033-d53811f3a99c" alt="torchft">
  </picture>
</p>

<h3 align="center">
Easy Per Step Fault Tolerance for PyTorch
</h3>

<p align="center">
  | <a href="https://pytorch.org/torchft/"><b>Documentation</b></a>
  | <a href="https://github.com/pytorch-labs/torchft/blob/main/media/fault_tolerance_poster.pdf"><b>Poster</b></a>
  | <a href="https://docs.google.com/document/d/1OZsOsz34gRDSxYXiKkj4WqcD9x0lP9TcsfBeu_SsOY4/edit"><b>Design Doc</b></a>
  |
</p>
<p align="center">
  <a href="https://pypi.org/project/torchft-nightly/"><img alt="PyPI - Version" src="https://img.shields.io/pypi/v/torchft-nightly"></a>
</p>

---

> ⚠️ WARNING: This is an alpha prototype for PyTorch fault tolerance and may have bugs
> or breaking changes, as it is actively under development. We'd love to collaborate,
> and contributions are welcome. Please reach out if you're interested in torchft
> or want to discuss fault tolerance in PyTorch.

This repository implements techniques for per-step fault tolerance so you can
keep training even when errors occur, without interrupting the entire training
job.

This is based on the large-scale training techniques presented at PyTorch
Conference 2024.

[![](./media/fault_tolerance_poster.png)](./media/fault_tolerance_poster.pdf)

## Design

torchft is designed to provide fault tolerance when training with replicated weights, such as in DDP or HSDP (FSDP combined with DDP).

torchft implements a lighthouse server that coordinates across the different
replica groups, plus a per-replica-group manager and fault tolerance library
that can be used in a standard PyTorch training loop.

This allows for membership changes at training-step granularity, which can
greatly improve efficiency by avoiding stop-the-world restarts on errors.
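
Conceptually, each training step first agrees on which replica groups are healthy, runs a normal forward/backward pass, and only commits the optimizer update if no failure was detected during that step. The snippet below is a minimal, torchft-free sketch of that idea; `healthy_replicas` and `step_succeeded` are stand-ins for the lighthouse/manager quorum logic, not torchft's actual API (see the Usage section for the real entry points).

```py
import torch
from torch import nn, optim

model = nn.Linear(2, 3)
optimizer = optim.SGD(model.parameters(), lr=0.1)

def healthy_replicas():
    # Stand-in for the lighthouse/manager quorum; in torchft this is a
    # distributed agreement across replica groups, not a local call.
    return {"replica_0"}

for step in range(3):
    quorum = healthy_replicas()           # 1. establish the quorum for this step
    optimizer.zero_grad()
    loss = model(torch.rand(4, 2)).sum()  # 2. normal forward/backward pass
    loss.backward()
    step_succeeded = len(quorum) > 0      # 3. stand-in failure check for this step
    if step_succeeded:
        optimizer.step()                  # 4. commit the update only on success
```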

![](./media/torchft-overview.png)

## Prerequisites

Before proceeding, ensure you have the following installed:

- Rust (with necessary dependencies)
- `protobuf-compiler` and the corresponding development package for Protobuf.

Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website, as shown in the command below:
```sh
$ curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
```

To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:

```sh
sudo apt install protobuf-compiler libprotobuf-dev
```

or for a Red Hat-based system, run:

```sh
sudo dnf install protobuf-compiler protobuf-devel
```
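
To confirm the prerequisites are available on your PATH, you can check the tool versions (the exact version numbers will vary):

```sh
$ rustc --version
$ cargo --version
$ protoc --version
```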

## Installation

```sh
$ pip install .
```

This uses pyo3 and maturin to build the package, so you'll need maturin installed.
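
If maturin isn't already in your environment, it can be installed from PyPI:

```sh
$ pip install maturin
```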

If the installation command fails to invoke `cargo update` because it cannot fetch the manifest, the cause may be the `proxy`, `proxySSLCert`, and `proxySSLKey` settings in your `.gitconfig` file affecting the `cargo` command. To resolve this, try temporarily removing these fields from your `.gitconfig` before running the installation command.
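
One way to do that, assuming the settings live under the `[http]` section of your global `.gitconfig`, is to back the file up, unset the fields, install, and then restore the original configuration:

```sh
$ cp ~/.gitconfig ~/.gitconfig.bak
$ git config --global --unset http.proxy          # only needed if set
$ git config --global --unset http.proxySSLCert   # only needed if set
$ git config --global --unset http.proxySSLKey    # only needed if set
$ pip install .
$ mv ~/.gitconfig.bak ~/.gitconfig                # restore the original settings
```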

To install in editable mode with the Rust extensions, you can use the normal pip install command:

```sh
$ pip install -e .
```

## Usage

### Lighthouse

The lighthouse is used for fault tolerance across replicated workers (DDP/FSDP)
when using synchronous training.

You can start a lighthouse server by running:

```sh
$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```

### Example Training Loop (DDP)

See [train_ddp.py](./train_ddp.py) for the full example.

Invoke with:

```sh
$ TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
```

train.py:

```py
import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# choose a device for the toy model and data
device = "cuda" if torch.cuda.is_available() else "cpu"

manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,
    state_dict=...,
)

m = nn.Linear(2, 3).to(device)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    batch = torch.rand(2, 2, device=device)

    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    loss.backward()

    optimizer.step()
```
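
The `load_state_dict=...` and `state_dict=...` arguments above are callbacks the manager uses to transfer training state to recovering replicas. Below is a hypothetical sketch of what such callbacks might look like for this model; the exact signatures here are an assumption, so consult [train_ddp.py](./train_ddp.py) and the documentation for the real pattern.

```py
# Hypothetical callbacks (assumed signatures) for transferring training state
# to a recovering replica; see train_ddp.py for the actual implementation.
def state_dict():
    return {
        "model": m.state_dict(),
        "optim": optimizer.state_dict(),
    }

def load_state_dict(state_dict):
    m.load_state_dict(state_dict["model"])
    optimizer.load_state_dict(state_dict["optim"])
```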

### Example Parameter Server

torchft has a fault tolerant parameter server implementation built on its
reconfigurable ProcessGroups. This does not require or use a Lighthouse server.

See [parameter_server_test.py](./torchft/parameter_server_test.py) for an example.

## Contributing

We welcome PRs! See the [CONTRIBUTING](./CONTRIBUTING.md) file.

## License

torchft is BSD 3-Clause licensed. See [LICENSE](./LICENSE) for more details.


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "torchft-nightly",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": null,
    "download_url": null,
    "platform": null,
    "description": "<p align=\"center\">\n  <picture>\n    <source media=\"(prefers-color-scheme: dark)\" srcset=\"https://github.com/user-attachments/assets/ab57f551-7a66-4e5f-a3e6-c4d033a5863d\">\n    <img width=\"55%\" src=\"https://github.com/user-attachments/assets/9cd7fef9-cfff-409f-a033-d53811f3a99c\" alt=\"torchft\">\n  </picture>\n</p>\n\n<h3 align=\"center\">\nEasy Per Step Fault Tolerance for PyTorch\n</h3>\n\n<p align=\"center\">\n  | <a href=\"https://pytorch.org/torchft/\"><b>Documentation</b></a>\n  | <a href=\"https://github.com/pytorch-labs/torchft/blob/main/media/fault_tolerance_poster.pdf\"><b>Poster</b></a>\n  | <a href=\"https://docs.google.com/document/d/1OZsOsz34gRDSxYXiKkj4WqcD9x0lP9TcsfBeu_SsOY4/edit\"><b>Design Doc</b></a>\n  |\n</p>\n<p align=\"center\">\n  <a href=\"https://pypi.org/project/torchft-nightly/\"><img alt=\"PyPI - Version\" src=\"https://img.shields.io/pypi/v/torchft-nightly\"></a>\n</p>\n\n---\n\n> \u26a0\ufe0f WARNING: This is an alpha prototype for PyTorch fault tolerance and may have bugs\n> or breaking changes as this is actively under development. We'd love to collaborate\n> and contributions are welcome. Please reach out if you're interested in torchft\n> or want to discuss fault tolerance in PyTorch\n\nThis repository implements techniques for doing a per-step fault tolerance so\nyou can keep training if errors occur without interrupting the entire training\njob.\n\nThis is based off of the large scale training techniques presented at PyTorch\nConference 2024.\n\n[![](./media/fault_tolerance_poster.png)](./media/fault_tolerance_poster.pdf)\n\n## Design\n\ntorchft is designed to allow for fault tolerance when using training with replicated weights such as in DDP or HSDP (FSDP with DDP).\n\ntorchft implements a lighthouse server that coordinates across the different\nreplica groups and then a per replica group manager and fault tolerance library\nthat can be used in a standard PyTorch training loop.\n\nThis allows for membership changes at the training step granularity which can\ngreatly improve efficiency by avoiding stop the world training on errors.\n\n![](./media/torchft-overview.png)\n\n## Prerequisites\n\nBefore proceeding, ensure you have the following installed:\n\n- Rust (with necessary dependencies)\n- `protobuf-compiler` and the corresponding development package for Protobuf.\n\nNote that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website as shown in the below command:\n```sh\n$ curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh\n```\n\nTo install the required packages on a Debian-based system (such as Ubuntu) using apt, run:\n\n```sh\nsudo apt install protobuf-compiler libprotobuf-dev\n```\n\nor for a Red Hat-based system, run:\n\n```sh\nsudo dnf install protobuf-compiler protobuf-devel\n```\n\n## Installation\n\n```sh\n$ pip install .\n```\n\nThis uses pyo3+maturin to build the package, you'll need maturin installed.\n\nIf the installation command fails to invoke `cargo update` due to an inability to fetch the manifest, it may be caused by the `proxy`, `proxySSLCert`, and `proxySSLKey` settings in your .`gitconfig` file affecting the `cargo` command. 
To resolve this issue, try temporarily removing these fields from your `.gitconfig` before running the installation command.\n\nTo install in editable mode w/ the Rust extensions you can use the normal pip install command:\n\n```sh\n$ pip install -e .\n```\n\n## Usage\n\n### Lighthouse\n\nThe lighthouse is used for fault tolerance across replicated workers (DDP/FSDP)\nwhen using synchronous training.\n\nYou can start a lighthouse server by running:\n\n```sh\n$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000\n```\n\n### Example Training Loop (DDP)\n\nSee [train_ddp.py](./train_ddp.py) for the full example.\n\nInvoke with:\n\n```sh\n$ TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py\n```\n\ntrain.py:\n\n```py\nfrom torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo\n\nmanager = Manager(\n    pg=ProcessGroupGloo(),\n    load_state_dict=...,\n    state_dict=...,\n)\n\nm = nn.Linear(2, 3)\nm = DistributedDataParallel(manager, m)\noptimizer = Optimizer(manager, optim.AdamW(m.parameters()))\n\nfor i in range(1000):\n    batch = torch.rand(2, 2, device=device)\n\n    optimizer.zero_grad()\n\n    out = m(batch)\n    loss = out.sum()\n\n    loss.backward()\n\n    optimizer.step()\n```\n\n### Example Parameter Server\n\ntorchft has a fault tolerant parameter server implementation built on it's\nreconfigurable ProcessGroups. This does not require/use a Lighthouse server.\n\nSee [parameter_server_test.py](./torchft/parameter_server_test.py) for an example.\n\n## Contributing\n\nWe welcome PRs! See the [CONTRIBUTING](./CONTRIBUTING.md) file.\n\n## License\n\ntorchft is BSD 3-Clause licensed. See [LICENSE](./LICENSE) for more details.\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": null,
    "version": "2025.2.4",
    "project_urls": {
        "Documentation": "https://pytorch-labs.github.io/torchft",
        "Issues": "https://github.com/pytorch-labs/torchft/issues",
        "Repository": "https://github.com/pytorch-labs/torchft"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "60b295b4840264ca12a014a33fa5d5453a1ea77cb6027aead0afd41e12f11033",
                "md5": "1a131d25c033baf6b914b9bfdbcbca5c",
                "sha256": "5e798ad56ce2fbeb5bd64454b2abd0c96037e39d4b6469aea8e82cfd1af67d57"
            },
            "downloads": -1,
            "filename": "torchft_nightly-2025.2.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "1a131d25c033baf6b914b9bfdbcbca5c",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.8",
            "size": 2305364,
            "upload_time": "2025-02-04T11:24:32",
            "upload_time_iso_8601": "2025-02-04T11:24:32.749262Z",
            "url": "https://files.pythonhosted.org/packages/60/b2/95b4840264ca12a014a33fa5d5453a1ea77cb6027aead0afd41e12f11033/torchft_nightly-2025.2.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1a4e1e33d836a260bd628a2fb0ae80445a9e1f497b48ccfd88e9425049ec7564",
                "md5": "a930783bac81a9265b0dcf7696c9fc12",
                "sha256": "ef731b520946f995c227ef4b416222a812bc84b9f3a55e2c20d032b3e9cf8576"
            },
            "downloads": -1,
            "filename": "torchft_nightly-2025.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "a930783bac81a9265b0dcf7696c9fc12",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.8",
            "size": 2308171,
            "upload_time": "2025-02-04T11:24:35",
            "upload_time_iso_8601": "2025-02-04T11:24:35.063066Z",
            "url": "https://files.pythonhosted.org/packages/1a/4e/1e33d836a260bd628a2fb0ae80445a9e1f497b48ccfd88e9425049ec7564/torchft_nightly-2025.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "537da5b6963297641c35fc3ad5f44b021d2b19af69ff8ede34fff180a8944329",
                "md5": "767ea6832d4a512d0b6b736c239a849d",
                "sha256": "0b12ae3223abf8146d45f06dee0a3c31ea4ffcbde48f158de8d96a76e16130b0"
            },
            "downloads": -1,
            "filename": "torchft_nightly-2025.2.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "767ea6832d4a512d0b6b736c239a849d",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.8",
            "size": 2309521,
            "upload_time": "2025-02-04T11:24:37",
            "upload_time_iso_8601": "2025-02-04T11:24:37.804471Z",
            "url": "https://files.pythonhosted.org/packages/53/7d/a5b6963297641c35fc3ad5f44b021d2b19af69ff8ede34fff180a8944329/torchft_nightly-2025.2.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9c38e98e085bbdb44e8812eca0adf899a711e325c2728e943878a0539ab88928",
                "md5": "fd8fa3fc36f9fdc2ac5892ee33390f53",
                "sha256": "8e293e8fcfb2289596072f7e122079b9657d9e5c5734f03f077e37c8ce9f4673"
            },
            "downloads": -1,
            "filename": "torchft_nightly-2025.2.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "fd8fa3fc36f9fdc2ac5892ee33390f53",
            "packagetype": "bdist_wheel",
            "python_version": "cp313",
            "requires_python": ">=3.8",
            "size": 2309071,
            "upload_time": "2025-02-04T11:24:40",
            "upload_time_iso_8601": "2025-02-04T11:24:40.987454Z",
            "url": "https://files.pythonhosted.org/packages/9c/38/e98e085bbdb44e8812eca0adf899a711e325c2728e943878a0539ab88928/torchft_nightly-2025.2.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4b88ec54775c0a80813d48438f62e530cc93acc59d1ce6dc57b17d3fa43477c6",
                "md5": "d0a7599b5ae446e742fb8414510c1936",
                "sha256": "b379f60c3ee0305e0f850cd74a69ce23712d9f5ba7407490924f8688ef83c089"
            },
            "downloads": -1,
            "filename": "torchft_nightly-2025.2.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "d0a7599b5ae446e742fb8414510c1936",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": ">=3.8",
            "size": 2306970,
            "upload_time": "2025-02-04T11:24:43",
            "upload_time_iso_8601": "2025-02-04T11:24:43.078012Z",
            "url": "https://files.pythonhosted.org/packages/4b/88/ec54775c0a80813d48438f62e530cc93acc59d1ce6dc57b17d3fa43477c6/torchft_nightly-2025.2.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-04 11:24:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "pytorch-labs",
    "github_project": "torchft",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "torchft-nightly"
}
        