torchft


- Name: torchft
- Version: 0.1.0
- Upload time: 2024-10-13 04:19:09
- Requires Python: >=3.8
- Requirements: none recorded

# torch-ft
Prototype repo for PyTorch fault tolerance.

This implements a lighthouse server that coordinates across the different
replica groups, plus a per-replica-group manager and a fault tolerance library
that can be used in a standard PyTorch training loop.

This allows membership changes at training-step granularity, which can
greatly improve efficiency by avoiding stop-the-world pauses in training when
errors occur.
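
To make step-level membership concrete, here is a toy model in plain Python. It is not torchft code: `run_steps`, the `down` schedule, and the `min_replicas` threshold are hypothetical names chosen for illustration, and the sketch assumes that a step with too few live replica groups is simply skipped rather than restarting the whole job.

```python
from typing import Dict, List, Set


def run_steps(
    num_steps: int,
    replicas: List[str],
    down: Dict[int, Set[str]],
    min_replicas: int = 1,
) -> List[List[str]]:
    """Return the replica groups that take part in each step.

    `down` maps a step index to the replica groups unavailable at that step.
    Membership is recomputed every step, so a failure only removes the
    failed group for the steps where it is actually down.
    """
    history: List[List[str]] = []
    for step in range(num_steps):
        unavailable = down.get(step, set())
        participants = [r for r in replicas if r not in unavailable]
        if len(participants) < min_replicas:
            # Assumption: without enough members the step is skipped.
            participants = []
        history.append(participants)
    return history


# Replica group "b" is down for steps 1 and 2, then rejoins at step 3.
history = run_steps(4, ["a", "b", "c"], {1: {"b"}, 2: {"b"}})
for step, participants in enumerate(history):
    print(step, participants)
```

In this model the other groups keep training while "b" is away, which is the efficiency argument made above.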

## Installation

```sh
$ pip install .
```

This uses pyo3 and maturin to build the package, so you'll need maturin installed.

To install in editable mode with the Rust extensions, use the normal pip editable install command:

```sh
$ pip install -e .
```
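
Before installing from source, it can help to confirm that the build tools are on `PATH`. The sketch below is a hypothetical pre-flight check, not part of torchft. It assumes that `cargo` is needed alongside `maturin`, which the Rust extension suggests but the text above does not state.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumption: both maturin and cargo are required for the build.
    missing = missing_tools(["maturin", "cargo"])
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("Build tools found.")
```

The `which` parameter exists only so the check can be exercised without touching the real `PATH`.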

## Lighthouse

You can start a lighthouse server by running:

```sh
$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000
```
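
If you prefer to start the lighthouse from a Python launcher script, the command above can be assembled with `subprocess`. This is a minimal sketch: `lighthouse_command` and `start_lighthouse` are hypothetical helper names, and only the flags shown in the command above are used.

```python
import os
import subprocess
from typing import List


def lighthouse_command(
    min_replicas: int = 1,
    quorum_tick_ms: int = 100,
    join_timeout_ms: int = 1000,
) -> List[str]:
    """Build the argv for the lighthouse using the flags from the README."""
    return [
        "torchft_lighthouse",
        "--min_replicas", str(min_replicas),
        "--quorum_tick_ms", str(quorum_tick_ms),
        "--join_timeout_ms", str(join_timeout_ms),
    ]


def start_lighthouse(
    min_replicas: int = 1,
    quorum_tick_ms: int = 100,
    join_timeout_ms: int = 1000,
) -> subprocess.Popen:
    """Start the lighthouse as a child process with RUST_BACKTRACE=1 set."""
    env = dict(os.environ, RUST_BACKTRACE="1")
    argv = lighthouse_command(min_replicas, quorum_tick_ms, join_timeout_ms)
    return subprocess.Popen(argv, env=env)
```

Calling `start_lighthouse()` requires `torchft_lighthouse` to be installed and on `PATH`; the returned `Popen` handle can be terminated when training finishes.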

## Example Training Loop

See [train.py](./train.py) for the full example.

Invoke with:

```sh
$ TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
```
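
The command above passes the lighthouse address and the manager port through environment variables. A training script could check them up front, as in the sketch below. `read_torchft_env` is a hypothetical helper; how torchft itself consumes these variables is not described here, so treat this purely as a pre-flight check.

```python
import os
from typing import Mapping, Tuple


def read_torchft_env(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (lighthouse address, manager port) or raise if either is unset."""
    names = ("TORCHFT_LIGHTHOUSE", "TORCHFT_MANAGER_PORT")
    missing = [name for name in names if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["TORCHFT_LIGHTHOUSE"], int(env["TORCHFT_MANAGER_PORT"])


# Example with the values from the command above.
address, port = read_torchft_env(
    {
        "TORCHFT_LIGHTHOUSE": "http://localhost:29510",
        "TORCHFT_MANAGER_PORT": "29512",
    }
)
print(address, port)
```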

train.py:

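```py
import torch
from torch import nn, optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# Assumption: run on CPU; the original snippet leaves `device` undefined.
device = "cpu"

# The state callbacks are elided here; see train.py for the full example.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=...,
    state_dict=...,
)

m = nn.Linear(2, 3).to(device)
# Wrap the model and the optimizer with the torchft versions, which take the manager.
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optim.AdamW(m.parameters()))

for i in range(1000):
    # Random input batch of shape (2, 2), matching the model's 2 input features.
    batch = torch.rand(2, 2, device=device)

    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    loss.backward()

    optimizer.step()
```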
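
The `load_state_dict=...` and `state_dict=...` arguments are elided in the snippet above. One plausible shape for them, sketched below, is a pair of closures over the model and optimizer. This is an assumption: `make_state_callbacks` is a hypothetical helper, and the exact callback signatures `Manager` expects should be checked against train.py. The `_Stub` class is only a stand-in so the sketch runs without PyTorch.

```python
from typing import Any, Callable, Dict, Tuple

State = Dict[str, Any]


def make_state_callbacks(
    model: Any, optimizer: Any
) -> Tuple[Callable[[], State], Callable[[State], None]]:
    """Build (state_dict, load_state_dict) closures over model and optimizer.

    Assumption: both objects expose state_dict() and load_state_dict(),
    as torch.nn.Module and torch.optim.Optimizer do.
    """

    def state_dict() -> State:
        return {"model": model.state_dict(), "optim": optimizer.state_dict()}

    def load_state_dict(state: State) -> None:
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])

    return state_dict, load_state_dict


class _Stub:
    """Stand-in for a model or optimizer, used only to exercise the sketch."""

    def __init__(self, value: int) -> None:
        self.value = value

    def state_dict(self) -> State:
        return {"value": self.value}

    def load_state_dict(self, state: State) -> None:
        self.value = state["value"]


model, optimizer = _Stub(1), _Stub(2)
state_dict, load_state_dict = make_state_callbacks(model, optimizer)
snapshot = state_dict()   # capture the current state
model.value = 10          # simulate further training
load_state_dict(snapshot) # restore the captured state
print(model.value)
```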

## Running Tests / Lint

```sh
$ cargo fmt
$ cargo test
```

## License

Apache 2.0 -- see [LICENSE](./LICENSE) for more details.

Copyright (c) Tristan Rice 2024


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "torchft",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": null,
    "download_url": null,
    "platform": null,
    "description": "# torch-ft\nPrototype repo for PyTorch fault tolerance\n\nThis implements a lighthouse server that coordinates across the different\nreplica groups and then a per replica group manager and fault tolerance library\nthat can be used in a standard PyTorch training loop.\n\nThis allows for membership changes at the training step granularity which can\ngreatly improve efficiency by avoiding stop the world training on errors.\n\n## Installation\n\n```sh\n$ pip install .\n```\n\nThis uses pyo3+maturin to build the package, you'll need maturin installed.\n\nTo install in editable mode w/ the Rust extensions you can use the normal pip install command:\n\n```sh\n$ pip install -e .\n```\n\n## Lighthouse\n\nYou can start a lighthouse server by running:\n\n```sh\n$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000\n```\n\n## Example Training Loop\n\nSee [train.py](./train.py) for the full example.\n\nInvoke with:\n\n```sh\n$ TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py\n```\n\ntrain.py:\n\n```py\nfrom torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo\n\nmanager = Manager(\n    pg=ProcessGroupGloo(), \n    load_state_dict=...,\n    state_dict=...,\n)\n\nm = nn.Linear(2, 3)\nm = DistributedDataParallel(manager, m)\noptimizer = Optimizer(manager, optim.AdamW(m.parameters()))\n\nfor i in range(1000):\n    batch = torch.rand(2, 2, device=device)\n\n    optimizer.zero_grad()\n\n    out = m(batch)\n    loss = out.sum()\n\n    loss.backward()\n\n    optimizer.step()\n```\n\n## Running Tests / Lint\n\n```sh\n$ cargo fmt\n% cargo test\n```\n\n## License\n\nApache 2.0 -- see [LICENSE](./LICENSE) for more details.\n\nCopyright (c) Tristan Rice 2024\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": null,
    "version": "0.1.0",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0c26d5ac7a3b2d4a720ed2585ab0eca477967ce282a448fe6cb38fc57857547e",
                "md5": "462941e7b0d05cab9575462442c51052",
                "sha256": "ad25b21d10c5206124cf5d44aee5b30f6b0585a2e27c0946502198a019f04920"
            },
            "downloads": -1,
            "filename": "torchft-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "462941e7b0d05cab9575462442c51052",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.8",
            "size": 1959434,
            "upload_time": "2024-10-13T04:19:09",
            "upload_time_iso_8601": "2024-10-13T04:19:09.959449Z",
            "url": "https://files.pythonhosted.org/packages/0c/26/d5ac7a3b2d4a720ed2585ab0eca477967ce282a448fe6cb38fc57857547e/torchft-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5f8f3a9d4219740d9b05a4669900856e8d3b0045ebc779531e90fb6e5c2d0302",
                "md5": "a999964c2ae50a72adffd3732794af62",
                "sha256": "ce57ed03819c48824f48892da6fede5f74b133dfd3cd8672f1aba4f5a1982b67"
            },
            "downloads": -1,
            "filename": "torchft-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "a999964c2ae50a72adffd3732794af62",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.8",
            "size": 1959184,
            "upload_time": "2024-10-13T04:19:13",
            "upload_time_iso_8601": "2024-10-13T04:19:13.057693Z",
            "url": "https://files.pythonhosted.org/packages/5f/8f/3a9d4219740d9b05a4669900856e8d3b0045ebc779531e90fb6e5c2d0302/torchft-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d935dadba40bc3d38cc24ed7ba8aaa8ada86ae17ce6237a6c949d591faa98297",
                "md5": "d3d1042ffff688990fa51a2c0ccde466",
                "sha256": "33a87022eb439002c72aae0f30343f1d6ba3d6746efe7e214c9da1ec84f6b545"
            },
            "downloads": -1,
            "filename": "torchft-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "d3d1042ffff688990fa51a2c0ccde466",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.8",
            "size": 1961226,
            "upload_time": "2024-10-13T04:19:15",
            "upload_time_iso_8601": "2024-10-13T04:19:15.540231Z",
            "url": "https://files.pythonhosted.org/packages/d9/35/dadba40bc3d38cc24ed7ba8aaa8ada86ae17ce6237a6c949d591faa98297/torchft-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "31bfc8fd90952e761fb4492c12025311678230165ead1bf4ac86c16b479d3997",
                "md5": "8e5df98f36ab211757cf9aa760f169a1",
                "sha256": "4309cebe34fb5a3a0e9e9aca4fb0aff9136150c6428c6a846ea5f37a194dcff3"
            },
            "downloads": -1,
            "filename": "torchft-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "8e5df98f36ab211757cf9aa760f169a1",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": ">=3.8",
            "size": 1960300,
            "upload_time": "2024-10-13T04:11:45",
            "upload_time_iso_8601": "2024-10-13T04:11:45.986983Z",
            "url": "https://files.pythonhosted.org/packages/31/bf/c8fd90952e761fb4492c12025311678230165ead1bf4ac86c16b479d3997/torchft-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bcc310fcb7822b7080b74b85efcc041bf802ace7b12c5a368410d3c0bd143691",
                "md5": "64a997925ed98d58bf49417dbe0b02a2",
                "sha256": "243626e9c6919b81666accf9a5ef3dd6d3ec56768ec27fb4b4a941d785965933"
            },
            "downloads": -1,
            "filename": "torchft-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "64a997925ed98d58bf49417dbe0b02a2",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": ">=3.8",
            "size": 1959949,
            "upload_time": "2024-10-13T04:19:35",
            "upload_time_iso_8601": "2024-10-13T04:19:35.929360Z",
            "url": "https://files.pythonhosted.org/packages/bc/c3/10fcb7822b7080b74b85efcc041bf802ace7b12c5a368410d3c0bd143691/torchft-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-13 04:19:09",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "torchft"
}
        