tractorun

Name	tractorun
Version	0.59.0
Summary	Run distributed training in TractoAI
upload_time	2025-02-17 13:23:35
author	TractoAI team
requires_python	>=3.10

![img.png](https://raw.githubusercontent.com/tractoai/tractorun/refs/heads/main/docs/_static/img.png)

# 🚜 Tractorun

`Tractorun` is a powerful tool for distributed ML operations on the [Tracto.ai](https://tracto.ai/) platform. It helps manage and run workflows across multiple nodes with minimal changes in the user's code:
* Training and fine-tuning models. Use Tractorun to train models across multiple compute nodes efficiently.
* Offline batch inference. Perform fast and scalable model inference.
* Running arbitrary GPU operations, ideal for any computational tasks that require distributed GPU resources.

## How it works

Built on top of [Tracto.ai](https://tracto.ai/), `Tractorun` coordinates distributed machine learning tasks. It has out-of-the-box integrations with PyTorch and Jax, and it can be used with any other training or inference framework.

Key advantages:
* No need to manage your cloud infrastructure, such as configuring a Kubernetes cluster or managing GPU and InfiniBand drivers. Tracto.ai solves these infrastructure problems for you.
* No need to coordinate distributed processes. Tractorun handles it based on the training configuration: the number of nodes and GPUs used.

Key features:
* Simple distributed task setup: just specify the number of nodes and GPUs.
* Convenient ways to run and configure: CLI, YAML config, and Python SDK.
* A range of powerful capabilities, including [sidecars](https://github.com/tractoai/tractorun/blob/main/docs/options.md#sidecar) for auxiliary tasks and transparent [mounting](https://github.com/tractoai/tractorun/blob/main/docs/options.md#bind-local) of local files directly into distributed operations.
* Integration with the Tracto.ai platform: use datasets and checkpoints stored in Tracto.ai storage, and build pipelines with Tractorun, MapReduce, ClickHouse, Spark, and more.

# Getting started

To use these examples, you'll need a Tracto account. If you don't have one yet, please sign up at [tracto.ai](https://tracto.ai/).

Install `tractorun` into your Python 3 environment:

`pip install --upgrade tractorun`

Configure the client to work with your cluster:
```shell
# Create the config directory, then write the client config file inside it.
mkdir -p ~/.yt
cat <<EOF > ~/.yt/config
"proxy"={
  "url"="$YT_PROXY";
};
"token"="$YT_TOKEN";
EOF
```

Replace `$YT_PROXY` with your actual Tracto.ai cluster address and `$YT_TOKEN` with your token, or export both variables before running the commands above.
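
The same configuration can also be generated from Python, which may be convenient in notebooks or provisioning scripts. The sketch below is an illustration only and is not part of `tractorun`: the helper names `render_yt_config` and `write_yt_config` are hypothetical, and the usual target path `~/.yt/config` is an assumption about where the client looks for its configuration.

```python
from pathlib import Path


def render_yt_config(proxy_url: str, token: str) -> str:
    """Render the client config in the same format as the shell heredoc."""
    return (
        '"proxy"={\n'
        f'  "url"="{proxy_url}";\n'
        '};\n'
        f'"token"="{token}";\n'
    )


def write_yt_config(path: Path, proxy_url: str, token: str) -> Path:
    """Write the config to `path`, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_yt_config(proxy_url, token))
    path.chmod(0o600)  # the file holds a token, so keep it private
    return path
```

A typical call would be `write_yt_config(Path.home() / ".yt" / "config", os.environ["YT_PROXY"], os.environ["YT_TOKEN"])`, assuming both variables are set.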

# How to try

Run an example script:

```shell
tractorun \
    --yt-path "//tmp/$USER/tractorun_getting_started" \
    --bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
    --bind-local-lib ./tractorun \
    --docker-image ghcr.io/tractoai/tractorun-examples-runtime:2025-02-10-16-14-27 \
    python3 /lightning_mnist_ddp_script.py
```

# How to run

## CLI

`tractorun --help`

or with a YAML config:

`tractorun --run-config-path config.yaml`

Relevant examples:
* CLI arguments [example](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist_ddp_script).
* YAML config [example](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist_ddp_script_config).

## Python SDK

The SDK is convenient to use from Jupyter notebooks during development.

You can find a relevant example in [the repository](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist).

WARNING: to use the SDK, the local environment must match the remote Docker image on the Tracto.ai platform.
* This requirement is met in Jupyter Notebook on the Tracto.ai platform.
* For local use, it is recommended to run the code in the same container as the one specified in the `docker_image` parameter of `tractorun`.

# How to adapt code for tractorun

## CLI

1. Wrap all training/inference code in a function.
2. Initialize the environment and the Toolbox with `tractorun.run.prepare_and_get_toolbox`.

An example of adapting the MNIST training script from the [PyTorch repository](https://github.com/pytorch/examples/blob/cdef4d43fb1a2c6c4349daa5080e4e8731c34569/mnist/mnist_simple/main.py): https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli

## SDK

1. Wrap all training/inference code in a function with a `toolbox: tractorun.toolbox.Toolbox` parameter.
2. Run this function with `tractorun.run.run`.

An example of adapting the MNIST training script from the [PyTorch repository](https://github.com/pytorch/examples/blob/cdef4d43fb1a2c6c4349daa5080e4e8731c34569/mnist/main.py): https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk

# Features

## Toolbox

`tractorun.toolbox.Toolbox` provides extra integrations with the Tracto.ai platform:
* A preconfigured client via `toolbox.yt_client`
* Basic checkpoints via `toolbox.checkpoint_manager`
* Control over the operation description in the UI via `toolbox.description_manager`
* Access to coordination information via `toolbox.coordinator`

The [Toolbox page](https://github.com/tractoai/tractorun/blob/main/docs/toolbox.md) provides an overview of all available toolbox components.

## Coordination

Tractorun always sets the following environment variables in each process:
* `MASTER_ADDR` - the address of the master node
* `MASTER_PORT` - the port of the master node
* `WORLD_SIZE` - the total number of processes
* `NODE_RANK` - the unique id of the current node (job in terms of Tracto.ai)
* `LOCAL_RANK` - the unique id of the current process on the current node
* `RANK` - the unique id of the current process across all nodes
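
Inside a training script, these variables can be read directly from the environment. The snippet below is a minimal sketch using only the standard library; the `Coordination` dataclass and `read_coordination` helper are hypothetical names, not part of `tractorun`, and the example values are made up.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Coordination:
    master_addr: str
    master_port: int
    world_size: int
    node_rank: int
    local_rank: int
    rank: int


def read_coordination(env: Mapping[str, str] = os.environ) -> Coordination:
    """Parse the coordination variables; raises KeyError if one is missing."""
    return Coordination(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        node_rank=int(env["NODE_RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        rank=int(env["RANK"]),
    )


# Made-up values for illustration; in a real job, call read_coordination()
# with no arguments to read the process environment.
example_env = {
    "MASTER_ADDR": "10.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "16",
    "NODE_RANK": "1",
    "LOCAL_RANK": "3",
    "RANK": "11",
}
coord = read_coordination(example_env)
is_main_process = coord.rank == 0  # e.g. let only one process write logs
```

Passing the environment as a mapping keeps the helper easy to test outside a running operation.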

### Backends

Backends configure `tractorun` to work with a specific ML framework.

Tractorun supports multiple backends:
* [Tractorch](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/tractorch) for PyTorch
  * [examples](https://github.com/tractoai/tractorun/tree/main/examples/pytorch)
* [Tractorax](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/tractorax) for Jax
  * [examples](https://github.com/tractoai/tractorun/tree/main/examples/jax)
* [Generic](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/generic)
  * non-specialized backend, can be used as a basis for other backends

The [Backend page](https://github.com/tractoai/tractorun/blob/main/docs/backend.md) provides an overview of all available backends.

# Options and settings

The [Options reference](https://github.com/tractoai/tractorun/blob/main/docs/options.md) page provides an overview of all available options for `tractorun`, explaining their purpose and usage. Options can be defined by:
* CLI parameters
* YAML config
* Python options

# More information

* [Examples](https://github.com/tractoai/tractorun/tree/main/examples)
* [More examples in Jupyter Notebooks](https://github.com/tractoai/tracto-examples)

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "tractorun",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "TractoAI team",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/7f/55/ffc4270eadb274b6ff4a62746290dc3b953f8a6f4e3741fb921fb1f5c5a8/tractorun-0.59.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Run distributed training in TractoAI",
    "version": "0.59.0",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ba1ca10d0963868a6434a257e667b3deeac780dfb2f5177dd320ef554ac8f285",
                "md5": "88a42f561459d5729cd60a320d5e5a0d",
                "sha256": "2e85e16090271b7a3910c0e9f5c62079961e1d761292c1811cd93ded0571fbe3"
            },
            "downloads": -1,
            "filename": "tractorun-0.59.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "88a42f561459d5729cd60a320d5e5a0d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 114747,
            "upload_time": "2025-02-17T13:23:32",
            "upload_time_iso_8601": "2025-02-17T13:23:32.353670Z",
            "url": "https://files.pythonhosted.org/packages/ba/1c/a10d0963868a6434a257e667b3deeac780dfb2f5177dd320ef554ac8f285/tractorun-0.59.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7f55ffc4270eadb274b6ff4a62746290dc3b953f8a6f4e3741fb921fb1f5c5a8",
                "md5": "9996313ab0820bd869941b91fdda70af",
                "sha256": "299d9ca58a3fde7a7ab627d53538dec1396f2fe2c8e21e8e73d8d054c862c37a"
            },
            "downloads": -1,
            "filename": "tractorun-0.59.0.tar.gz",
            "has_sig": false,
            "md5_digest": "9996313ab0820bd869941b91fdda70af",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 71754,
            "upload_time": "2025-02-17T13:23:35",
            "upload_time_iso_8601": "2025-02-17T13:23:35.759958Z",
            "url": "https://files.pythonhosted.org/packages/7f/55/ffc4270eadb274b6ff4a62746290dc3b953f8a6f4e3741fb921fb1f5c5a8/tractorun-0.59.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-17 13:23:35",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "tractorun"
}
