| Name | tractorun |
| --- | --- |
| Version | 0.59.0 |
| download | |
| home_page | None |
| Summary | Run distributed training in TractoAI |
| upload_time | 2025-02-17 13:23:35 |
| maintainer | None |
| docs_url | None |
| author | TractoAI team |
| requires_python | >=3.10 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# 🚜 Tractorun
`Tractorun` is a powerful tool for distributed ML operations on the [Tracto.ai](https://tracto.ai/) platform. It helps manage and run workflows across multiple nodes with minimal changes to the user's code:
* Training and fine-tuning models. Use Tractorun to train models across multiple compute nodes efficiently.
* Offline batch inference. Perform fast and scalable model inference.
* Running arbitrary GPU operations, ideal for any computational tasks that require distributed GPU resources.
## How it works
Built on top of [Tracto.ai](https://tracto.ai/), `Tractorun` coordinates distributed machine learning tasks. It offers out-of-the-box integrations with PyTorch and Jax, and it can easily be used with any other training or inference framework.
Key advantages:
* No need to manage your cloud infrastructure, such as configuring a Kubernetes cluster or managing GPU and InfiniBand drivers. Tracto.ai solves all these infrastructure problems for you.
* No need to coordinate distributed processes. Tractorun handles it based on the training configuration: the number of nodes and GPUs used.
Key features:
* Simple distributed task setup: just specify the number of nodes and GPUs.
* Convenient ways to run and configure: CLI, YAML config, and Python SDK.
* A range of powerful capabilities, including [sidecars](https://github.com/tractoai/tractorun/blob/main/docs/options.md#sidecar) for auxiliary tasks and transparent [mounting](https://github.com/tractoai/tractorun/blob/main/docs/options.md#bind-local) of local files directly into distributed operations.
* Integration with the Tracto.ai platform: use datasets and checkpoints stored in the Tracto.ai storage, build pipelines with Tractorun, MapReduce, Clickhouse, Spark, and more.
# Getting started
To use these examples, you'll need a Tracto account. If you don't have one yet, please sign up at [tracto.ai](https://tracto.ai/).
Install `tractorun` into your Python 3 environment:
`pip install --upgrade tractorun`
Configure the client to work with your cluster:
```shell
mkdir ~/.yt
cat <<EOF > ~/.yt/config
"proxy"={
    "url"="$YT_PROXY";
};
"token"="$YT_TOKEN";
EOF
```
Replace `$YT_PROXY` with your actual Tracto.ai cluster address and `$YT_TOKEN` with your token.
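As a quick sanity check, you can verify that the client picks up this config. Below is a minimal sketch, assuming the `ytsaurus-client` package (which provides the `yt.wrapper` module) is installed in the same environment:
```python
# Quick connectivity check for the client configuration.
# Assumption: the ytsaurus-client package (yt.wrapper module) is installed.
import yt.wrapper as yt

# yt.wrapper reads ~/.yt/config by default, so no explicit proxy/token is needed here.
# Listing a well-known directory confirms that the proxy address and token work.
print(yt.list("//tmp"))
```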
# How to try
Run an example script:
```shell
tractorun \
    --yt-path "//tmp/$USER/tractorun_getting_started" \
    --bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
    --bind-local-lib ./tractorun \
    --docker-image ghcr.io/tractoai/tractorun-examples-runtime:2025-02-10-16-14-27 \
    python3 /lightning_mnist_ddp_script.py
```
# How to run
## CLI
`tractorun --help`
or with a YAML config:
`tractorun --run-config-path config.yaml`
You can find relevant examples here:
* CLI arguments [example](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist_ddp_script).
* YAML config [example](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist_ddp_script_config).
## Python SDK
The SDK is convenient to use from Jupyter notebooks for development purposes.
You can find a relevant example in [the repository](https://github.com/tractoai/tractorun/tree/main/examples/pytorch/lightning_mnist).
WARNING: to use the SDK, the local environment must match the remote Docker image on the Tracto.ai platform.
* This requirement is met out of the box in Jupyter notebooks on the Tracto.ai platform.
* For local use, it is recommended to run the code in the same container as the one specified in the `docker_image` parameter of `tractorun`.
# How to adapt code for tractorun
## CLI
1. Wrap all training/inference code into a function.
2. Initialize the environment and obtain the Toolbox via `tractorun.run.prepare_and_get_toolbox`.
An example of adapting the MNIST training script from the [PyTorch repository](https://github.com/pytorch/examples/blob/cdef4d43fb1a2c6c4349daa5080e4e8731c34569/mnist/mnist_simple/main.py): https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli
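The two steps above boil down to an entry point like the following sketch. The backend import path and the `backend=` argument to `prepare_and_get_toolbox` are assumptions here; the linked example shows the exact call.
```python
# Sketch of a tractorun-ready entry point, launched on the cluster via the CLI,
# e.g.: tractorun --yt-path ... --docker-image ... python3 /train_script.py
from tractorun.backend.tractorch import Tractorch  # assumed import path for the PyTorch backend
from tractorun.run import prepare_and_get_toolbox


def train(toolbox) -> None:
    # Step 1: all training/inference code is wrapped into a single function.
    # `toolbox` exposes the YT client, checkpoints, coordination info, etc.
    ...


if __name__ == "__main__":
    # Step 2: initialize the environment and obtain the Toolbox.
    # NOTE: the `backend=` argument is an assumption; see the linked example.
    toolbox = prepare_and_get_toolbox(backend=Tractorch())
    train(toolbox)
```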
## SDK
1. Wrap all training/inference code into a function with a `toolbox: tractorun.toolbox.Toolbox` parameter.
2. Run this function with `tractorun.run.run`.
An example of adapting the MNIST training script from the [PyTorch repository](https://github.com/pytorch/examples/blob/cdef4d43fb1a2c6c4349daa5080e4e8731c34569/mnist/main.py): https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk
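A minimal sketch of this SDK flow, e.g. from a Jupyter notebook on the platform. The keyword names passed to `run` and the `Mesh` fields are assumptions based on the linked examples, not a definitive API reference:
```python
from tractorun.backend.tractorch import Tractorch  # assumed import path for the PyTorch backend
from tractorun.mesh import Mesh                    # assumed location of the Mesh class
from tractorun.run import run
from tractorun.toolbox import Toolbox


def train(toolbox: Toolbox) -> None:
    # Step 1: training code wrapped into a function with a Toolbox parameter.
    ...


# Step 2: submit the function to the cluster. The keyword names below
# (yt_path, mesh, docker_image) and the Mesh fields are assumptions;
# the linked example shows the exact signature.
run(
    train,
    backend=Tractorch(),
    yt_path="//tmp/my_user/tractorun_example",  # illustrative path
    mesh=Mesh(node_count=1, process_per_node=8, gpu_per_process=1),
    docker_image="ghcr.io/tractoai/tractorun-examples-runtime:2025-02-10-16-14-27",
)
```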
# Features
## Toolbox
`tractorun.toolbox.Toolbox` provides extra integrations with the Tracto.ai platform:
* A preconfigured YT client via `toolbox.yt_client`
* Basic checkpointing via `toolbox.checkpoint_manager`
* Control over the operation description in the UI via `toolbox.description_manager`
* Access to coordination information via `toolbox.coordinator`
[Toolbox page](https://github.com/tractoai/tractorun/blob/main/docs/toolbox.md) provides an overview of all available toolbox components.
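For illustration, a hedged sketch of a training function that touches these components. Only the attribute names come from the list above; the calls made on them and the table path are assumptions, so consult the Toolbox page for the documented methods.
```python
from tractorun.toolbox import Toolbox


def train(toolbox: Toolbox) -> None:
    # Preconfigured YT client: standard ytsaurus-client calls work out of the box.
    yt_client = toolbox.yt_client
    rows = list(yt_client.read_table("//tmp/my_user/train_dataset"))  # illustrative path

    # Coordination information (ranks, addresses) is available via the coordinator;
    # the exact accessor names on it are described on the Toolbox page.
    coordinator = toolbox.coordinator

    # The operation description in the UI and basic checkpointing are exposed via
    # toolbox.description_manager and toolbox.checkpoint_manager; see the Toolbox
    # page for their methods.
    ...
```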
## Coordination
Tractorun always sets the following environment variables in each process:
* `MASTER_ADDR` - the address of the master node
* `MASTER_PORT` - the port of the master node
* `WORLD_SIZE` - the total number of processes
* `NODE_RANK` - the unique ID of the current node (a job in Tracto.ai terms)
* `LOCAL_RANK` - the unique ID of the current process on the current node
* `RANK` - the unique ID of the current process across all nodes
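Because these names follow the convention used by common PyTorch launchers, a training script can typically initialize its process group directly from them. A minimal sketch using standard `torch.distributed` calls (not a tractorun-specific API):
```python
import os

import torch
import torch.distributed as dist

# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already exported by tractorun,
# so the "env://" init method picks them up without extra configuration.
dist.init_process_group(backend="nccl", init_method="env://")

# LOCAL_RANK selects the GPU for this process on its node.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

print(
    f"rank={dist.get_rank()} / world_size={dist.get_world_size()} "
    f"on node {os.environ['NODE_RANK']}"
)
```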
### Backends
Backends configure `tractorun` to work with a specific ML framework.
Tractorun supports multiple backends:
* [Tractorch](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/tractorch) for PyTorch
  * [examples](https://github.com/tractoai/tractorun/tree/main/examples/pytorch)
* [Tractorax](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/tractorax) for Jax
  * [examples](https://github.com/tractoai/tractorun/tree/main/examples/jax)
* [Generic](https://github.com/tractoai/tractorun/tree/main/tractorun/backend/generic)
  * a non-specialized backend that can be used as a basis for other backends
[Backend page](https://github.com/tractoai/tractorun/blob/main/docs/backend.md) provides an overview of all available backends.
# Options and settings
The [Options reference](https://github.com/tractoai/tractorun/blob/main/docs/options.md) page provides an overview of all available options for `tractorun`, explaining their purpose and usage. Options can be defined via:
* CLI parameters
* a YAML config
* Python options
# More information
* [Examples](https://github.com/tractoai/tractorun/tree/main/examples)
* [More examples in Jupyter Notebooks](https://github.com/tractoai/tracto-examples)