torch-submit


- Name: torch-submit
- Version: 0.1.24
- Summary: A tool for submitting and managing distributed PyTorch jobs
- Upload time: 2024-12-21 19:32:49
- Requires Python: >=3.7
- License: MIT
- Keywords: pytorch, distributed, job submission

# Torch Submit

## Introduction

Torch Submit is a lightweight, easy-to-use tool for running distributed PyTorch jobs across multiple machines. It's designed for researchers and developers who:

- Have access to a bunch of machines with IP addresses
- Want to run distributed PyTorch jobs without the hassle
- Don't have the time, energy, or patience to set up complex cluster management systems like SLURM or Kubernetes

Under the hood, Torch Submit uses Fabric to copy your working directory to the remote addresses and TorchRun to execute the command.

To understand how jobs are created and scheduled, we encourage you to read `torch_submit/executor.py`.
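
To make the copy-then-launch flow concrete, here is a minimal sketch of how one `torchrun` command per machine could be assembled. It is an illustration, not the code in `torch_submit/executor.py`: the helper `build_torchrun_commands`, the choice of the first host as rendezvous master, the default port 29500, and the example IP addresses are all assumptions.

```python
import shlex
from typing import List


def build_torchrun_commands(
    hosts: List[str],
    script_args: List[str],
    nproc_per_node: int = 1,
    master_port: int = 29500,
) -> List[str]:
    """Return one torchrun command line per host (hypothetical helper)."""
    if not hosts:
        raise ValueError("at least one host is required")
    # Assumption: the first machine in the list acts as the rendezvous master.
    master_addr = hosts[0]
    commands = []
    for node_rank in range(len(hosts)):
        argv = [
            "torchrun",
            f"--nnodes={len(hosts)}",
            f"--node_rank={node_rank}",
            f"--nproc_per_node={nproc_per_node}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            *script_args,
        ]
        # Quote every argument so the line is safe to run in a remote shell.
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


# Example with two made-up private addresses and two processes per machine.
for line in build_torchrun_commands(["10.0.0.1", "10.0.0.2"], ["train.py"], nproc_per_node=2):
    print(line)
```

Each machine gets the same command except for `--node_rank`, which is how the workers tell themselves apart during rendezvous.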

## Features

- Simple cluster configuration: Just add your machines' IP addresses
- Easy job submission: Run your PyTorch jobs with a single command
- Job management: Submit, stop, restart, and monitor your jobs
- Log tailing: Easily view the logs of your running jobs
- Optuna integration: Run parallel hyperparameter optimization across your cluster

## Installation

```bash
pip install torch-submit
```

or from source:

```bash
pip install -e . --prefix ~/.local
```

## Quick Start

1. Set up a cluster:
   ```bash
   torch-submit cluster create
   ```
   Follow the interactive prompts to add your machines.

2. Submit a job:
   ```bash
   torch-submit job submit --cluster my_cluster -- <entrypoint>
   # for example:
   # torch-submit job submit --cluster my_cluster -- python train.py
   # torch-submit job submit --cluster my_cluster -- python -m main.train
   ```

3. List running jobs:
   ```bash
   torch-submit job list
   ```

4. Tail logs:
   ```bash
   torch-submit logs tail <job_id>
   ```

5. Stop a job:
   ```bash
   torch-submit job stop <job_id>
   ```

6. Restart a stopped job:
   ```bash
   torch-submit job restart <job_id>
   ```
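
The `python train.py` entrypoint from step 2 is an ordinary PyTorch script. Because the command is executed through TorchRun, each worker process can read its rank and world size from the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. A minimal sketch, assuming a script named `train.py`; `DistEnv`, `read_dist_env`, and `main` are hypothetical names, not part of Torch Submit:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class DistEnv:
    rank: int        # global index of this worker
    local_rank: int  # index of this worker on its own machine
    world_size: int  # total number of workers in the job


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the process layout that torchrun exports to each worker.

    The defaults describe a single-process run, so the script also
    works when started directly with `python train.py`.
    """
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


def main() -> None:
    # Imported lazily so the helper above can be used without PyTorch.
    import torch
    import torch.distributed as dist

    dist_env = read_dist_env()
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # With no init_method given, the rendezvous settings come from the
    # environment (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).
    dist.init_process_group(backend=backend)
    if torch.cuda.is_available():
        torch.cuda.set_device(dist_env.local_rank)
    print(f"worker {dist_env.rank} of {dist_env.world_size} ready")
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()
```

Call `main()` from the script's `if __name__ == "__main__":` block so that it runs when the job starts.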

## Usage

### Cluster Management

- Create a cluster: `torch-submit cluster create`
- List clusters: `torch-submit cluster list`
- Remove a cluster: `torch-submit cluster remove <cluster_name>`

### Job Management

- Submit a job: `torch-submit job submit --cluster my_cluster -- <entrypoint>`
- List jobs: `torch-submit job list`
- Stop a job: `torch-submit job stop <job_id>`
- Restart a job: `torch-submit job restart <job_id>`

### Log Management

- Tail logs: `torch-submit job logs <job_id>`

### Optuna

The Optuna executor requires a database connection, which you can set up with `torch-submit db create`. This creates a new database called `torch_submit` within the specified connection. The database should be accessible to all machines in the cluster. The study name and storage info are available to the job through the `OPTUNA_STUDY_NAME` and `OPTUNA_STORAGE` environment variables.
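
Inside the job, these two variables are enough to attach to the shared study. The sketch below assumes the standard Optuna API (`optuna.create_study` with `load_if_exists=True`, so it works whether or not the study already exists); `read_optuna_settings`, `objective`, the `lr` parameter, and the toy loss are placeholders, not part of Torch Submit.

```python
import os
from typing import Mapping, Tuple


def read_optuna_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (study_name, storage) from the job's environment."""
    try:
        return env["OPTUNA_STUDY_NAME"], env["OPTUNA_STORAGE"]
    except KeyError as missing:
        raise RuntimeError(
            f"{missing.args[0]} is not set; was this job started with the Optuna executor?"
        ) from None


def objective(trial) -> float:
    # Hypothetical search space: a single learning rate on a log scale.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    # Placeholder loss; replace with a real training run that returns
    # the validation metric to minimize.
    return (lr - 1e-3) ** 2


def main() -> None:
    # Imported lazily so the helpers above can be used without Optuna.
    import optuna

    study_name, storage = read_optuna_settings()
    # Every worker attaches to the same study through the shared database.
    study = optuna.create_study(
        study_name=study_name, storage=storage, load_if_exists=True
    )
    study.optimize(objective, n_trials=10)
```

Because every worker points at the same storage, trials run on different machines end up in one study and can inform each other's sampling.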

## Configuration

Torch Submit stores cluster configurations in `~/.cache/torch-submit/config.yaml`. You can manually edit this file if needed, but it's recommended to use the CLI commands for cluster management.
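
If you do edit the file by hand, keeping a copy first makes it easy to roll back. A small sketch using only the standard library; `config_path` and `backup_config` are hypothetical helpers, and nothing here assumes a particular layout inside the YAML file:

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def config_path(home: Optional[Path] = None) -> Path:
    """Location of the cluster configuration, relative to the home directory."""
    base = home if home is not None else Path.home()
    return base / ".cache" / "torch-submit" / "config.yaml"


def backup_config(path: Path) -> Path:
    """Copy the config next to itself with a timestamped .bak suffix."""
    if not path.is_file():
        raise FileNotFoundError(f"no config found at {path}; create a cluster first")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.copy2(path, backup)  # copy2 also preserves the modification time
    return backup
```

Running `backup_config(config_path())` before an edit leaves the original untouched and returns the path of the copy.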

## Requirements

- Python 3.7+
- PyTorch (for your actual jobs)
- SSH access to all machines in your cluster

## Contributing

We welcome contributions! Please see our Contributing Guide for more details.

## License

Torch Submit is released under the MIT License. See the LICENSE file for more details.

## Support

If you encounter any issues or have questions, please file an issue on our GitHub Issues page.

Happy distributed training!

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "torch-submit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "pytorch, distributed, job submission",
    "author": null,
    "author_email": "Tony Francis <tony@dream3d.com>",
    "download_url": "https://files.pythonhosted.org/packages/d2/e2/21499736fa2f58244d2b9c4f4b6003b1e381f96f7971230faf30bee7ee51/torch_submit-0.1.24.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for submitting and managing distributed PyTorch jobs",
    "version": "0.1.24",
    "project_urls": {
        "Bug Tracker": "https://github.com/dream3d-ai/torch-submit/issues",
        "Homepage": "https://github.com/dream3d-ai/torch-submit"
    },
    "split_keywords": [
        "pytorch",
        " distributed",
        " job submission"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5b1737160fca3289ba24b09c8d090a661008a99445c88e6a8e18b42ee96c93a1",
                "md5": "eb0a78b50c797bb64a42ab8ffe97e86f",
                "sha256": "b070a8dd122584b3c417e14eca088d088a52f5e4805a3b07bc26d267d6f58bf4"
            },
            "downloads": -1,
            "filename": "torch_submit-0.1.24-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "eb0a78b50c797bb64a42ab8ffe97e86f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 25812,
            "upload_time": "2024-12-21T19:32:47",
            "upload_time_iso_8601": "2024-12-21T19:32:47.497032Z",
            "url": "https://files.pythonhosted.org/packages/5b/17/37160fca3289ba24b09c8d090a661008a99445c88e6a8e18b42ee96c93a1/torch_submit-0.1.24-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d2e221499736fa2f58244d2b9c4f4b6003b1e381f96f7971230faf30bee7ee51",
                "md5": "ae07b4a5e1d57a875e06988fe7f5c085",
                "sha256": "6a35831e7caffcd0587cf7177c37d17ec50c2b2ff22f34813331fd857b95b7d9"
            },
            "downloads": -1,
            "filename": "torch_submit-0.1.24.tar.gz",
            "has_sig": false,
            "md5_digest": "ae07b4a5e1d57a875e06988fe7f5c085",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 34096,
            "upload_time": "2024-12-21T19:32:49",
            "upload_time_iso_8601": "2024-12-21T19:32:49.706025Z",
            "url": "https://files.pythonhosted.org/packages/d2/e2/21499736fa2f58244d2b9c4f4b6003b1e381f96f7971230faf30bee7ee51/torch_submit-0.1.24.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-21 19:32:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dream3d-ai",
    "github_project": "torch-submit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "torch-submit"
}
        