SumsJob


- Name: SumsJob
- Version: 0.7.2
- Home page: https://github.com/lululxvi/sumsjob
- Summary: A simple Linux command-line utility which submits a job to one of the multiple GPU servers
- Upload time: 2024-08-16 22:59:58
- Author: Lu Lu
- License: GPL-3.0
- Keywords: command-line utility, multiple servers, GPU, job submission

# &Sigma;&Sigma;<sub>Job</sub>

[![PyPI version](https://badge.fury.io/py/SumsJob.svg)](https://badge.fury.io/py/SumsJob)
[![Downloads](https://pepy.tech/badge/sumsjob)](https://pepy.tech/project/sumsjob)
[![License](https://img.shields.io/github/license/lululxvi/sumsjob)](https://github.com/lululxvi/sumsjob/blob/master/LICENSE)

&Sigma;&Sigma;<sub>Job</sub> or Sums<sub>Job</sub> (**S**imple **U**tility for **M**ultiple-**S**ervers **Job** **Sub**mission) is a simple Linux command-line utility that submits a job to one of several servers, each with a limited number of GPUs. For a group of standalone servers, &Sigma;&Sigma;<sub>Job</sub> offers key functions similar to those that [Slurm Workload Manager](https://slurm.schedmd.com) offers for supercomputers and computer clusters. It provides the following key functions:

- report the state of GPUs on all servers,
- submit a job to servers for execution in noninteractive mode, i.e., the job runs in the background on the server,
- submit a job to servers for execution in interactive mode, as if the job were running on your local machine,
- display all running jobs,
- cancel running jobs.

## Motivation

Assume you have a few GPU servers: `server1`, `server2`, ... When you need to run code from your computer, you have to:

1. Select one server and log in

       $ ssh LAN  # you may first need to log in to a local area network
       $ ssh server1

1. Check GPU status. If no GPU is free, go back to step 1

   `$ nvidia-smi` or `$ gpustat`

1. Copy the code from your computer to the server

       $ scp -r codes server1:~/project/codes

1. Run the code on the server

       $ cd ~/project/codes
       $ CUDA_VISIBLE_DEVICES=0 python main.py

1. Transfer the results back

       $ scp server1:~/project/codes/results.dat .

These steps are tedious. &Sigma;&Sigma;<sub>Job</sub> automates all of them.

## Features

- Simple to use
- Two modes: noninteractive and interactive
- Noninteractive mode: the job runs in the background on the server
    + You can turn off your local machine
- Interactive mode: as if the job were running on your local machine
    + Display the output of the program in the terminal of your local machine in real time
    + Kill the job with Ctrl-C

## Commands

- [sinfo](#-sinfo): Report the state of GPUs on all servers.
- [srun](#-srun-jobfile-jobname): Submit a job to GPU servers for execution.
- [sacct](#-sacct): Display all running jobs ordered by the start time.
- [scancel](#-scancel-jobname): Cancel a running job.

### `$ sinfo`

Report the state of GPUs on all servers. For example,

```
$ sinfo
chitu                       Fri Dec 31 20:05:24 2021  470.74
[0] NVIDIA GeForce RTX 3080 | 27'C,   0 % |  2190 / 10018 MB | shuaim:python3/3589(2190M)
[1] NVIDIA GeForce RTX 3080 | 53'C,   7 % |  2159 / 10014 MB | lu:python/241697(2159M)

dilu                           Fri Dec 31 20:05:26 2021  470.74
[0] NVIDIA GeForce RTX 3080 Ti | 65'C,  73 % |  1672 / 12045 MB | chenxiwu:python/352456(1672M)
[1] NVIDIA GeForce RTX 3080 Ti | 54'C,  83 % |  1610 / 12053 MB | chenxiwu:python/352111(1610M)

Available GPU: chitu [0]
```
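The selection behind the `Available GPU` line can be sketched as follows. This is a minimal illustration, not SumsJob's own code: it assumes the gpustat-style line format shown above, and the two thresholds are hypothetical values chosen for the example (the real criterion is defined in the configuration file and may differ).

```python
import re

# Matches one gpustat-style GPU line, e.g.
# "[0] NVIDIA GeForce RTX 3080 | 27'C,   0 % |  2190 / 10018 MB | user:python/1(2190M)"
GPU_LINE = re.compile(
    r"^\[(?P<index>\d+)\]\s+(?P<name>.+?)\s*\|"
    r"\s*(?P<temp>\d+)'C,\s*(?P<util>\d+)\s*%\s*\|"
    r"\s*(?P<used>\d+)\s*/\s*(?P<total>\d+)\s*MB"
)

# Hypothetical thresholds for illustration only; the real criterion
# lives in the SumsJob configuration file.
MAX_UTILIZATION = 5     # percent
MIN_FREE_MEMORY = 4000  # MB


def available_gpus(report):
    """Return the indices of the GPUs in `report` that pass both thresholds."""
    free = []
    for line in report.splitlines():
        match = GPU_LINE.match(line.strip())
        if match is None:
            continue  # server header, blank line, or unrecognized format
        utilization = int(match["util"])
        free_memory = int(match["total"]) - int(match["used"])
        if utilization <= MAX_UTILIZATION and free_memory >= MIN_FREE_MEMORY:
            free.append(int(match["index"]))
    return free


chitu = """\
chitu                       Fri Dec 31 20:05:24 2021  470.74
[0] NVIDIA GeForce RTX 3080 | 27'C,   0 % |  2190 / 10018 MB | shuaim:python3/3589(2190M)
[1] NVIDIA GeForce RTX 3080 | 53'C,   7 % |  2159 / 10014 MB | lu:python/241697(2159M)
"""
print(available_gpus(chitu))
```

With these assumed thresholds, only GPU 0 on `chitu` qualifies, which agrees with the sample report; GPU 1 is rejected because of its 7 % utilization.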

### `$ srun jobfile [jobname]`

Submit a job to GPU servers for execution. `srun` automatically performs the following steps:

1. Find a GPU with low utilization and sufficient memory (the criterion is in the configuration file).
    - If no GPU is currently available, it waits for some time (`-p PERIOD_RETRY`) and then tries again, until it reaches the maximum number of retries (`-n NUM_RETRY`).
    - You can also specify the server and GPU with `-s SERVER` and `--gpuid GPUID`.
1. Copy the code to the server.
1. Run the job on it in noninteractive mode (default) or interactive mode (with `-i`).
1. Save the output in a log file.
1. For interactive mode, when the code finishes, transfer back the result files and the log file.
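The wait-and-retry behavior in step 1 can be sketched as a small polling loop. This is an illustration under assumptions, not the actual implementation: `wait_for_gpu` and `find_gpu` are hypothetical names, and counting one initial attempt plus `num_retry` retries is a guess at how the limit is applied.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_gpu(
    find_gpu: Callable[[], Optional[Tuple[str, int]]],
    num_retry: int = 1000,      # mirrors the documented default of -n
    period_retry: float = 600,  # mirrors the documented default of -p (seconds)
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Tuple[str, int]]:
    """Poll `find_gpu` until it returns a (server, gpu_id) pair or retries run out."""
    for attempt in range(num_retry + 1):  # assumed: first try plus `num_retry` retries
        gpu = find_gpu()
        if gpu is not None:
            return gpu
        if attempt < num_retry:
            sleep(period_retry)  # wait before the next retry
    return None


# Simulated server state: no GPU is free on the first two checks.
answers = iter([None, None, ("chitu", 0)])
print(wait_for_gpu(lambda: next(answers), num_retry=5, period_retry=0))
```

Passing `sleep` as a parameter keeps the loop easy to exercise without actually waiting ten minutes between attempts.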

- `jobfile` : File to be run
- `jobname` : Job name, and also the folder name of the job. If not provided, a random number will be used.

Options:

- `-h`, `--help` : Show this help message and exit
- `-i`, `--interact` : Run the job in interactive mode
- `-s SERVER`, `--server SERVER` : Server host name
- `--gpuid GPUID` : GPU ID to be used; -1 to use CPU only
- `-n NUM_RETRY`, `--num_retry NUM_RETRY` : Number of times to retry the submission (Default: 1000)
- `-p PERIOD_RETRY`, `--period_retry PERIOD_RETRY` : Waiting time (seconds) between two consecutive retries (Default: 600)
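The command-line interface listed above can be mirrored with `argparse`, which may help when wrapping `srun` in your own scripts. This is a sketch that reproduces only the documented arguments and defaults; it is not the parser SumsJob ships, and the argument types (`int` for `--gpuid`, `-n`, and `-p`) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same arguments as the documented `srun` command."""
    parser = argparse.ArgumentParser(
        prog="srun", description="Submit a job to GPU servers for execution."
    )
    parser.add_argument("jobfile", help="File to be run")
    parser.add_argument(
        "jobname", nargs="?", help="Job name, and also the folder name of the job"
    )
    parser.add_argument(
        "-i", "--interact", action="store_true", help="Run the job in interactive mode"
    )
    parser.add_argument("-s", "--server", help="Server host name")
    parser.add_argument(
        "--gpuid", type=int, help="GPU ID to be used; -1 to use CPU only"
    )
    parser.add_argument(
        "-n", "--num_retry", type=int, default=1000,
        help="Number of times to retry the submission",
    )
    parser.add_argument(
        "-p", "--period_retry", type=int, default=600,
        help="Waiting time (seconds) between two retries",
    )
    return parser


# Parse an example command line: interactive CPU-only run of main.py as "job1".
args = build_parser().parse_args(["main.py", "job1", "-i", "--gpuid", "-1"])
print(args.jobfile, args.jobname, args.interact, args.gpuid)
```

Because none of the options looks like a negative number, `argparse` reads `-1` as the value of `--gpuid` rather than as a separate flag.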

### `$ sacct`

Display all running jobs ordered by the start time. For example,

```
$ sacct
Server   JobName          Start
-------- ---------------- ----------------------
chitu    job1             12/31/2021 07:41:08 PM
chitu    job2             12/31/2021 08:14:54 PM
dilu     job3             12/31/2021 08:15:23 PM
```
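Ordering by start time needs the timestamps parsed, because plain string sorting of 12-hour times would put `12:30:00 PM` after `01:00:00 PM`. The sketch below assumes the timestamp format shown in the sample output and an English locale for the `%p` (AM/PM) field; the `jobs` list and `by_start_time` are hypothetical names, not part of SumsJob.

```python
from datetime import datetime

START_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. "12/31/2021 07:41:08 PM"

# (server, job name, start time) rows in arbitrary order
jobs = [
    ("dilu", "job3", "12/31/2021 08:15:23 PM"),
    ("chitu", "job1", "12/31/2021 07:41:08 PM"),
    ("chitu", "job2", "12/31/2021 08:14:54 PM"),
]


def by_start_time(rows):
    """Return the rows sorted by their parsed start time, earliest first."""
    return sorted(rows, key=lambda row: datetime.strptime(row[2], START_FORMAT))


for server, name, start in by_start_time(jobs):
    print(f"{server:<8} {name:<16} {start}")
```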

### `$ scancel jobname`

Cancel a running job.

- `jobname` : Job name.

## Installation

&Sigma;&Sigma;<sub>Job</sub> requires Python 3.7 or later. Install with `pip`:

```
$ pip install sumsjob
```

You also need to do the following:

- Make sure you can `ssh` to each server, ideally without typing a password by using SSH keys.
- Install [gpustat](https://github.com/wookayin/gpustat) on each server.
- Create a configuration file at `~/.sumsjob/config.py`. Use [config.py](https://github.com/lululxvi/sumsjob/blob/master/sumsjob/config.py) as a template, and adjust the values to match your setup.
- Make sure `~/.local/bin` is in your `$PATH`.

Then run `sinfo` to check if everything works.
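If `sinfo` is not found or fails at startup, two of the prerequisites above can be checked locally with a short script. This is a hypothetical helper, not part of SumsJob; it only checks that the configuration file exists and that `~/.local/bin` is on `PATH`, and it does not verify SSH access or gpustat on the servers.

```python
import os
from pathlib import Path


def check_setup(home, path_env):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    config = home / ".sumsjob" / "config.py"
    if not config.is_file():
        problems.append(f"missing configuration file: {config}")
    local_bin = home / ".local" / "bin"
    # Compare PATH entries as paths so that trailing slashes do not matter.
    entries = [Path(entry) for entry in path_env.split(os.pathsep) if entry]
    if local_bin not in entries:
        problems.append(f"{local_bin} is not in PATH")
    return problems


for problem in check_setup(Path.home(), os.environ.get("PATH", "")):
    print(problem)
```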

## License

[GNU GPLv3](LICENSE)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/lululxvi/sumsjob",
    "name": "SumsJob",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "Command-line utility, Multiple servers, GPU, Job submission",
    "author": "Lu Lu",
    "author_email": "lululxvi@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/47/c6/3f04f4e0db388e6e24e9effa1d0d389c20c52848f4d8b8a0607b4c7df8b5/sumsjob-0.7.2.tar.gz",
    "platform": null,
    "description": "# &Sigma;&Sigma;<sub>Job</sub>\n\n[![PyPI version](https://badge.fury.io/py/SumsJob.svg)](https://badge.fury.io/py/SumsJob)\n[![Downloads](https://pepy.tech/badge/sumsjob)](https://pepy.tech/project/sumsjob)\n[![License](https://img.shields.io/github/license/lululxvi/sumsjob)](https://github.com/lululxvi/sumsjob/blob/master/LICENSE)\n\n&Sigma;&Sigma;<sub>Job</sub> or Sums<sub>Job</sub> (**S**imple **U**tility for **M**ultiple-**S**ervers **Job** **Sub**mission) is a simple Linux command-line utility which submits a job to one of the multiple servers each with limited GPUs. &Sigma;&Sigma;<sub>Job</sub> provides similar key functions for multiple servers as [Slurm Workload Manager](https://slurm.schedmd.com) for supercomputers and computer clusters. It provides the following key functions:\n\n- report the state of GPUs on all servers,\n- submit a job to servers for execution in noninteractive mode, i.e., the job will be running in the background of the server,\n- submit a job to servers for execution in interactive mode, just as the job is running in your local machine,\n- display all running jobs,\n- cancel running jobs.\n\n## Motivation\n\nAssume you have a few GPU servers: `server1`, `server2`, ... When you need to run a code from your computer, you will\n\n1. Select one server and log in\n\n       $ ssh LAN (You may need to first log in a local area network)\n       $ ssh server1\n\n1. Check GPU status. If no free GPU, go to step 1\n\n   `$ nvidia-smi` or `$ gpustat`\n\n1. Copy the code from your computer to the server\n\n       $ scp -r codes server1:~/project/codes\n\n1. Run the code in the server\n\n       $ cd ~/project/codes\n       $ CUDA_VISIBLE_DEVICES=0 python main.py\n\n1. Transfer back the results\n\n       $ scp server1:~/project/codes/results.dat .\n\nThese steps are boring. 
&Sigma;&Sigma;<sub>Job</sub> makes all these steps automatic.\n\n## Features\n\n- Simple to use\n- Two modes: noninteractive mode, and interactive mode\n- Noninteractive mode: the job will be running in the background of the server\n    + You can turn off your local machine\n- Interactive mode: just as the job is running in your local machine\n    + Display the output of the program in the terminal of your local machine in real time\n    + Kill the job by Ctrl-C\n\n## Commands\n\n- [sinfo](#-sinfo): Report the state of GPUs on all servers.\n- [srun](#-srun-jobfile-jobname): Submit a job to GPU servers for execution.\n- [sacct](#-sacct): Display all running jobs ordered by the start time.\n- [scancel](#-scancel-jobname): Cancel a running job.\n\n### `$ sinfo`\n\nReport the state of GPUs on all servers. For example,\n\n```\n$ sinfo\nchitu                       Fri Dec 31 20:05:24 2021  470.74\n[0] NVIDIA GeForce RTX 3080 | 27'C,   0 % |  2190 / 10018 MB | shuaim:python3/3589(2190M)\n[1] NVIDIA GeForce RTX 3080 | 53'C,   7 % |  2159 / 10014 MB | lu:python/241697(2159M)\n\ndilu                           Fri Dec 31 20:05:26 2021  470.74\n[0] NVIDIA GeForce RTX 3080 Ti | 65'C,  73 % |  1672 / 12045 MB | chenxiwu:python/352456(1672M)\n[1] NVIDIA GeForce RTX 3080 Ti | 54'C,  83 % |  1610 / 12053 MB | chenxiwu:python/352111(1610M)\n\nAvailable GPU: chitu [0]\n```\n\n### `$ srun jobfile [jobname]`\n\nSubmit a job to GPU servers for execution. Automatically do the following steps:\n\n1. Find a GPU with low utilization and sufficient memory (the criterion is in the configuration file).\n    - If currently no GPU available, it will wait for some time (`-p PERIOD_RETRY`) and then try again, until reaching the maximum retries (`-n NUM_RETRY`).\n    - You can also specify the server and GPU by `-s SERVER` and `--gpuid GPUID`.\n1. Copy the code to the server.\n1. Run the job on it in noninteractive mode (default) or interactive mode (with `-i`).\n1. 
Save the output in a log file.\n1. For interactive mode, when the code finishes, transfer back the result files and the log file.\n\n- `jobfile` : File to be run\n- `jobname` : Job name, and also the folder name of the job. If not provided, a random number will be used.\n\nOptions:\n\n- `-h`, `--help` : Show this help message and exit\n- `-i`, `--interact` : Run the job in interactive mode\n- `-s SERVER`, `--server SERVER` : Server host name\n- `--gpuid GPUID` : GPU ID to be used; -1 to use CPU only\n- `-n NUM_RETRY`, `--num_retry NUM_RETRY` : Number of times to retry the submission (Default: 1000)\n- `-p PERIOD_RETRY`, `--period_retry PERIOD_RETRY` : Waiting time (seconds) between two retries after each retry failure (Default: 600)\n\n### `$ sacct`\n\nDisplay all running jobs ordered by the start time. For example,\n\n```\n$ sacct\nServer   JobName          Start\n-------- ---------------- ----------------------\nchitu    job1             12/31/2021 07:41:08 PM\nchitu    job2             12/31/2021 08:14:54 PM\ndilu     job3             12/31/2021 08:15:23 PM\n```\n\n### `$ scancel jobname`\n\nCancel a running job.\n\n- `jobname` : Job name.\n\n## Installation\n\n&Sigma;&Sigma;<sub>Job</sub> requires Python 3.7 or later. Install with `pip`:\n\n```\n$ pip install sumsjob\n```\n\nYou also need to do the following:\n\n- Make sure you can `ssh` to each server, ideally without typing the password by SSH keys.\n- Install [gpustat](https://github.com/wookayin/gpustat) in each server.\n- Create a configuration file at `~/.sumsjob/config.py`. Use [config.py](https://github.com/lululxvi/sumsjob/blob/master/sumsjob/config.py) as a template, and modify the values to your configurations.\n- Make sure `~/.local/bin` is in your `$PATH`.\n\nThen run `sinfo` to check if everything works.\n\n## License\n\n[GNU GPLv3](LICENSE)\n",
    "bugtrack_url": null,
    "license": "GPL-3.0",
    "summary": "A simple Linux command-line utility which submits a job to one of the multiple GPU servers",
    "version": "0.7.2",
    "project_urls": {
        "Download": "https://github.com/lululxvi/deepxde/tarball/v0.7.2",
        "Homepage": "https://github.com/lululxvi/sumsjob"
    },
    "split_keywords": [
        "command-line utility",
        " multiple servers",
        " gpu",
        " job submission"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9d0ea6b4a78a95cbb48a89130dc33f993e8b93d8ee2832f89876dac012288a3e",
                "md5": "34736530ee84b2abc13cd023eae69c07",
                "sha256": "cbc04c7fe5eed1d141bf7e48ce7dcff679680a7cf164639957080719e6b13be4"
            },
            "downloads": -1,
            "filename": "SumsJob-0.7.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "34736530ee84b2abc13cd023eae69c07",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 22728,
            "upload_time": "2024-08-16T22:59:56",
            "upload_time_iso_8601": "2024-08-16T22:59:56.935298Z",
            "url": "https://files.pythonhosted.org/packages/9d/0e/a6b4a78a95cbb48a89130dc33f993e8b93d8ee2832f89876dac012288a3e/SumsJob-0.7.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "47c63f04f4e0db388e6e24e9effa1d0d389c20c52848f4d8b8a0607b4c7df8b5",
                "md5": "475a9150475319343100d8d76bb99273",
                "sha256": "e5171221b35e19704c2bd9872ccc8930be0ab1ae63978d511b2c2cfd4d8f70e8"
            },
            "downloads": -1,
            "filename": "sumsjob-0.7.2.tar.gz",
            "has_sig": false,
            "md5_digest": "475a9150475319343100d8d76bb99273",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 20986,
            "upload_time": "2024-08-16T22:59:58",
            "upload_time_iso_8601": "2024-08-16T22:59:58.561556Z",
            "url": "https://files.pythonhosted.org/packages/47/c6/3f04f4e0db388e6e24e9effa1d0d389c20c52848f4d8b8a0607b4c7df8b5/sumsjob-0.7.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-16 22:59:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lululxvi",
    "github_project": "sumsjob",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "sumsjob"
}
        