# ΣΣ<sub>Job</sub>
[![PyPI version](https://badge.fury.io/py/SumsJob.svg)](https://badge.fury.io/py/SumsJob)
[![Downloads](https://pepy.tech/badge/sumsjob)](https://pepy.tech/project/sumsjob)
[![License](https://img.shields.io/github/license/lululxvi/sumsjob)](https://github.com/lululxvi/sumsjob/blob/master/LICENSE)
ΣΣ<sub>Job</sub> or Sums<sub>Job</sub> (**S**imple **U**tility for **M**ultiple-**S**ervers **Job** **Sub**mission) is a simple Linux command-line utility that submits a job to one of several servers, each with a limited number of GPUs. For a group of standalone servers, ΣΣ<sub>Job</sub> offers functionality similar to what the [Slurm Workload Manager](https://slurm.schedmd.com) provides for supercomputers and computer clusters. It provides the following key functions:
- report the state of GPUs on all servers,
- submit a job to servers for execution in noninteractive mode, i.e., the job runs in the background on the server,
- submit a job to servers for execution in interactive mode, i.e., as if the job were running on your local machine,
- display all running jobs,
- cancel running jobs.
## Motivation
Assume you have a few GPU servers: `server1`, `server2`, ... To run code from your computer, you would typically:

1. Select a server and log in

        $ ssh LAN      # you may first need to log in to a local area network
        $ ssh server1

1. Check the GPU status. If no GPU is free, go back to step 1

    `$ nvidia-smi` or `$ gpustat`

1. Copy the code from your computer to the server

        $ scp -r codes server1:~/project/codes

1. Run the code on the server

        $ cd ~/project/codes
        $ CUDA_VISIBLE_DEVICES=0 python main.py

1. Transfer the results back

        $ scp server1:~/project/codes/results.dat .

These steps are tedious. ΣΣ<sub>Job</sub> automates all of them.
## Features
- Simple to use
- Two modes: noninteractive and interactive
- Noninteractive mode: the job runs in the background on the server
    + You can turn off your local machine
- Interactive mode: as if the job were running on your local machine
    + The program's output is displayed in your local terminal in real time
    + You can kill the job with Ctrl-C
## Commands
- [sinfo](#-sinfo): Report the state of GPUs on all servers.
- [srun](#-srun-jobfile-jobname): Submit a job to GPU servers for execution.
- [sacct](#-sacct): Display all running jobs ordered by the start time.
- [scancel](#-scancel-jobname): Cancel a running job.
### `$ sinfo`
Report the state of GPUs on all servers. For example,
```
$ sinfo
chitu Fri Dec 31 20:05:24 2021 470.74
[0] NVIDIA GeForce RTX 3080 | 27'C, 0 % | 2190 / 10018 MB | shuaim:python3/3589(2190M)
[1] NVIDIA GeForce RTX 3080 | 53'C, 7 % | 2159 / 10014 MB | lu:python/241697(2159M)
dilu Fri Dec 31 20:05:26 2021 470.74
[0] NVIDIA GeForce RTX 3080 Ti | 65'C, 73 % | 1672 / 12045 MB | chenxiwu:python/352456(1672M)
[1] NVIDIA GeForce RTX 3080 Ti | 54'C, 83 % | 1610 / 12053 MB | chenxiwu:python/352111(1610M)
Available GPU: chitu [0]
```
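
The `Available GPU` line at the end of the report reflects a selection rule whose actual criterion lives in the configuration file. As a rough illustration only, the sketch below parses gpustat-style lines like those in the report and picks the first GPU under a utilization limit and with enough free memory. The function names, the regular expression, and the threshold values are assumptions made for this example, not ΣΣ<sub>Job</sub>'s implementation.

```python
import re

# Matches gpustat-style lines such as:
# [0] NVIDIA GeForce RTX 3080 | 27'C, 0 % | 2190 / 10018 MB | user:python/1(2190M)
GPU_LINE = re.compile(
    r"\[(?P<index>\d+)\].*?\|\s*\d+'C,\s*(?P<util>\d+)\s*%\s*\|"
    r"\s*(?P<used>\d+)\s*/\s*(?P<total>\d+)\s*MB"
)


def parse_gpus(report):
    """Return one dict per GPU line: index, util (%), used and total memory (MB)."""
    gpus = []
    for line in report.splitlines():
        match = GPU_LINE.search(line)
        if match:
            gpus.append({key: int(value) for key, value in match.groupdict().items()})
    return gpus


def first_available(gpus, max_util=10, min_free_mb=7000):
    """Pick the first GPU under the (hypothetical) thresholds, or None."""
    for gpu in gpus:
        free_mb = gpu["total"] - gpu["used"]
        if gpu["util"] <= max_util and free_mb >= min_free_mb:
            return gpu["index"]
    return None


report = """\
[0] NVIDIA GeForce RTX 3080 | 27'C, 0 % | 2190 / 10018 MB | shuaim:python3/3589(2190M)
[1] NVIDIA GeForce RTX 3080 | 53'C, 7 % | 2159 / 10014 MB | lu:python/241697(2159M)
"""
print(first_available(parse_gpus(report)))
```

Keeping the thresholds as parameters mirrors the idea that the criterion is configurable rather than hard-coded.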
### `$ srun jobfile [jobname]`
Submit a job to GPU servers for execution. `srun` automatically performs the following steps:

1. Find a GPU with low utilization and sufficient memory (the criterion is set in the configuration file).
    - If no GPU is currently available, wait for some time (`-p PERIOD_RETRY`) and try again, until the maximum number of retries (`-n NUM_RETRY`) is reached.
    - You can also specify the server and GPU with `-s SERVER` and `--gpuid GPUID`.
1. Copy the code to the server.
1. Run the job on it in noninteractive mode (default) or interactive mode (with `-i`).
1. Save the output in a log file.
1. In interactive mode, when the code finishes, transfer the result files and the log file back.

- `jobfile` : File to be run
- `jobname` : Job name, and also the folder name of the job. If not provided, a random number will be used.
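
The wait-and-retry behavior in step 1 can be pictured as a simple loop. This is a minimal sketch under stated assumptions: `find_gpu` is a hypothetical callable that returns a `(server, gpu_id)` pair or `None`, `submit_with_retry` is not part of ΣΣ<sub>Job</sub>, and counting one initial attempt plus up to `num_retry` retries is a guess at the semantics. Only the default values (1000 retries, 600 seconds) come from the documented options.

```python
import time


def submit_with_retry(find_gpu, num_retry=1000, period_retry=600, sleep=time.sleep):
    """Call find_gpu() until it returns a target or the retries run out.

    Assumption: one initial attempt plus up to num_retry retries.
    """
    for attempt in range(num_retry + 1):
        target = find_gpu()
        if target is not None:
            return target
        if attempt < num_retry:
            sleep(period_retry)  # wait before trying again
    return None


# Usage with a fake finder that succeeds on the third call;
# sleep is replaced by a recorder so the example runs instantly.
calls = iter([None, None, ("server1", 0)])
waits = []
result = submit_with_retry(lambda: next(calls), num_retry=5, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter is a design choice for this example only; it lets the loop be exercised without actually waiting.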
Options:
- `-h`, `--help` : Show this help message and exit
- `-i`, `--interact` : Run the job in interactive mode
- `-s SERVER`, `--server SERVER` : Server host name
- `--gpuid GPUID` : GPU ID to be used; -1 to use CPU only
- `-n NUM_RETRY`, `--num_retry NUM_RETRY` : Number of times to retry the submission (Default: 1000)
- `-p PERIOD_RETRY`, `--period_retry PERIOD_RETRY` : Waiting time (in seconds) between retries after a failed attempt (Default: 600)
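
The arguments and options listed above map naturally onto an `argparse` parser. The sketch below is reconstructed from this help text, not taken from the tool's source; `build_parser` is a hypothetical name, and leaving `--server` and `--gpuid` without defaults is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented srun interface."""
    parser = argparse.ArgumentParser(
        prog="srun", description="Submit a job to GPU servers for execution."
    )
    parser.add_argument("jobfile", help="File to be run")
    parser.add_argument(
        "jobname", nargs="?", help="Job name, and also the folder name of the job"
    )
    parser.add_argument(
        "-i", "--interact", action="store_true", help="Run the job in interactive mode"
    )
    parser.add_argument("-s", "--server", help="Server host name")
    parser.add_argument(
        "--gpuid", type=int, help="GPU ID to be used; -1 to use CPU only"
    )
    parser.add_argument(
        "-n", "--num_retry", type=int, default=1000,
        help="Number of times to retry the submission",
    )
    parser.add_argument(
        "-p", "--period_retry", type=int, default=600,
        help="Waiting time (seconds) between retries",
    )
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["main.py", "job1", "-i", "-s", "server1", "-n", "10"])
print(args)
```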
### `$ sacct`
Display all running jobs ordered by the start time. For example,
```
$ sacct
Server JobName Start
-------- ---------------- ----------------------
chitu job1 12/31/2021 07:41:08 PM
chitu job2 12/31/2021 08:14:54 PM
dilu job3 12/31/2021 08:15:23 PM
```
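
Ordering by start time means parsing the timestamps rather than comparing them as strings, because a 12-hour clock with AM/PM does not sort lexicographically. A small sketch, assuming rows shaped like the table above; the `jobs` list and the `start_key` helper are illustrative and not ΣΣ<sub>Job</sub> internals.

```python
from datetime import datetime

START_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. 12/31/2021 07:41:08 PM


def start_key(job):
    """Convert the Start column (third field) into a datetime for sorting."""
    return datetime.strptime(job[2], START_FORMAT)


jobs = [
    ("dilu", "job3", "12/31/2021 08:15:23 PM"),
    ("chitu", "job1", "12/31/2021 07:41:08 PM"),
    ("chitu", "job2", "12/31/2021 08:14:54 PM"),
]

# Print the rows in start-time order, padded like the sacct table.
for server, name, start in sorted(jobs, key=start_key):
    print(f"{server:<8} {name:<16} {start}")
```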
### `$ scancel jobname`
Cancel a running job.
- `jobname` : Job name.
## Installation
ΣΣ<sub>Job</sub> requires Python 3.7 or later. Install with `pip`:
```
$ pip install sumsjob
```
You also need to do the following:
- Make sure you can `ssh` to each server, ideally without typing a password, by using SSH keys.
- Install [gpustat](https://github.com/wookayin/gpustat) on each server.
- Create a configuration file at `~/.sumsjob/config.py`. Use [config.py](https://github.com/lululxvi/sumsjob/blob/master/sumsjob/config.py) as a template, and modify the values to match your setup.
- Make sure `~/.local/bin` is in your `$PATH`.
Then run `sinfo` to check if everything works.
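
If the shell cannot find `sinfo`, a likely cause is that the script directory is missing from `PATH`. The following sketch checks for that. It assumes a user-level install that places scripts in `~/.local/bin`, and `on_path` is a hypothetical helper written for this example.

```python
import os
import shutil


def on_path(directory, path=None):
    """Return True if directory is one of the entries in PATH."""
    path = os.environ.get("PATH", "") if path is None else path
    wanted = os.path.realpath(os.path.expanduser(directory))
    entries = [entry for entry in path.split(os.pathsep) if entry]
    return any(
        os.path.realpath(os.path.expanduser(entry)) == wanted for entry in entries
    )


print("~/.local/bin on PATH:", on_path("~/.local/bin"))
# shutil.which returns None when no executable named sinfo is found.
print("sinfo resolves to:", shutil.which("sinfo"))
```

On a machine that also has Slurm installed, the second line shows which `sinfo` takes precedence.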
## License
[GNU GPLv3](LICENSE)