slacc

Name: slacc
Version: 1.0.1
Summary: Easily control SLURM from python on the VACC.
Author email: lfrati <lfrati.github@gmail.com>
Homepage: https://github.com/lfrati/slacc
Requires Python: >=3.9
License: MIT
Keywords: slurm, hpc
Upload time: 2023-05-03 17:12:33
            <p align="center">
    <img width="250" alt="Logo" src="https://user-images.githubusercontent.com/3115640/235985755-b6473e9f-e997-46a4-ac5b-b1333ab35470.png">
</p>

# SLurm on vACC (slacc)
> "Take it easy and let the VACC work for you." - the wise grad

Isn't it fun to have a nice GPU in your computer and to experiment left and right? You just write some python scripts
and then launch them on your machine, wait, repeat. Life is simple and beautiful.

But then inevitably it happens. You want to launch more scripts, longer scripts, test different parameters... 

So you have to go beg the SLURM gods to assist you. Evil bash scripts start popping up everywhere, requesting resources, launching jobs...

Wouldn't it be nice to just abstract that pain away and remain in python land?

## Prerequisites

- Python >=3.9

That's it.

## Installation
### Step 1. Get the thing.
```bash
pip install slacc
```
>Note: as of May 2023 the `spack` software on the VACC doesn't like Python 3.10, so it is suggested to install `slacc` in a 3.9 environment.
>If you are using `conda` (recommended) you can grab an installer for 3.9 from the [miniconda](https://docs.conda.io/en/latest/miniconda.html) page.
>
><img src="https://user-images.githubusercontent.com/3115640/235952021-489cc26a-d153-46be-89d4-3e3500ca1ac1.png" width="600"/>
>
>This will make your conda `base` environment use Python 3.9.
>
>Alternatively, you can create a custom environment with `conda create -n <pick_a_name> python=3.9` and run `pip install slacc` in this new environment.
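
Putting the note together, a fresh setup might look like this (a sketch; `slacc-env` is just an example environment name):
```shell
# create and activate a Python 3.9 environment, then install slacc into it
conda create -n slacc-env python=3.9
conda activate slacc-env
pip install slacc
```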

### Step 2. Customize the thing.
After installing `slacc` (it should be quick since there are no dependencies) run 
```bash
sconfig
```
This will copy the default `config.json` file that ships with `slacc` to `$HOME/.config/slacc/config.json`.
Feel free to customize it to your needs, or keep it as it is. The most important thing is to make sure that the conda environments declared in `config.json`
```
{
  "dggpu": {
    "env": "conda activate dgenv",
    ...
  }
}
```
match the ones available on your system.
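
A quick way to verify is `conda env list` (a standard conda command, not part of slacc); e.g. for the `dgenv` snippet above:
```shell
conda env list           # lists the environments available on this machine
# if 'dgenv' is not in the list, create it (or point "env" at one that exists)
conda create -n dgenv python=3.9
```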

# How does it work?

This package provides two main commands you can use from your CLI: `slaunch` and `sinteract` (plus `sconfig`, which makes a copy of the default configs).

## 1. Slurm LAUNCH (slaunch)

This is a wrapper around SLURM's [sbatch](https://slurm.schedmd.com/sbatch.html) to make it way easier to launch python scripts asynchronously.

Let's say that you wanted to run this locally:
```shell
python train.py --lr=0.01
```
then to run the same thing on the VACC you would do:
```shell
slaunch dggpu train.py --lr=0.01
```
_Voilà!_ Behind the scenes, the launcher creates an sbatch script for you like this:
```bash
#!/bin/bash
#SBATCH --time=2-00:00:00
#SBATCH --account=myproject
#SBATCH --partition=big-gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=8000

conda activate dgenv
python train.py --lr=0.01
```
Even better, it makes it easy to launch parameter sweeps as a job array.

### Basic Usage

> :warning: Before using, you will want to edit the resource configs to suit your needs, as described in the
> [Resource Configurations](#resource-configurations) section.

It has the following syntax:
```
slaunch [-h] [--runs RUNS] [--argfile ARGFILE] [-d RUNDIR] [-f] RESOURCE SCRIPT [FLAGS...]
```
where `RESOURCE` is the name of an entry in `config.json` (see [Resource Configurations](#resource-configurations)).

For example, this:
```shell
slaunch dggpu --runs 2 dummy_gpujob.py --epochs=10 --seed 42
```
is equivalent to running this twice:
```shell
python dummy_gpujob.py --epochs=10 --seed 42
```

Furthermore, any settings accepted by `sbatch` may be supplied on the command line. These will override the options
provided by `config.json`, if given. For example, to add more memory to an existing config:
```
slaunch dggpu --mem=64G dummy_gpujob.py
```
**NOTE:** The `sbatch` options must be supplied in `--name=value` fashion, with an
equals sign; `--name value` will *not* parse correctly. For any other options
(including script flags) you may use either format.
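
To make the parsing rule concrete (a sketch; `train.py` and its flags are placeholders):
```shell
slaunch dggpu --mem=64G train.py --lr 0.01   # OK: sbatch option uses '='; script flags may use spaces
slaunch dggpu --mem 64G train.py --lr 0.01   # WRONG: '--mem 64G' is not recognized as an sbatch option
```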

### Run Directory

A recommended way to run jobs is to isolate different experiments to separate folders, so that all the related inputs
and outputs can be stored in one place. This can be done with the `-d/--rundir` option:
```shell
slaunch dggpu -d experiments/my-next-breakthrough train.py --config train-config.yml
```
In this example, all experiments are stored in the corresponding repository, under the `experiments/` folder. The script
runs in this folder, where it expects to find a `train-config.yml` file. Say the script also generates a
`trained_models/` folder. After running, the experiment folder will contain:
```
__ experiments/my-next-breakthrough/
  |__ train-config.yml
  |__ slurm-12345678.out
  |__ trained_models/
     |__ model-1000.pth
     |__ model-2000.pth
     |__ ...
```
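
A minimal workflow for this setup might be (a sketch; whether `slaunch` creates the run directory for you is not specified, so it is created explicitly here):
```shell
mkdir -p experiments/my-next-breakthrough             # the run directory
cp train-config.yml experiments/my-next-breakthrough  # inputs the script expects to find there
slaunch dggpu -d experiments/my-next-breakthrough train.py --config train-config.yml
```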

### Running Job Arrays

`slaunch` can also run a full sweep of jobs as a single job array.

> :warning: **Careful with program outputs when using this method!**
> For instance, if you are running a program that outputs trained models, you will need to supply each run with a
> separate output folder so they don't overwrite each other.

You can run the same exact job N times via `-r/--runs`:
```shell
slaunch bdgpu --runs 10 eval.py --lr 0.01 --num-steps 10 --plots
```

Or you can run a sweep over different configurations, by providing each configuration as a separate line in an
"argfile":
```shell
slaunch bdgpu --argfile sweep-args.txt eval.py --plots
```
Where the argfile looks something like this:
```
--lr 0.1  --num-steps 10 -o outfile1
--lr 0.1  --num-steps 15 -o outfile2
--lr 0.03 --num-steps 10 -o outfile3
--lr 0.03 --num-steps 15 -o outfile4
--lr 0.01 --num-steps 10 -o outfile5
--lr 0.01 --num-steps 15 -o outfile6
```
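
For larger sweeps you can generate the argfile with a small shell loop instead of writing it by hand; this sketch reproduces the six lines above:
```shell
# write one line per (lr, num-steps) combination, numbering the output files
i=1
for lr in 0.1 0.03 0.01; do
  for steps in 10 15; do
    echo "--lr $lr --num-steps $steps -o outfile$i" >> sweep-args.txt
    i=$((i + 1))
  done
done
```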

In both cases, these will be launched as a [job array](https://slurm.schedmd.com/job_array.html), making it easier to
track and manage the jobs as a single unit.
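
Once submitted, the array behaves like any other SLURM job array, so the usual SLURM tools apply (these are plain SLURM commands, not part of slacc; `12345678` is a placeholder job ID):
```shell
squeue -u $USER      # list your queued/running jobs, array tasks included
scancel 12345678     # cancel the entire array
scancel 12345678_3   # cancel only task 3 of the array
```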

## 2. Slurm INTERACTive (sinteract)

This is a wrapper around `srun` that allows you to easily start an interactive shell on one of the SLURM nodes.  The
shell you launch will be granted the resources of the [resource config](#resource-configurations) you provide.

Example:
```shell
sinteract bdgpu
```
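
Conceptually, this expands to an `srun ... --pty bash` invocation built from the config's `resources` (a sketch, not slacc's exact command; the flag values are borrowed from the `bigcpu` example in the next section):
```shell
srun --time=1-00:00:00 --partition=bluemoon --cpus-per-task=1 --mem=8000 --pty bash
```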

## [Resource Configurations](config.json)

The `config.json` file provides a list of pre-defined resource configurations which the user can use to launch their
SLURM jobs. This is helpful to save sets of `sbatch` or `srun` options that the user uses frequently. **You will want
to edit these to add your own configurations which are suitable for your common tasks.** However, if you need to change
minor things like the amount of memory from job to job, you can always adjust that on the command line.

Each entry has the following structure:
```
{
  NAME:
    ENV
    RESOURCES
}
```

- `NAME`: a unique identifier for the config.
- `ENV`: specifies how to set up virtual environments, if needed.
- `RESOURCES`: specifies the options to pass to SLURM.

Here is a concrete example:
```json
{
  "dggpu": {
    "env": "conda activate dgenv",
    "resources": {
      "time": "2-00:00:00",
      "partition": "big-gpu",
      "cpus-per-task": 1,
      "gres": "gpu:v100:1",
      "mem": "8000"
    }
  },
  "bigcpu": {
    "env": "conda activate myenv",
    "device": "cpu",
    "resources": {
      "time": "1-00:00:00",
      "partition": "bluemoon",
      "cpus-per-task": 1,
      "mem": "8000"
    }
  }
}
```
A few default configurations are provided as part of the package [config.json](src/slacc/config.json).

:warning: There are 3 places where slacc looks for configuration files. If the same resource is defined in multiple places, only the one with highest priority is considered:
1. (LOW PRIO) [Defaults](src/slacc/config.json) provided by slacc (use this if you are happy with the defaults and don't want to change anything)
2. (MED PRIO) $HOME/.config/slacc/config.json (use this if you want to create custom configurations that you are planning to re-use)
3. (MAX PRIO) The directory containing the job script, e.g. launching `~/scratch/agi_net/train.py` looks for `~/scratch/agi_net/config.json` (use this if you want each individual run to use a different configuration)

:hammer_and_wrench: Use `sconfig` to copy the default settings to $HOME/.config/slacc/config.json.
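
Putting the priority order together, a typical customization flow might look like this (a sketch; `mycustom` is a hypothetical entry name):
```shell
sconfig                               # copy the package defaults to ~/.config/slacc/config.json
$EDITOR ~/.config/slacc/config.json   # add a reusable "mycustom" entry
slaunch mycustom train.py             # resolves from your user config (MED PRIO)
# a config.json placed next to train.py would override it (MAX PRIO)
```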


            
