cloudmesh-ee

* Name: cloudmesh-ee
* Version: 5.0.5
* Summary: The cloudmesh compute coordinator
* Upload time: 2024-01-01 00:08:42
* Author: Gregor von Laszewski <laszewski@gmail.com>
* Requires Python: >=3.8
* License: Apache License, Version 2.0 (<http://www.apache.org/licenses/LICENSE-2.0>). Copyright 2021,2022 Gregor von Laszewski, University of Virginia.
* Keywords: helper library, cloudmesh
# Cloudmesh ee

A general purpose HPC Template and Experiment management system


## Background

High Performance Computing clusters (HPCs) are designed around a
timesharing principle and are powered by queue-based execution
ecosystems such as SchedMD's
[SLURM](https://slurm.schedmd.com/overview.html) and IBM's Platform
Load Sharing Facility
([LSF](https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=overview-lsf-introduction)).
While these ecosystems provide a great deal of control and extension
for planning, scheduling, and batching jobs, they are limited in their
ability to support parameterization in a scheduled task.  While there
are facilities in place to execute jobs as an array, permutation-based
experiments are limited to what you integrate into your own batch
script.  Even then, parameterized values are only made available as
environment variables, which can be limiting depending on your OS or
selected programming language.  In many cases, limitations set by the
deployment through the compute center also hinder optimal use, as
restrictions are placed on the duration and number of resources that
can be accessed in parallel.  In some cases these restrictions are so
established that removing them is impractical and takes weeks to
implement even on a temporary basis.

Cloudmesh Experiment Executor (ee) is a framework that wraps the SLURM
batch processor in a templated framework so that experiments can be
generated from configuration files.  It focuses on the lifecycle of
generating many permutations of experiments with standard tooling, so
that you can focus more on modeling your experiments than on
orchestrating them with tools.  A number of batch scripts can be
generated that can then be executed according to center policies.
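The workflow this enables can be sketched in a few lines of Python. The placeholder syntax and names below are illustrative only, not cloudmesh-ee's actual template format:

```python
# A hypothetical SLURM template with placeholders; cloudmesh-ee uses its
# own template syntax, so this is only a sketch of the idea.
template = """#!/bin/bash
#SBATCH --time={time}
#SBATCH --gres=gpu:{card_name}:{num_gpus}
python train.py
"""

# Values, e.g. from a configuration file, are substituted into the template.
config = {"time": "05:00:00", "card_name": "a100", "num_gpus": 1}
script = template.format(**config)
print(script)
```

Repeating the substitution once per parameter permutation yields the batch of scripts that can then be submitted to the scheduler.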

## Dependencies

When you install cloudmesh-ee, you will also be installing a
minimum baseline of the `cms` command (as part of the Cloudmesh
ecosystem).  For more details on Cloudmesh, see its documentation on
[read the docs](https://cloudmesh.github.io/cloudmesh-manual/). However,
all installation can be done through pip. After installation, you will
need to initialize cloudmesh with the command

```bash
$ cms help
```

While SLURM is not needed to run the `cloudmesh ee` command, the
generated output will not execute unless your system has SLURM installed
and you are able to run jobs via the `sbatch` command.

## Documentation

### Running Cloudmesh ee

The `cloudmesh ee` command takes one of two forms of execution.  It is started with 

```bash
$ cms ee <command> <parameters>
```

where the command invokes a particular action and is followed by a
number of parameters for that command.  These commands allow you to
inspect the generated output to confirm that your parameterization
functions as expected and as intended.

In general, configuration arguments that appear in multiple locations are
prioritized in the following order (highest priority first):

1. CLI Arguments with `cms ee`
2. Configuration Files
3. Preset values
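Conceptually, this precedence behaves like a layered dictionary merge in which later layers override earlier ones. The following sketch is illustrative only and does not reflect cloudmesh-ee's internal implementation:

```python
# Priority: preset values < configuration files < CLI arguments.
presets = {"time": "01:00:00", "num_gpus": 1}
config_files = {"time": "05:00:00"}   # e.g. loaded via --config=...
cli_arguments = {"num_gpus": 2}       # e.g. given as --attributes=num_gpus=2

# Later merges win, so CLI arguments end up with the highest priority.
settings = {**presets, **config_files, **cli_arguments}
print(settings)  # {'time': '05:00:00', 'num_gpus': 2}
```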

### Generating Experiments with the CLI

The `generate` command is used to generate your experiments based upon either a passed
configuration file, or via CLI arguments.  You can issue the command using
either of the below forms:

```text
cms ee generate SOURCE --name=NAME [--verbose] [--mode=MODE] [--config=CONFIG] [--attributes=PARAMS] [--out=DESTINATION] [--dryrun] [--noos] [--nocm] [--dir=DIR] [--experiment=EXPERIMENT]
cms ee generate --setup=FILE [SOURCE] [--verbose] [--mode=MODE]  [--config=CONFIG] [--attributes=PARAMS] [--out=DESTINATION] [--dryrun] [--noos] [--nocm] [--dir=DIR] [--experiment=EXPERIMENT] [--name=NAME]
```

If you have prepared a configuration file that conforms to the schema
defined in [Setup Config](#setup-config), then you can use the second
form which overrides the default values.

* `--name=NAME` - Supplies a name for this experiment.  Note that the name
  must not match any existing files or directories in the location where you
  are currently executing the command.
* `--verbose` - Enables additional logging useful when troubleshooting the
  program.
* `--mode=MODE` - specifies how the output should be generated.  One of: f,h,d.
  * `f` or `flat` - specifies a "flat" mode, where slurm scripts are generated in a flattened structure, all in one directory.
  * `h` or `hierarchical` - specifies a "hierarchical" mode, where experiments are nested into unique directories from each other.
  * `d` or `debug` - instructs the command to not generate any output.
* `--config=CONFIG` - specifies key-value pairs to be used across all files for substitution.  This can be a python, yaml, or json file.
* `--attributes=PARAMS` - specifies key-value pairs that can be listed at the command line and used as substitution across all experiments.  Note this command leverages [cloudmesh's parameter expansion specification](https://cloudmesh.github.io/cloudmesh-manual/autoapi/cloudmeshcommon/cloudmesh/common/parameter/index.html) for different types of expansion rules.
* `--out=DESTINATION` - specifies the directory to write the generated scripts out to.
* `--dryrun` - Runs the command without performing any operations.
* `--noos` - Prevents the interleaving of OS environment variables into the substitution logic.
* `--dir=DIR` - specifies the directory to write the generated scripts out to.
* `--experiment=EXPERIMENT` - specifies a listing of key-value parameters that establish a unique experiment for each combination of values (a Cartesian product across all values for each key).

* `--setup=FILE` - provides all the above configuration options within a configuration
  file to simplify executions.
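As a sketch of what the `--experiment` Cartesian product means, a specification such as `epoch=[1-3] x=[1,4] y=[10,11]` expands to 3 × 2 × 2 = 12 permutations. The real parsing is done by cloudmesh's parameter expansion; the pre-parsed dictionary below is only an illustration:

```python
from itertools import product

# Parsed form of --experiment="epoch=[1-3] x=[1,4] y=[10,11]":
# [1-3] denotes a range, [1,4] and [10,11] are explicit value lists.
experiment = {"epoch": [1, 2, 3], "x": [1, 4], "y": [10, 11]}

# One unique experiment per combination of values.
permutations = [dict(zip(experiment, values))
                for values in product(*experiment.values())]

print(len(permutations))  # 3 * 2 * 2 = 12
print(permutations[0])    # {'epoch': 1, 'x': 1, 'y': 10}
```

Each of the 12 dictionaries corresponds to one generated batch script, as shown in the example output later in this document.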

### Generating Submission Scripts

```text
ee generate submit --name=NAME [--verbose]
```

This command uses the output of the
[generate command](#generating-experiments-with-the-cli) and generates
a shell script that can be used to submit your previously generated
outputs to SLURM as a sequence of sbatch commands.

* `--name=NAME` - specifies the name used in the
  [generate command](#generating-experiments-with-the-cli).
  The generate command will inspect the `<NAME>.json` file and build the
  necessary commands to run all permutations that the cloudmesh ee
  command generated.

Note that this command only generates the script; you must run the
generated file in your shell for the commands to be issued to SLURM and
run your jobs.


**Sample YAML File**

This command requires a YAML file that is configured for the host and GPU.
The YAML file also points to the desired SLURM template.

```yaml
slurm_template: 'slurm_template.slurm'

ee_setup:
  <hostname>-<gpu>:
    - card_name: "a100"
    - time: "05:00:00"
    - num_cpus: 6
    - num_gpus: 1

  rivanna-v100:
    - card_name: "v100"
    - time: "06:00:00"
    - num_cpus: 6
    - num_gpus: 1

```
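Conceptually, the entry whose key matches the current `<hostname>-<gpu>` combination supplies the settings for that run. The sketch below uses a plain Python dictionary as a stand-in for the parsed YAML (the `rivanna-a100` key and the flattened dict shape are illustrative):

```python
# Stand-in for the parsed ee_setup section of the YAML file above,
# flattened to plain dictionaries for illustration.
ee_setup = {
    "rivanna-a100": {"card_name": "a100", "time": "05:00:00",
                     "num_cpus": 6, "num_gpus": 1},
    "rivanna-v100": {"card_name": "v100", "time": "06:00:00",
                     "num_cpus": 6, "num_gpus": 1},
}

# Select the settings for the current host/GPU combination.
hostname, gpu = "rivanna", "v100"
settings = ee_setup[f"{hostname}-{gpu}"]
print(settings["time"])  # 06:00:00
```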

Example:

```
cms ee slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4  --noos --dir=example --experiment="epoch=[1-3] x=[1,4] y=[10,11]"
ee slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment="epoch=[1-3] x=[1,4] y=[10,11]"
# ERROR: Importing python not yet implemented
epoch=1 x=1 y=10  sbatch example/slurm.sh
epoch=1 x=1 y=11  sbatch example/slurm.sh
epoch=1 x=4 y=10  sbatch example/slurm.sh
epoch=1 x=4 y=11  sbatch example/slurm.sh
epoch=2 x=1 y=10  sbatch example/slurm.sh
epoch=2 x=1 y=11  sbatch example/slurm.sh
epoch=2 x=4 y=10  sbatch example/slurm.sh
epoch=2 x=4 y=11  sbatch example/slurm.sh
epoch=3 x=1 y=10  sbatch example/slurm.sh
epoch=3 x=1 y=11  sbatch example/slurm.sh
epoch=3 x=4 y=10  sbatch example/slurm.sh
epoch=3 x=4 y=11  sbatch example/slurm.sh
Timer: 0.0022s Load: 0.0013s ee slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment="epoch=[1-3] x=[1,4] y=[10,11]"
```

## SLURM on a single computer (Ubuntu 20.04)

### Install

See <https://drtailor.medium.com/how-to-setup-slurm-on-ubuntu-20-04-for-single-node-work-scheduling-6cc909574365>.

This example assumes a machine with 32 processors (threads).

```bash
sudo apt update -y
sudo apt install slurmd slurmctld -y

sudo chmod 777 /etc/slurm-llnl

# make sure to use the HOSTNAME

sudo cat << EOF > /etc/slurm-llnl/slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=localcluster
SlurmctldHost=$HOSTNAME
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES # This machine has 128GB main memory
NodeName=$HOSTNAME CPUs=32 RealMemory=128762 State=UNKNOWN
PartitionName=local Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF

sudo chmod 755 /etc/slurm-llnl/
```

### Start
```
sudo systemctl start slurmctld
sudo systemctl start slurmd
# sudo scontrol update nodename=$HOSTNAME state=idle
sudo scontrol update nodename=$HOSTNAME state=resume
```

### Stop

```
sudo systemctl stop slurmd
sudo systemctl stop slurmctld
```

### Info

```
sinfo
sinfo -R
sinfo -a
```

### Job

Save into `gregor.slurm`:

```
#!/bin/bash

#SBATCH --job-name=gregors_test          # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=laszewski@gmail.com  # Where to send mail	
#SBATCH --ntasks=1                       # Run on a single CPU
####  XBATCH --mem=1gb                        # Job memory request
#SBATCH --time=00:05:00                  # Time limit hrs:min:sec
#SBATCH --output=gregors_test_%j.log     # Standard output and error log

pwd; hostname; date

echo "Gregors Test"
date
sleep 30
date
```

Run with 

```
sbatch gregor.slurm
watch -n 1 squeue
```

BUG

```
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2    LocalQ gregors_    green PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

```

### sbatch slurm management commands for localhost

start slurm daemons

```bash
cms ee slurm start
```

stop slurm daemons

```bash
cms ee slurm stop
```

BUG:

```bash
srun gregor.slurm

srun: Required node not available (down, drained or reserved)
srun: job 7 queued and waiting for resources
```

```
sudo scontrol update nodename=localhost state=POWER_UP

Valid states are: NoResp DRAIN FAIL FUTURE RESUME POWER_DOWN POWER_UP UNDRAIN

```

### Cheatsheet

* <https://slurm.schedmd.com/pdfs/summary.pdf>

## Acknowledgements

Continued work was in part funded by the NSF
CyberTraining: CIC: CyberTraining for Students and Technologies
from Generation Z with the award numbers 1829704 and 2200409.



## Manual Page

<!-- START-MANUAL -->
```
Command ee
==========

::

  Usage:
        ee generate submit --name=NAME [--job_type=JOB_TYPE] [--verbose]
        ee generate --source=SOURCE --name=NAME
                        [--out=OUT]
                        [--verbose]
                        [--mode=MODE]
                        [--config=CONFIG]
                        [--attributes=PARAMS]
                        [--output_dir=OUTPUT_DIR]
                        [--dryrun]
                        [--noos]
                        [--os=OS]
                        [--nocm]
                        [--source_dir=SOURCE_DIR]
                        [--experiment=EXPERIMENT]
                        [--flat]
                        [--copycode=CODE]
        ee list [DIRECTORY]
        ee slurm start
        ee slurm stop
        ee slurm info
        ee seq --yaml=YAML|--json=JSON

  Experiment Executor (ee) allows the creation of parameterized batch
  scripts. The initial support includes slurm, but we intend
  also to support LSF. Parameters can be specified on the
  commandline or in configuration files. Configuration files
  can be formulated as json, yaml, python, or jupyter
  notebooks.

  Parameters defined in this file are then used in the slurm
  batch script and substituted with their values. A special
  parameter called experiment defines a number of variables
  that are permuted on when used allowing multiple batch
  scripts to be defined easily to conduct parameter studies.

  Please note that the setup flag is deprecated and will in
  future versions be fully covered by just using the config
  file.

  Arguments:
      FILENAME       name of a slurm script generated with ee
      CONFIG_FILE    yaml file with configuration
      ACCOUNT        account name for host system
      SOURCE         name for input script slurm.in.sh, lsf.in.sh,
                     script.in.sh or similar
      PARAMS         parameter lists for experimentation
      GPU            name of gpu

  Options:
      -h                        help
      --copycode=CODE           a list including files and directories to be copied into the destination dir
      --config=CONFIG...        a list of comma separated configuration files in yaml or json format.
                                The endings must be .json or .yaml
      --type=JOB_TYPE           The method to generate submission scripts.
                                One of slurm, lsf. [default: slurm]
      --attributes=PARAMS       a list of comma separated attribute value pairs
                                to set parameters that are used. [default: None]
      --output_dir=OUTPUT_DIR   The directory where the result is written to
      --source_dir=SOURCE_DIR   location of the input directory [default: .]
      --account=ACCOUNT         TBD
      --gpu=GPU                 The name of the GPU. Typically k80, v100, a100, rtx3090, rtx3080
      --noos                    ignores environment variable substitution from the shell. This
                                can be helpful when debugging as the list is quite large
      --nocm                    cloudmesh has a variable dictionary built in. Any variable
                                referred to by cloudmesh.<name> is replaced from the
                                cloudmesh variables
      --experiment=EXPERIMENT   This specifies all parameters that are used to create
                                permutations of them.
                                They are comma separated key value pairs
      --mode=MODE               one of "debug", "hierarchical". One can also just
                                use "d", "h" [default: h]
      --name=NAME               Name of the experiment configuration file
      --os=OS                   Selected OS variables
      --flat                    produce flatdict
      --dryrun                  flag to do a dryrun and not create files and
                                directories [default: False]
      --verbose                 Print more information when executing [default: False]

  Description:

    > Examples:
    >
    > cms ee generate slurm.in.sh --verbose \\
    >     --config=a.py,b.json,c.yaml \\
    >     --attributes=a=1,b=4 \\
    >     --dryrun --noos --input_dir=example \\
    >     --experiment=\"epoch=[1-3] x=[1,4] y=[10,11]\" \\
    >     --name=a --mode=h
    >
    > cms ee generate slurm.in.sh \\
    >    --config=a.py,b.json,c.yaml \\
    >    --attributes=a=1,b=4 \\
    >    --noos \\
    >    --input_dir=example \\
    >    --experiment=\"epoch=[1-3] x=[1,4] y=[10,11]\" \\
    >    --name=a \\
    >    --mode=h
    >
    > cms ee generate slurm.in.sh --experiments-file=experiments.yaml --name=a
    >
    > cms ee generate submit --name=a

```
<!-- STOP-MANUAL -->

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "cloudmesh-ee",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Gregor von Laszewski <laszewski@gmail.com>",
    "keywords": "helper library,cloudmesh",
    "author": "",
    "author_email": "Gregor von Laszewski <laszewski@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/00/6b/024246c3bf3a7a5c6ba959249baca3b239a0c3ac7518ac8f21240f0ed6fa/cloudmesh-ee-5.0.5.tar.gz",
    "platform": null,
    "description": "# Cloudmesh ee\n\nA general purpose HPC Template and Experiment management system\n\n\n## Background\n\nHyper Performance Computation Clusters (HPCs) are designed around a\ntimesharing principle and are powered by queue-based execution\necosystems such as SchedMD's\n[SLURM](https://slurm.schedmd.com/overview.html) and IBM's Platform\nLoad Sharing Facility\n([LSF](https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=overview-lsf-introduction)).\nWhile these ecosystems provide a great deal of control and extension\nfor planning, scheduling, and batching jobs, they are limited in their\nability to support parameterization in a scheduled task.  While there\nare facilities in place to execute jobs on an Array, the ability to do\npermutation based experments are limited to what you integrate into\nyour own batch script.  Even then, parameterization of values are only\nmade availabile as environment variables, which can be limited\ndepending on your OS or selected programming language.  In many cases\nlimitations set by the deployment trhough the compute center also\nhinder optimal use while restrictions are placed on duration and\nnumber of parallel accessible resources. In some cases these\nrestrictions are soo established that removing them is impractical and\ntakes weks to implement on temporary basis.\n\nCloudmesh Experiment Executor (ee) is a framework that wraps the SLURM \nbatch processor into a templated framework such that experiments can \nbe generated based on configuration files focusing on the livecycle \nof generating many permutations of experiments with standard tooling, \nso that you can focus more on modeling your experiments than how to \norchestrate them with tools.  A number of batch scripts can be \ngenerated that than can be executed according to center policies.\n\n## Dependencies\n\nWhen you install cloudmesh-ee, you will also be installing a\nminimum baseline of the `cms` command (as part of the Cloudmesh\necosystem).  
For more details on Cloudmesh, see its documentation on\n[read the docs](https://cloudmesh.github.io/cloudmesh-manual/). However\nall instalation can be done thorugh pip. After instalation, you will\nneed to initialize cloudmesh with the command\n\n```bash\n$ cms help\n```\n\nWhile SLURM is not needed to run the `cloudmesh ee` command, the\ngenerated output will not execute unless your system has slurm installed\nand you are able to run jobs via the `slurm sbatch` command.\n\n## Documentation\n\n### Running Cloudmesh ee\n\nThe `cloudmesh ee` command takes one of two forms of execution.  It is started with \n\n```bash\n$ cms ee <command> <parameters>\n```\n\nWhere the command invokes a partiuclar action and parameters include a\nnumber of parameters for the command These commands allow you to\ninspect the generated output to confirm your parameterization\nfunctions as expected and as intended.\n\nIn general, configuration arguments that appear in multiple locations are\nprioritized in the following order (highest priority first)\n\n1. CLI Arguments with `cms ee`\n2. Configuration Files\n3. Preset values\n\n### Generating Experiments with the CLI\n\nThe `generate` command is used to generate your experiments based upon either a passed\nconfiguration file, or via CLI arguments.  
You can issue the command using\neither of the below forms:\n\n```text\ncms ee generate SOURCE --name=NAME [--verbose] [--mode=MODE] [--config=CONFIG] [--attributes=PARAMS] [--out=DESTINATION] [--dryrun] [--noos] [--nocm] [--dir=DIR] [--experiment=EXPERIMENT]\ncms ee generate --setup=FILE [SOURCE] [--verbose] [--mode=MODE]  [--config=CONFIG] [--attributes=PARAMS] [--out=DESTINATION] [--dryrun] [--noos] [--nocm] [--dir=DIR] [--experiment=EXPERIMENT] [--name=NAME]\n```\n\nIf you have prepared a configuration file that conforms to the schema\ndefined in [Setup Config](#setup-config), then you can use the second\nform which overrides the default values.\n\n* `--name=NAME` - Supplies a name for this experiment.  Note that the name\n  must not match any existing files or directories where you are currently\n  executing the command\n* `--verbose` - Enables additional logging useful when troubleshooting the\n  program.\n* `--mode=MODE` - specifies how the output should be generated.  One of: f,h,d.\n  * `f` or `flat` - specifies a \"flat\" mode, where slurm scripts are generated in a flattened structure, all in one directory.\n  * `h` or `hierarchical` - specifies a \"hierarchical\" mode, where experiments are nested into unique directories from each other.\n  * `d` or `debug` - instructs the command to not generate any output.\n* `--config=CONFIG` - specifies key-value pairs to be used across all files for substitution.  This can be a python, yaml, or json file.\n* `--attributes=PARAMS` - specifies key-value pairs that can be listed at the command line and used as substitution across all experiments.  
Note this command leverages [cloudmesh's parameter expansion specification](https://cloudmesh.github.io/cloudmesh-manual/autoapi/cloudmeshcommon/cloudmesh/common/parameter/index.html) for different types of expansion rules.\n* `--out=DESTINATION` - specifies the directory to write the generated scripts out to.\n* `--dryrun` - Runs the command without performing any operations\n* `--noos` - Prevents the interleaving of OS environemnt variables into the subsitution logic\n* `--dir=DIR` - specifies the directory to write the generated scripts out to.\n* `--experiment=EXPERIMENT` - specifies a listing of key-value parameters that establish a unique experiment for each combination of values (a cartisian product across all values for each key).\n\n* `--setup=FILE` - provides all the above configuration options within a configuration\n  file to simplify executions.\n\n### Form 2 - Generating Submission Scripts\n\n```text\nee generate submit --name=NAME [--verbose]\n```\n\nThis command uses the output of the\n[generate command](#command-1---generating-experiments) and generates\na shell script that can be used to submit your previously generated\noutputs to SLURM as a sequence of sbatch commands.\n\n* `--name=NAME` - specifies the name used in the\n  [generate command](#command-1---generating-experiments).\n  The generate command will inspect the `<NAME>.json` file and build the\n  necessary commands to run all permutations that the cloudmesh ee\n  command generated.\n\nNote that this command only generates the script, and you must run the\noutputted file in your shell for the commands to be issued to SLURM and\nrun your jobs.\n\n\n**Sample YAML File**\n\nThis command requires a YAML file which is configured for the host and gpu.\nThe YAML file also points to the desired slurm template.\n\n```yaml\nslurm_template: 'slurm_template.slurm'\n\nee_setup:\n  <hostname>-<gpu>:\n    - card_name: \"a100\"\n    - time: \"05:00:00\"\n    - num_cpus: 6\n    - num_gpus: 1\n\n  
rivanna-v100:\n    - card_name: \"v100\"\n    - time: \"06:00:00\"\n    - num_cpus: 6\n    - num_gpus: 1\n\n```\n\nexample:\n\n```\ncms ee slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4  --noos --dir=example --experiment=\\\"epoch=[1-3] x=[1,4] y=[10,11]\\\"\nee slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment=\"epoch=[1-3] x=[1,4] y=[10,11]\"\n# ERROR: Importing python not yet implemented\nepoch=1 x=1 y=10  sbatch example/slurm.sh\nepoch=1 x=1 y=11  sbatch example/slurm.sh\nepoch=1 x=4 y=10  sbatch example/slurm.sh\nepoch=1 x=4 y=11  sbatch example/slurm.sh\nepoch=2 x=1 y=10  sbatch example/slurm.sh\nepoch=2 x=1 y=11  sbatch example/slurm.sh\nepoch=2 x=4 y=10  sbatch example/slurm.sh\nepoch=2 x=4 y=11  sbatch example/slurm.sh\nepoch=3 x=1 y=10  sbatch example/slurm.sh\nepoch=3 x=1 y=11  sbatch example/slurm.sh\nepoch=3 x=4 y=10  sbatch example/slurm.sh\nepoch=3 x=4 y=11  sbatch example/slurm.sh\nTimer: 0.0022s Load: 0.0013s ee slurm.in.sh --config=a.py,b.json,c.yaml --attributes=a=1,b=4 --noos --dir=example --experiment=\"epoch=[1-3] x=[1,4] y=[10,11]\"\n```\n\n## Slurm on a single computer ubuntu 20.04\n\n### Install \n\nsee https://drtailor.medium.com/how-to-setup-slurm-on-ubuntu-20-04-for-single-node-work-scheduling-6cc909574365\n\n32 Processors (threads)\n\n```bash\nsudo apt update -y\nsudo apt install slurmd slurmctld -y\n\nsudo chmod 777 /etc/slurm-llnl\n\n# make sure to use the HOSTNAME\n\nsudo cat << EOF > /etc/slurm-llnl/slurm.conf\n# slurm.conf file generated by configurator.html.\n# Put this file on all nodes of your cluster.\n# See the slurm.conf man page for more 
information.\n#\nClusterName=localcluster\nSlurmctldHost=$HOSTNAME\nMpiDefault=none\nProctrackType=proctrack/linuxproc\nReturnToService=2\nSlurmctldPidFile=/var/run/slurmctld.pid\nSlurmctldPort=6817\nSlurmdPidFile=/var/run/slurmd.pid\nSlurmdPort=6818\nSlurmdSpoolDir=/var/lib/slurm-llnl/slurmd\nSlurmUser=slurm\nStateSaveLocation=/var/lib/slurm-llnl/slurmctld\nSwitchType=switch/none\nTaskPlugin=task/none\n#\n# TIMERS\nInactiveLimit=0\nKillWait=30\nMinJobAge=300\nSlurmctldTimeout=120\nSlurmdTimeout=300\nWaittime=0\n# SCHEDULING\nSchedulerType=sched/backfill\nSelectType=select/cons_tres\nSelectTypeParameters=CR_Core\n#\n#AccountingStoragePort=\nAccountingStorageType=accounting_storage/none\nJobCompType=jobcomp/none\nJobAcctGatherFrequency=30\nJobAcctGatherType=jobacct_gather/none\nSlurmctldDebug=info\nSlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log\nSlurmdDebug=info\nSlurmdLogFile=/var/log/slurm-llnl/slurmd.log\n#\n# COMPUTE NODES # THis machine has 128GB main memory\nNodeName=$HOSTNAME CPUs=32 RealMemory==128762 State=UNKNOWN\nPartitionName=local Nodes=ALL Default=YES MaxTime=INFINITE State=UP\nEOF\n\nsudo chmod 755 /etc/slurm-llnl/\n```\n\n### Start\n```\nsudo systemctl start slurmctld\nsudo systemctl start slurmd\n# sudo scontrol update nodename=$HOSTNAME state=idle\nsudo scontrol update nodename=$HOSTNAME state=resume\n```\n\n### Stop\n\n```\nsudo systemctl stop slurmd\nsudo systemctl stop slurmctld\n```\n\n### Info\n\n```\nsinfo\nsinfo -R\nsinfo -a\n```\n\n### Job\n\nsave into gregor.slurm\n\n```\n#!/bin/bash\n\n#SBATCH --job-name=gregors_test          # Job name\n#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)\n#SBATCH --mail-user=laszewski@gmail.com  # Where to send mail\t\n#SBATCH --ntasks=1                       # Run on a single CPU\n####  XBATCH --mem=1gb                        # Job memory request\n#SBATCH --time=00:05:00                  # Time limit hrs:min:sec\n#SBATCH --output=sgregors_test_%j.log    # Standard 
output and error log

pwd; hostname; date

echo "Gregors Test"
date
sleep 30
date
```

Run with

```
sbatch gregor.slurm
watch -n 1 squeue
```

BUG

```
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    2    LocalQ gregors_    green PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
```

### sbatch slurm management commands for localhost

Start the slurm daemons:

```bash
cms ee slurm start
```

Stop the slurm daemons:

```bash
cms ee slurm stop
```

BUG:

```bash
srun gregor.slurm

srun: Required node not available (down, drained or reserved)
srun: job 7 queued and waiting for resources
```

```
sudo scontrol update nodename=localhost state=POWER_UP

Valid states are: NoResp DRAIN FAIL FUTURE RESUME POWER_DOWN POWER_UP UNDRAIN
```

### Cheatsheet

* <https://slurm.schedmd.com/pdfs/summary.pdf>

## Acknowledgements

Continued work was in part funded by the NSF
CyberTraining: CIC: CyberTraining for Students and Technologies
from Generation Z with the award numbers 1829704 and 2200409.

## Manual Page

<!-- START-MANUAL -->
```
Command ee
==========

::

  Usage:
        ee generate submit --name=NAME [--job_type=JOB_TYPE] [--verbose]
        ee generate --source=SOURCE --name=NAME
                        [--out=OUT]
                        [--verbose]
                        [--mode=MODE]
                        [--config=CONFIG]
                        [--attributes=PARAMS]
                        [--output_dir=OUTPUT_DIR]
                        [--dryrun]
                        [--noos]
                        [--os=OS]
                        [--nocm]
                        [--source_dir=SOURCE_DIR]
                        [--experiment=EXPERIMENT]
                        [--flat]
                        [--copycode=CODE]
        ee list [DIRECTORY]
        ee slurm start
        ee slurm stop
        ee slurm info
        ee seq --yaml=YAML|--json=JSON

  The Experiment Executor (ee) allows the creation of parameterized
  batch scripts. The initial support includes slurm, but we intend
  also to support LSF. Parameters can be specified on the
  command line or in configuration files. Configuration files
  can be formulated as json, yaml, python, or jupyter
  notebooks.

  Parameters defined in this file are then used in the slurm
  batch script and substituted with their values. A special
  parameter called experiment defines a number of variables
  that are permuted on when used, allowing multiple batch
  scripts to be defined easily to conduct parameter studies.

  Please note that the setup flag is deprecated; in future
  versions its functionality is fully covered by using the
  config file.

  Arguments:
      FILENAME       name of a slurm script generated with ee
      CONFIG_FILE    yaml file with configuration
      ACCOUNT        account name for host system
      SOURCE         name for input script slurm.in.sh, lsf.in.sh,
                     script.in.sh or similar
      PARAMS         parameter lists for experimentation
      GPU            name of gpu

  Options:
      -h                        help
      --copycode=CODE           a list including files and directories to be copied into the destination dir
      --config=CONFIG...        a list of comma separated configuration files in yaml or json format.
                                The endings must be .json or .yaml
      --type=JOB_TYPE           The method to generate submission scripts.
                                One of slurm, lsf. [default: slurm]
      --attributes=PARAMS       a list of comma separated attribute value pairs
                                to set parameters that are used. [default: None]
      --output_dir=OUTPUT_DIR   The directory where the result is written to
      --source_dir=SOURCE_DIR   location of the input directory [default: .]
      --account=ACCOUNT         TBD
      --gpu=GPU                 The name of the GPU. Typically k80, v100, a100, rtx3090, rtx3080
      --noos                    ignores environment variable substitution from the shell. This
                                can be helpful when debugging as the list is quite large
      --nocm                    cloudmesh has a variable dictionary built in. Any variable
                                referred to by cloudmesh.<name> is replaced from the
                                cloudmesh variables
      --experiment=EXPERIMENT   This specifies all parameters that are used to create
                                permutations of them.
                                They are comma separated key value pairs
      --mode=MODE               one of "debug", "hierarchical". One can also just
                                use "d", "h" [default: h]
      --name=NAME               Name of the experiment configuration file
      --os=OS                   Selected OS variables
      --flat                    produce flatdict
      --dryrun                  flag to do a dryrun and not create files and
                                directories [default: False]
      --verbose                 Print more information when executing [default: False]

  Description:

    > Examples:
    >
    > cms ee generate slurm.in.sh --verbose \\
    >     --config=a.py,b.json,c.yaml \\
    >     --attributes=a=1,b=4 \\
    >     --dryrun --noos --input_dir=example \\
    >     --experiment="epoch=[1-3] x=[1,4] y=[10,11]" \\
    >     --name=a --mode=h
    >
    > cms ee generate slurm.in.sh \\
    >    --config=a.py,b.json,c.yaml \\
    >    --attributes=a=1,b=4 \\
    >    --noos \\
    >    --input_dir=example \\
    >    --experiment="epoch=[1-3] x=[1,4] y=[10,11]" \\
    >    --name=a \\
    >    --mode=h
    >
    > cms ee generate slurm.in.sh --experiments-file=experiments.yaml --name=a
    >
    > cms ee generate submit --name=a

```
<!-- STOP-MANUAL -->
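The `--experiment` flag expands every key's value list into the cross product of all parameter combinations, generating one batch script per permutation. The expansion can be sketched with `itertools.product`; this is an illustrative sketch of the idea, not cloudmesh-ee's actual implementation, and the parameter names `epoch`, `x`, and `y` follow the example above:

```python
from itertools import product

# Experiment parameters as in --experiment="epoch=[1-3] x=[1,4] y=[10,11]".
# Note: [1-3] denotes a range (1, 2, 3), while [1,4] and [10,11] are value lists.
experiment = {
    "epoch": [1, 2, 3],
    "x": [1, 4],
    "y": [10, 11],
}

# Build one parameter set per permutation; each set would drive the
# substitution in its own generated batch script.
keys = list(experiment)
permutations = [dict(zip(keys, values)) for values in product(*experiment.values())]

print(len(permutations))   # 3 * 2 * 2 = 12 permutations
print(permutations[0])     # {'epoch': 1, 'x': 1, 'y': 10}
```

With `--mode=h` (hierarchical), each such parameter set maps to its own subdirectory in the output directory.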
    "bugtrack_url": null,
    "license": "Apache License Version 2.0, January 2004 http://www.apache.org/licenses/  Copyright 2021,2022 Gregor von Laszewski, University of Virginia  Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at  http://www.apache.org/licenses/LICENSE-2.0  Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ",
    "summary": "The cloudmesh compute coordinator",
    "version": "5.0.5",
    "project_urls": {
        "Changelog": "https://github.com/cloudmesh/cloudmesh-ee/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/cloudmesh/cloudmesh-ee/blob/main/README.md",
        "Homepage": "https://github.com/cloudmesh/cloudmesh-ee",
        "Issues": "https://github.com/cloudmesh/cloudmesh-ee/issues",
        "Repository": "https://github.com/cloudmesh/cloudmesh-ee.git"
    },
    "split_keywords": [
        "helper library",
        "cloudmesh"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "85a225ef727fa136c215b44d0b5084b660c57d857ad73a9a4601b402900131bc",
                "md5": "19aa5f1ed177c465eecabf128911a9ea",
                "sha256": "fba03ed48092448dee889f996a6dec530bdbe98bc58d4e4284eb61454cc2a7d9"
            },
            "downloads": -1,
            "filename": "cloudmesh_ee-5.0.5-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "19aa5f1ed177c465eecabf128911a9ea",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.8",
            "size": 25385,
            "upload_time": "2024-01-01T00:08:40",
            "upload_time_iso_8601": "2024-01-01T00:08:40.455514Z",
            "url": "https://files.pythonhosted.org/packages/85/a2/25ef727fa136c215b44d0b5084b660c57d857ad73a9a4601b402900131bc/cloudmesh_ee-5.0.5-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "006b024246c3bf3a7a5c6ba959249baca3b239a0c3ac7518ac8f21240f0ed6fa",
                "md5": "453bad28935cd6261011f42a0d19b0a2",
                "sha256": "21ed17c897e2f262af789f2445e6eb63e3fa6bf63bda623fbc81f7ef621154bf"
            },
            "downloads": -1,
            "filename": "cloudmesh-ee-5.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "453bad28935cd6261011f42a0d19b0a2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 30531,
            "upload_time": "2024-01-01T00:08:42",
            "upload_time_iso_8601": "2024-01-01T00:08:42.675005Z",
            "url": "https://files.pythonhosted.org/packages/00/6b/024246c3bf3a7a5c6ba959249baca3b239a0c3ac7518ac8f21240f0ed6fa/cloudmesh-ee-5.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-01 00:08:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "cloudmesh",
    "github_project": "cloudmesh-ee",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "cloudmesh-ee"
}
        