jobspec

Name: jobspec
Version: 0.0.12
Home page: https://github.com/compspec/jobspec
Summary: Jobspec specification and translation layer for cluster work
Upload time: 2024-03-19 22:35:18
Maintainer: Vanessa Sochat
Author: Vanessa Sochat
Requires Python: None
License: LICENSE
Keywords: cluster, orchestration, transformer, jobspec, flux
# jobspec (python)

> Translation layer for a jobspec specification to cluster execution

[![PyPI version](https://badge.fury.io/py/jobspec.svg)](https://badge.fury.io/py/jobspec)

![https://github.com/compspec/jobspec/blob/main/img/jobspec-bot.png?raw=true](https://github.com/compspec/jobspec/blob/main/img/jobspec-bot.png?raw=true)

This library includes a cluster-agnostic language to set up a job (one unit of work in a jobspec).
It is a transformational layer - a simple language that expresses the steps needed to prepare a job
for a specific cluster's scheduler. If you think it looks too simple, then I'd say it's a success.

## Usage

A transformer provides one or more steps for the jobspec to be transformed and understood for a particular
execution environment.

### Steps

Steps include:

| Name   | Description |
|--------|-------------|
| write  | write a file in the staging directory |
| set    | a step to define a global setting  |
| copy   | copy a file into staging (currently just local) |
| submit | submit the job |
| batch  | submit the job with a batch command (more common in HPC) |
| auth   | authenticate with some service |

Note that for the above, we assume a shared filesystem unless the settings direct that this isn't the case.
These are the basic steps that @vsoch needs now for scheduling experiments, and more can be added (or tweaked) if needed.
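
To make the table concrete, here is a rough sketch of how a transformer might dispatch these steps by name. Every name in this sketch is illustrative, not the library's actual internals:

```python
# Rough sketch of dispatching steps by name; all names here are
# illustrative, not the library's actual internals (auth omitted).
import shutil
import subprocess
from pathlib import Path

def run_transform(transform, stage, settings):
    """Execute a list of transform steps against a staging directory."""
    for step in transform:
        name = step["step"]
        if name == "set":
            settings[step["key"]] = step["value"]
        elif name == "write":
            path = Path(stage) / step["filename"]
            path.write_text(step.get("content", ""))
            if step.get("executable"):
                path.chmod(0o755)
        elif name == "copy":
            shutil.copy(step["source"], Path(stage) / step["filename"])
        elif name in ("submit", "batch"):
            cmd = "batch" if name == "batch" else "submit"
            subprocess.run(["flux", cmd, str(Path(stage) / step["filename"])], check=True)
        else:
            raise ValueError(f"unsupported step: {name}")
```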

### Settings

Any "set" directive can be used to set a more global setting on the transform. For example:

 - stage: defines the staging directory. If not set, will be a temporary directory that is created
 - sharedfs: true or false to say that the filesystem is shared or not (defaults to false)

For `sharedfs` it would be ideal to have a setting that is specific to the transformer, but unfortunately this could be true or false
for flux, so it has to be set. But this might be an interesting compatibility thing to test.
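
In other words, the transform starts from defaults and any `set` step overrides them. A minimal sketch of that assumed behavior:

```python
import tempfile

# One "set" step overriding a default (sketch of assumed behavior)
transform = [{"step": "set", "key": "sharedfs", "value": True}]

settings = {"stage": None, "sharedfs": False}
for step in transform:
    if step["step"] == "set":
        settings[step["key"]] = step["value"]

# With no stage set, fall back to a freshly created temporary directory
if settings["stage"] is None:
    settings["stage"] = tempfile.mkdtemp()
```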

### Example

This example will assume receiving a Jobspec on a flux cluster.

#### 1. Start Flux

Start up the development environment to find yourself in a container with flux. Start a test instance:

```bash
flux start --test-size=4
```

Note that we have 4 faux nodes and 40 faux cores.

```bash
flux resource list
```
```console
     STATE NNODES   NCORES    NGPUS NODELIST
      free      4       40        0 194c2b9f4f3c,194c2b9f4f3c,194c2b9f4f3c,194c2b9f4f3c
 allocated      0        0        0
      down      0        0        0
```

Ensure you have jobspec installed! Yes, we are in VSCode, installing to the container, so we use sudo. YOLO.

```bash
sudo pip install -e .
```

#### 2. Command Line Examples

We are going to run the [examples/hello-world-jobspec.yaml](examples/hello-world-jobspec.yaml). This setup is overly
complex for the task - we don't actually need any staging or special work - but that is intentional for an example.
Also note that the design of this file is subject to change. For example, we don't have to include the transform directly in the
jobspec - it could be a file that the jobspec writes, with the command issued afterward. I like it better as a piece of the
jobspec, so am putting it there for the time being, mostly because it looks nicer. I'm sure someone will disagree with me about that.

```bash
# Example that runs without watching (waiting) for output
jobspec run ./examples/hello-world-jobspec.yaml

# Example that shows waiting for output
jobspec run ./examples/hello-world-wait-jobspec.yaml

# Example with batch using flux
jobspec run ./examples/hello-world-batch.yaml
```

Note that the default transformer is flux, so the above are equivalent to:

```bash
jobspec run -t flux ./examples/hello-world-wait-jobspec.yaml
jobspec run --transformer flux ./examples/hello-world-wait-jobspec.yaml
```

#### 3. Python Examples

It could also be the case that you want something running inside a lead broker instance to receive Jobspecs incrementally and then
run them. This Python example shows how to accomplish the same thing, but from within Python.

```bash
python3 ./examples/flux/receive-job.py
```
```console
$ python3 examples/flux/receive-job.py
=> step write                               OK
=> step submit    f7aChzM3u                 OK
=> step write                               OK
=> step submit    f7aDYuwMH                 OK
```
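
The example script itself is not reproduced here, but the general shape is to load a jobspec and hand it to the flux transformer from within Python. A hypothetical sketch (the real, working version is [examples/flux/receive-job.py](examples/flux/receive-job.py), and its API may differ):

```python
# Hypothetical sketch only - the function and module names here are
# assumed; see examples/flux/receive-job.py for the actual code.
from jobspec.transformer import get_transformer  # assumed helper

def receive(jobspec_file):
    transformer = get_transformer("flux")
    # Each step (write, submit, ...) prints its status as shown above
    transformer.run(jobspec_file)

receive("./examples/hello-world-jobspec.yaml")
```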

Just for fun (posterity) I briefly tried having emoji here:

![img/emoji.png](img/emoji.png)

### Details

As an example: although you *could* submit a job with a command ready to go - assuming your cluster already has the
software and files needed and you just want to run it - submitting to a cluster you haven't
set up on might require the following logic:

1. Write a script to file that is intended to install something.
2. Stage this file across nodes.
3. Submit the script to all nodes to do the install.
4. Write a script to file for your actual job.
5. Again, stage this file across nodes (assuming no shared filesystem).
6. Submit the job, either as a submit or batch directive to a workload manager.

The way that you do this with every workload manager (or cluster, more generally) is going to vary
quite a bit. However, with a transformation - a mapping of abstract steps to a specific cluster
workload manager - you can write those steps out very simply:

```yaml
transform:

  - step: write
    filename: install.sh
    executable: true

  - step: submit
    filename: install.sh
    wait: true

  - step: write
    filename: job.sh
    executable: true

  - step: submit
    filename: job.sh
    wait: true
```

The above assumes we have a shared filesystem, and that we have not set the stage manually with a step like this:

```yaml
- step: set
  key: stage
  value: /tmp/path-for-workflow
```

Without such a step, a temporary staging directory is created and used. If we didn't have a shared filesystem, we would need to provide that detail. It's really akin
to a subsystem detail, because a job that assumes a shared fs won't be compatible with a cluster that lacks one:

```yaml
- step: set
  key: sharedfs
  value: false
```

Whenever there is a copy (not shown), we assume the receiving cluster has some cluster-specific method for copying or
file mapping, even in the case without a shared filesystem. It could be ssh, a filemap, or something else.
For an ephemeral cluster API, it might be an interaction with a storage provider, or just adding the file to an API call that
will (in and of itself) do that creation, akin to a startup script for an instance in Terraform. It really doesn't matter -
the user can expect the file to be written and shared across nodes.
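
As a simplistic illustration, a copy step for a cluster with a shared filesystem might reduce to a local copy into the stage, while other transformers would swap in ssh, a filemap, or a storage API call. The names below are assumed, not the library's implementation:

```python
import shutil
from pathlib import Path

class CopyStep:
    """Sketch of a copy step for a shared-filesystem cluster (assumed
    names); other transformers might use ssh, a filemap, or a cloud
    storage API here instead."""

    name = "copy"

    def run(self, stage, source):
        # With a shared filesystem, staging across nodes is just a copy
        dest = Path(stage) / Path(source).name
        shutil.copy(source, dest)
        return dest
```
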
This is not intended to be a workflow or build tool -
it is simply a transformational layer that a jobspec can provide to set up a specific cluster environment. It works with a
jobspec in that you define your filenames (scripts) in the tasks->scripts directive. It also uses a plugin design, so a
cluster or institution can write a custom transformer to install, and it will be discovered
by name. This is intended to work with the prototype [rainbow](https://github.com/converged-computing/rainbow) scheduler.
Jobspec is an entity of [flux-framework](https://flux-framework.org).

### Frequently Asked Questions

#### Why not rely on Flux internals?

If we lived in a universe of just Flux, sure, we wouldn't need this. But the world is more than Flux, and we want to extend our Jobspec to that world.
So we want a Jobspec to be able to handle a transformation of some logic (the above) into an execution that might not involve Flux at all. It could be another workload manager (e.g., Slurm),
Kubernetes, or a service that submits to some cloud batch API.

#### What steps are allowed?

They are currently shown in the example above, and better documentation will be written. Arguably, a transformation backend does not
need to support every kind of step; however, if you provide a Jobspec to a transformer with a step it does not support, you'll get an error.

#### Where are the different transformers defined?

We currently have our primary (core) transformers here in [jobspec/transformer](jobspec/transformer). However, a registry that discovers jobspec-* named Python modules can allow out-of-tree install and use of a transformer. This use case anticipates clusters with custom or private logic that cannot be shared in a public GitHub repository.

#### How do you know this is a good idea?

I don't, and won't until I try it for experiments. I decided to try something like this after several days of preparing for experiments, and realizing that this transformation layer was entirely missing.

### Means of Interaction

There are several likely means of interacting with this library:

- As a service that runs at some frequency to receive jobs (written as a loop in Python in some context)
- As a cron job that does the same (an entry to crontab to run "jobspec" at some frequency)
- As a one-off run (a single run of the above)

For the example usage here, and since the project I am working on is concerned with Flux, we will start with the simplest case: a client running inside a Flux instance (meaning it can import flux) that reads in a jobspec with a section defining a set of transforms, and then issues the commands to stage the setup and uses Flux to run the work defined by the jobspec.
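
As a rough sketch, the service and cron styles of interaction are just a loop (or a single pass) around the same run call, with hypothetical names again:

```python
# Hypothetical polling loop for the "service" mode of interaction;
# the queue source and transformer API here are illustrative only.
import time

def poll_for_jobspecs(queue, transformer, interval=30):
    while True:
        for jobspec_file in queue.get_new():  # e.g., a watched directory
            transformer.run(jobspec_file)
        time.sleep(interval)
```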

## Developer

### Organization

While you can write an external transformer (as a plugin), a set of core transformers is provided here:

 - [jobspec/transformer](jobspec/transformer): core transformer classes that ship internally here.

### Writing a Transformer

For now, the easiest thing to do is add a single file (named by your transformer) to [jobspec/transformer](jobspec/transformer)
and follow the precedent in that file. A transformer is minimally a class with a name, a description, and some number of steps.
You can then use the provided steps in [jobspec/steps](jobspec/steps) or use the `StepBase` class to write your own. At the end of
your transformer file, you simply need to register the steps you want to use:

```python
# A transformer can register shared steps, or custom steps
Transformer.register_step(steps.WriterStep)
Transformer.register_step(batch)
Transformer.register_step(submit)
Transformer.register_step(stage)
```
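
Putting that together, a minimal transformer module might look like the following sketch. The base class name and attributes are assumed from the description above, not copied from the source:

```python
# Sketch of a minimal transformer module; the base class and step
# imports are assumed, so consult jobspec/transformer before copying.
import jobspec.steps as steps
from jobspec.transformer import TransformerBase  # assumed base class

class MyClusterTransformer(TransformerBase):
    """Transformer for a hypothetical 'mycluster' workload manager."""

    name = "mycluster"
    description = "Steps for the mycluster workload manager"

# Register the shared steps this transformer supports
MyClusterTransformer.register_step(steps.WriterStep)
```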

If there is a step you want the user to be able to define (but skip for your transformer, for whatever reason you might have),
just register the empty step with the name you want to skip. As an example, let's say my transformer has no concept of a stage
(sharing a file across separate nodes) given that it has a shared filesystem. I might want to do:

```python
import jobspec.steps as steps

# This will not fail validation as an unknown step, but will skip it
Transformer.register_step(steps.EmptyStep, name="stage")
```
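
And if you need custom behavior instead of skipping, the `StepBase` mentioned above can be subclassed. A sketch under the same assumptions (the actual `StepBase` interface may differ):

```python
# Sketch of a custom step; the StepBase interface is assumed, so check
# jobspec/steps for the real signature before copying this.
from jobspec.steps import StepBase  # assumed import path

class AuthStep(StepBase):
    """Authenticate with some external service before submission."""

    name = "auth"

    def run(self, stage, **kwargs):
        token = kwargs.get("token")
        if not token:
            raise ValueError("auth step requires a token")

# Register on your transformer class, as shown earlier:
# Transformer.register_step(AuthStep)
```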


## License

HPCIC DevTools is distributed under the terms of the MIT license.
All new contributions must be made under this license.

See [LICENSE](https://github.com/converged-computing/cloud-select/blob/main/LICENSE),
[COPYRIGHT](https://github.com/converged-computing/cloud-select/blob/main/COPYRIGHT), and
[NOTICE](https://github.com/converged-computing/cloud-select/blob/main/NOTICE) for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE-842614

            
