dsub

Name: dsub
Version: 0.4.11
Summary: A command-line tool that makes it easy to submit and run batch scripts in the cloud
Home page: https://github.com/DataBiosphere/dsub
Author: Verily
License: Apache
Requires Python: >=3.7
Keywords: cloud, bioinformatics
Upload time: 2024-05-06 21:55:26

# dsub: simple batch jobs with Docker
[![License](https://img.shields.io/badge/license-Apache%202.0-brightgreen.svg)](https://github.com/DataBiosphere/dsub/blob/main/LICENSE)

## Overview

`dsub` is a command-line tool that makes it easy to submit and run batch scripts
in the cloud.

The `dsub` user experience is modeled after traditional high-performance
computing job schedulers like Grid Engine and Slurm. You write a script and
then submit it to a job scheduler from a shell prompt on your local machine.

Today `dsub` supports Google Cloud as the backend batch job runner, along with a
local provider for development and testing. With help from the community, we'd
like to add other backends, such as Grid Engine, Slurm, Amazon Batch,
and Azure Batch.

## Getting started

`dsub` is written in Python and requires Python 3.7 or higher.

* The last version to support Python 3.6 was `dsub` [0.4.7](https://github.com/DataBiosphere/dsub/releases/tag/v0.4.7).
* For earlier versions of Python 3, use `dsub` [0.4.1](https://github.com/DataBiosphere/dsub/releases/tag/v0.4.1).
* For Python 2, use `dsub` [0.3.10](https://github.com/DataBiosphere/dsub/releases/tag/v0.3.10).

### Pre-installation steps

#### Create a Python virtual environment

This is optional, but whether installing from PyPI or from github,
you are strongly encouraged to use a
[Python virtual environment](https://docs.python.org/3/library/venv.html).

You can do this in a directory of your choosing.

        python3 -m venv dsub_libs
        source dsub_libs/bin/activate

Using a Python virtual environment isolates `dsub` library dependencies from
other Python applications on your system.

Activate this virtual environment in any shell session before running `dsub`.
To deactivate the virtual environment in your shell, run the command:

        deactivate

Alternatively, convenience scripts are provided that activate the
virtualenv before calling `dsub`, `dstat`, and `ddel`. They are in the
[bin](https://github.com/DataBiosphere/dsub/tree/main/bin) directory. You can
use these scripts if you don't want to activate the virtualenv explicitly in
your shell.
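
If you use these wrapper scripts, a session might look like the following.
This is a hedged sketch: it assumes you have cloned the repository, created the
virtual environment as shown above, and that the wrappers share the names of
the tools they call.

        # From the root of a cloned dsub repository (illustrative):
        ./bin/dsub --help
        ./bin/dstat --help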

#### Install the Google Cloud SDK

Although the Google Cloud SDK is not used directly by `dsub` for the `google-v2`
or `google-cls-v2` providers, you are likely to want the command-line tools
found in the [Google Cloud SDK](https://cloud.google.com/sdk/).

If you will be using the `local` provider for faster job development,
you *will* need to install the Google Cloud SDK: the `local` provider uses
`gsutil` to ensure file operation semantics consistent with the Google `dsub`
providers.

1. [Install the Google Cloud SDK](https://cloud.google.com/sdk/)
2. Run

        gcloud init


    `gcloud` will prompt you to set your default project and to grant
    credentials to the Google Cloud SDK.

### Install `dsub`

Choose **one** of the following:

#### Install from PyPI

1.  If necessary, [install pip](https://pip.pypa.io/en/stable/installing/).

1.  Install `dsub`

         pip install dsub

#### Install from github

1.  Be sure you have git installed.

    Instructions for your environment can be found on the
    [git website](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).

1.  Clone this repository.

        git clone https://github.com/DataBiosphere/dsub
        cd dsub

1.  Install dsub (this will also install the dependencies)

        python -m pip install .

1.  Set up Bash tab completion (optional).

        source bash_tab_complete

### Post-installation steps

1.  Minimally verify the installation by running:

        dsub --help

1.  (Optional) [Install Docker](https://docs.docker.com/engine/installation/).

    This is necessary only if you're going to create your own Docker images or
    use the `local` provider.

### Makefile

After cloning the dsub repo, you can also use the
[Makefile](https://github.com/DataBiosphere/dsub/blob/main/Makefile)
by running:

        make

This will create a Python virtual environment and install `dsub` into a
directory named `dsub_libs`.

### Getting started with the local provider

We think you'll find the `local` provider to be very helpful when building
your `dsub` tasks. Instead of submitting a request to run your command on a
cloud VM, the `local` provider runs your `dsub` tasks on your local machine.

The `local` provider is not designed for running at scale. It is designed
to emulate running on a cloud VM such that you can rapidly iterate.
You'll get quicker turnaround times and won't incur cloud charges using it.

1. Run a `dsub` job and wait for completion.

    Here is a very simple "Hello World" test:

        dsub \
          --provider local \
          --logging "${TMPDIR:-/tmp}/dsub-test/logging/" \
          --output OUT="${TMPDIR:-/tmp}/dsub-test/output/out.txt" \
          --command 'echo "Hello World" > "${OUT}"' \
          --wait

    Note: `TMPDIR` is commonly set to `/tmp` by default on most Unix systems,
    although it is also often left unset.
    On some versions of macOS, `TMPDIR` is set to a location under `/var/folders`.

    Note: The above syntax `${TMPDIR:-/tmp}` is known to be supported by Bash, zsh, and ksh.
    The shell will expand `TMPDIR`, but if it is unset, `/tmp` will be used.

1. View the output file.

        cat "${TMPDIR:-/tmp}/dsub-test/output/out.txt"

### Getting started on Google Cloud

`dsub` supports the use of two different APIs from Google Cloud for running
tasks. Google Cloud is transitioning from `Genomics v2alpha1`
to [Cloud Life Sciences v2beta](https://cloud.google.com/life-sciences/docs/reference/rest).

`dsub` supports both APIs with the (old) `google-v2` and (new) `google-cls-v2`
providers respectively. `google-v2` is the current default provider. `dsub`
will be transitioning to make `google-cls-v2` the default in coming releases.

The steps for getting started differ slightly between the two providers, as indicated below:

1.  Sign up for a Google account and
    [create a project](https://console.cloud.google.com/project?).

1.  Enable the APIs:

    - For the `v2alpha1` API (provider: `google-v2`):

     [Enable the Genomics, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=genomics,storage_component,compute_component&redirect=https://console.cloud.google.com).

    - For the `v2beta` API (provider: `google-cls-v2`):

     [Enable the Cloud Life Sciences, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=lifesciences.googleapis.com,storage_component,compute_component&redirect=https://console.cloud.google.com)

1. Provide [credentials](https://developers.google.com/identity/protocols/application-default-credentials)
    so `dsub` can call Google APIs:

        gcloud auth application-default login

1.  Create a [Google Cloud Storage](https://cloud.google.com/storage) bucket.

    The dsub logs and output files will be written to a bucket. Create a
    bucket using the [storage browser](https://console.cloud.google.com/storage/browser?project=)
    or run the command-line utility [gsutil](https://cloud.google.com/storage/docs/gsutil),
    included in the Cloud SDK.

        gsutil mb gs://my-bucket

    Change `my-bucket` to a unique name that follows the
    [bucket-naming conventions](https://cloud.google.com/storage/docs/bucket-naming).

    (By default, the bucket will be in the US, but you can change or
    refine the [location](https://cloud.google.com/storage/docs/bucket-locations)
    setting with the `-l` option.)

1.  Run a very simple "Hello World" `dsub` job and wait for completion.

    - For the `v2alpha1` API (provider: `google-v2`):

            dsub \
              --provider google-v2 \
              --project my-cloud-project \
              --regions us-central1 \
              --logging gs://my-bucket/logging/ \
              --output OUT=gs://my-bucket/output/out.txt \
              --command 'echo "Hello World" > "${OUT}"' \
              --wait

    Change `my-cloud-project` to your Google Cloud project, and `my-bucket` to
    the bucket you created above.

    - For the `v2beta` API (provider: `google-cls-v2`):

            dsub \
              --provider google-cls-v2 \
              --project my-cloud-project \
              --regions us-central1 \
              --logging gs://my-bucket/logging/ \
              --output OUT=gs://my-bucket/output/out.txt \
              --command 'echo "Hello World" > "${OUT}"' \
              --wait

    Change `my-cloud-project` to your Google Cloud project, and `my-bucket` to
    the bucket you created above.

    The output of the script command will be written to the `OUT` file in Cloud
    Storage that you specify.

1. View the output file.

        gsutil cat gs://my-bucket/output/out.txt

## Backend providers

Where possible, `dsub` supports developing and testing locally (for faster
iteration) and then progressing to running at scale.

To this end, `dsub` provides multiple "backend providers", each of which
implements a consistent runtime environment. The current providers are:

- local
- google-v2 (the default)
- google-cls-v2 (*new*)

More details on the runtime environment implemented by the backend providers
can be found in [dsub backend providers](https://github.com/DataBiosphere/dsub/blob/main/docs/providers/README.md).

### Differences between `google-v2` and `google-cls-v2`

The `google-cls-v2` provider is built on the Cloud Life Sciences `v2beta` API.
This API is very similar to its predecessor, the Genomics `v2alpha1` API.
Details of the differences can be found in the
[Migration Guide](https://cloud.google.com/life-sciences/docs/how-tos/migration).

`dsub` largely hides the differences between the two APIs, but there are a
few differences to note:

- `v2beta` is a regional service, while `v2alpha1` is a global service

With `v2alpha1`, the metadata about your tasks (called "operations") is stored
in a global database, while with `v2beta`, the metadata about your tasks is
stored in a regional database. If your operation
information needs to stay in a particular region, use the `v2beta` API
(the `google-cls-v2` provider), and specify the `--location` where your
operation information should be stored.

- The `--regions` and `--zones` flags can be omitted when using `google-cls-v2`

The `--regions` and `--zones` flags for `dsub` specify where the tasks should
run. More specifically, they specify which Compute Engine zones to use for
the VMs that run your tasks.

With the `google-v2` provider, there is no default region or zone, and thus
one of the `--regions` or `--zones` flags is required.

With `google-cls-v2`, the `--location` flag defaults to `us-central1`, and
if the `--regions` and `--zones` flags are omitted, the `location` will be
used as the default `regions` list.
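
For example, a hedged sketch of a `google-cls-v2` submission that omits
`--regions` and `--zones` and relies on the (default) `--location` (project and
bucket names are placeholders):

    dsub \
        --provider google-cls-v2 \
        --project my-cloud-project \
        --location us-central1 \
        --logging gs://my-bucket/logging/ \
        --command 'echo "Hello World"' \
        --wait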

## `dsub` features

The following sections show how to run more complex jobs.

### Defining what code to run

You can provide a shell command directly on the `dsub` command line, as in the
"Hello World" example above.

You can also save your script to a file, like `hello.sh`. Then you can run:

    dsub \
        ... \
        --script hello.sh
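
For reference, a minimal `hello.sh` might look like this (hypothetical contents,
assuming an output parameter named `OUT` like the one in the "Hello World"
examples above):

    #!/bin/bash
    # Write a greeting to the output path dsub provides via the OUT variable.
    echo "Hello World" > "${OUT}"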

If your script has dependencies that are not stored in your Docker image,
you can transfer them to the local disk. See the instructions below for
working with input and output files and folders.

### Selecting a Docker image

To get started more easily, `dsub` uses a stock Ubuntu Docker image.
This default image may change at any time in future releases, so for
reproducible production workflows, you should always specify the image
explicitly.

You can change the image by passing the `--image` flag.

    dsub \
        ... \
        --image ubuntu:16.04 \
        --script hello.sh

Note: your `--image` must include the
[Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) shell interpreter.

For more information on using the
`--image` flag, see the
[image section in Scripts, Commands, and Docker](https://github.com/DataBiosphere/dsub/blob/main/docs/code.md#--image-docker-image)

### Passing parameters to your script

You can pass environment variables to your script using the `--env` flag.

    dsub \
        ... \
        --env MESSAGE=hello \
        --command 'echo ${MESSAGE}'

The environment variable `MESSAGE` will be assigned the value `hello` when
your Docker container runs.

Your script or command can reference the variable like any other Linux
environment variable, as `${MESSAGE}`.

**Be sure to enclose your command string in single quotes and not double
quotes. If you use double quotes, the command will be expanded in your local
shell before being passed to dsub. For more information on using the
`--command` flag, see [Scripts, Commands, and Docker](https://github.com/DataBiosphere/dsub/blob/main/docs/code.md)**

To set multiple environment variables, you can repeat the flag:

    --env VAR1=value1 \
    --env VAR2=value2

You can also set multiple variables, space-delimited, with a single flag:

    --env VAR1=value1 VAR2=value2

### Working with input and output files and folders

dsub mimics the behavior of a shared file system using cloud storage
bucket paths for input and output files and folders. You specify
the cloud storage bucket path. Paths can be:

* file paths like `gs://my-bucket/my-file`
* folder paths like `gs://my-bucket/my-folder`
* wildcard paths like `gs://my-bucket/my-folder/*`

See the [inputs and outputs](https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md)
documentation for more details.
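
For instance, a wildcard input parameter might look like this (bucket path is
illustrative); the matching files are copied to the data disk, and the inputs
and outputs documentation describes how the corresponding environment variable
is set:

    --input INPUT_VCFS=gs://my-bucket/my-folder/*.vcf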

### Transferring input files to a Google Cloud Storage bucket

If your script expects to read local input files that are not already
contained within your Docker image, the files must be available in Google
Cloud Storage.

If your script has dependent files, you can make them available to your script
by:

 * Building a private Docker image with the dependent files and publishing the
   image to a public site, or privately to Google Container Registry or
   Artifact Registry
 * Uploading the files to Google Cloud Storage

To upload the files to Google Cloud Storage, you can use the
[storage browser](https://console.cloud.google.com/storage/browser?project=) or
[gsutil](https://cloud.google.com/storage/docs/gsutil). You can also run on data
that’s public or shared with your service account, an email address that you
can find in the [Google Cloud Console](https://console.cloud.google.com).

#### Files

To specify input and output files, use the `--input` and `--output` flags:

    dsub \
        ... \
        --input INPUT_FILE_1=gs://my-bucket/my-input-file-1 \
        --input INPUT_FILE_2=gs://my-bucket/my-input-file-2 \
        --output OUTPUT_FILE=gs://my-bucket/my-output-file \
        --command 'cat "${INPUT_FILE_1}" "${INPUT_FILE_2}" > "${OUTPUT_FILE}"'

In this example:

- a file will be copied from `gs://my-bucket/my-input-file-1` to a path on the data disk
- the path to the file on the data disk will be set in the environment variable `${INPUT_FILE_1}`
- a file will be copied from `gs://my-bucket/my-input-file-2` to a path on the data disk
- the path to the file on the data disk will be set in the environment variable `${INPUT_FILE_2}`

The `--command` can reference the file paths using the environment variables.

Also in this example:

- a path on the data disk will be set in the environment variable `${OUTPUT_FILE}`
- the output file will be written to the data disk at the location given by `${OUTPUT_FILE}`

After the `--command` completes, the output file will be copied to the bucket path `gs://my-bucket/my-output-file`.

Multiple `--input` and `--output` parameters can be specified, and they can be
specified in any order.

#### Folders

To copy folders rather than files, use the `--input-recursive` and
`--output-recursive` flags:

    dsub \
        ... \
        --input-recursive FOLDER=gs://my-bucket/my-folder \
        --command 'find ${FOLDER} -name "foo*"'
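
A corresponding hedged sketch for `--output-recursive` (paths are illustrative):
files your command writes under `${OUTPUT_FOLDER}` on the data disk are copied
to the bucket folder after the command completes.

    dsub \
        ... \
        --output-recursive OUTPUT_FOLDER=gs://my-bucket/my-output-folder \
        --command 'echo "result" > "${OUTPUT_FOLDER}/result.txt"'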

Multiple `--input-recursive` and `--output-recursive` parameters can be
specified, and they can be specified in any order.

#### Mounting "resource data"

While explicitly specifying inputs improves tracking provenance of your data,
there are cases where you might not want to explicitly localize all inputs
from Cloud Storage to your job VM.

For example, if you have:

- a large set of resource files
- code that reads only a subset of those files
- runtime decisions about which files to read

OR

- a large input file over which your code makes a single read pass

OR

- a large input file that your code does not read in its entirety

then you may find it more efficient or convenient to access this data by
mounting read-only:

- a Google Cloud Storage bucket
- a persistent disk that you pre-create and populate
- a persistent disk that gets created from a
[Compute Engine Image](https://cloud.google.com/compute/docs/images) that you
pre-create.

The `google-v2` and `google-cls-v2` providers support these methods of
providing access to resource data.

The `local` provider supports mounting a
local directory in a similar fashion to support your local development.

##### Mounting a Google Cloud Storage bucket

To have the `google-v2` or `google-cls-v2` provider mount a Cloud Storage bucket
using [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse),
use the `--mount` command line flag:

    --mount RESOURCES=gs://mybucket

The bucket will be mounted into the Docker container running your `--script`
or `--command` and the location made available via the environment variable
`${RESOURCES}`. Inside your script, you can reference the mounted path using the
environment variable. Please read
[Key differences from a POSIX file system](https://cloud.google.com/storage/docs/gcs-fuse#notes)
and [Semantics](https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md)
before using Cloud Storage FUSE.

##### Mounting an existing persistent disk

To have the `google-v2` or `google-cls-v2` provider mount a persistent disk that
you have pre-created and populated, use the `--mount` command line flag and the
URL of the source disk:

    --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/zones/your_disk_zone/disks/your-disk"

##### Mounting a persistent disk, created from an image

To have the `google-v2` or `google-cls-v2` provider mount a persistent disk created from an image,
use the `--mount` command line flag with the URL of the source image and the size
(in GB) of the disk:

    --mount RESOURCES="https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50"

The image will be used to create a new persistent disk, which will be attached
to a Compute Engine VM. The disk will be mounted into the Docker container running
your `--script` or `--command` and the location made available by the
environment variable `${RESOURCES}`. Inside your script, you can reference the
mounted path using the environment variable.

To create an image, see [Creating a custom image](https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images).

##### Mounting a local directory (`local` provider)

To have the `local` provider mount a directory read-only, use the `--mount`
command line flag and a `file://` prefix:

    --mount RESOURCES=file://path/to/my/dir

The local directory will be mounted into the Docker container running your
`--script` or `--command` and the location made available via the environment
variable `${RESOURCES}`. Inside your script, you can reference the mounted
path using the environment variable.

### Setting resource requirements

`dsub` tasks run using the `local` provider will use the resources available on
your local machine.

`dsub` tasks run using the `google-v2` or `google-cls-v2` providers can take
advantage of a wide range of CPU, RAM, disk, and hardware accelerator
(e.g. GPU) options.

See the [Compute Resources](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_resources.md)
documentation for details.
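
For example, a hedged sketch (flag names as documented in Compute Resources;
values are placeholders):

    dsub \
        ... \
        --machine-type n1-standard-8 \
        --disk-size 200 \
        --script hello.sh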

### Job Identifiers

By default, `dsub` generates a `job-id` with the form
`job-name--userid--timestamp` where the `job-name` is truncated at 10 characters
and the `timestamp` is of the form `YYMMDD-HHMMSS-XX`, unique to hundredths of a
second. If you are submitting multiple jobs concurrently, you may still run into
situations where the `job-id` is not unique. If you require a unique `job-id`
for this situation, you may use the `--unique-job-id` parameter.

If the `--unique-job-id` parameter is set, the `job-id` will instead be a
unique 32-character UUID created with the Python
[uuid module](https://docs.python.org/3/library/uuid.html). Because
some providers require that the `job-id` begin with a letter, `dsub` will
replace any starting digit with a letter in a manner that preserves uniqueness.
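
For example (the `...` stands for the usual provider, project, and logging
flags shown in the examples above):

    dsub \
        ... \
        --unique-job-id \
        --command 'echo "Hello World"'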

### Submitting a batch job

Each of the examples above has demonstrated submitting a single task with
a single set of variables, inputs, and outputs. If you have a batch of inputs
and you want to run the same operation over them, `dsub` allows you
to create a batch job.

Instead of calling `dsub` repeatedly, you can create
a tab-separated values (TSV) file containing the variables,
inputs, and outputs for each task, and then call `dsub` once.
The result will be a single `job-id` with multiple tasks. The tasks will
be scheduled and run independently, but can be
[monitored](https://github.com/DataBiosphere/dsub#viewing-job-status) and
[deleted](https://github.com/DataBiosphere/dsub#deleting-a-job) as a group.

#### Tasks file format

The first line of the TSV file specifies the names and types of the
parameters. For example:

    --env SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH

Each line beyond the header provides the variable, input, and output values
for a separate task.

Multiple `--env`, `--input`, and `--output` parameters can be specified and
they can be specified in any order. For example:

    --env SAMPLE<tab>--input A<tab>--input B<tab>--env REFNAME<tab>--output O
    S1<tab>gs://path/A1.txt<tab>gs://path/B1.txt<tab>R1<tab>gs://path/O1.txt
    S2<tab>gs://path/A2.txt<tab>gs://path/B2.txt<tab>R2<tab>gs://path/O2.txt
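
Because `<tab>` above stands for a literal tab character, one way to create
such a file from a shell is with `printf` (bucket paths and sample IDs are
illustrative):

    printf -- '--env SAMPLE_ID\t--input VCF_FILE\t--output OUTPUT_PATH\n' > my-tasks.tsv
    printf -- 'S1\tgs://my-bucket/inputs/S1.vcf\tgs://my-bucket/outputs/S1.txt\n' >> my-tasks.tsv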


#### Tasks parameter

Pass the TSV file to dsub using the `--tasks` parameter. This parameter
accepts the file path and, optionally, a range of tasks to process.
The file may be read from the local filesystem (on the machine you're calling
`dsub` from), or from a bucket in Google Cloud Storage (file name starts with
"gs://").

For example, suppose `my-tasks.tsv` contains 101 lines: a one-line header and
100 lines of parameters for tasks to run. Then:

    dsub ... --tasks ./my-tasks.tsv

will create a job with 100 tasks, while:

    dsub ... --tasks ./my-tasks.tsv 1-10

will create a job with 10 tasks, one for each of lines 2 through 11.

The task range values can take any of the following forms:

*   `m` indicates to submit task `m` (line m+1)
*   `m-` indicates to submit all tasks starting with task `m`
*   `m-n` indicates to submit all tasks from `m` to `n` (inclusive).
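
For example, with the 101-line `my-tasks.tsv` file above:

    dsub ... --tasks ./my-tasks.tsv 6-

will create a job with 95 tasks, one for each of lines 7 through 101.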

### Logging

The `--logging` flag points to a location for `dsub` task log files. For details
on how to specify your logging path, see [Logging](https://github.com/DataBiosphere/dsub/blob/main/docs/logging.md).

### Job control

It's possible to wait for a job to complete before starting another.
For details, see [job control with dsub](https://github.com/DataBiosphere/dsub/blob/main/docs/job_control.md).
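
As a hedged sketch, assuming the `--after` flag and the job-id-on-stdout
behavior described in the job control documentation (project, bucket, and
commands are placeholders):

    # Capture the job-id of the first job.
    JOB_A="$(dsub \
        --provider google-v2 \
        --project my-cloud-project \
        --regions us-central1 \
        --logging gs://my-bucket/logging/ \
        --command 'echo "step 1"')"

    # Submit a second job that starts only after the first succeeds.
    dsub \
        --provider google-v2 \
        --project my-cloud-project \
        --regions us-central1 \
        --logging gs://my-bucket/logging/ \
        --after "${JOB_A}" \
        --command 'echo "step 2"' \
        --wait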

### Retries

It is possible for `dsub` to automatically retry failed tasks.
For details, see [retries with dsub](https://github.com/DataBiosphere/dsub/blob/main/docs/retries.md).
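
For example, a hedged sketch assuming the `--retries` flag described there
(used here together with `--wait`):

    dsub \
        ... \
        --retries 3 \
        --wait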

### Labeling jobs and tasks

You can add custom labels to jobs and tasks, which allows you to monitor and
cancel tasks using your own identifiers. In addition, with the Google
providers, labeling a task will label associated compute resources such as
virtual machines and disks.
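
As a hedged sketch of adding a label at submission time (label key and value
are placeholders):

    dsub \
        ... \
        --label batch-id=run-2024-05 \
        --command 'echo "Hello World"'

The same label can then be used when monitoring or canceling tasks; the guide
linked below covers label restrictions and filtering.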

For more details, see [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md)

### Viewing job status

The `dstat` command displays the status of jobs:

    dstat --provider google-v2 --project my-cloud-project

With no additional arguments, dstat will display a list of *running* jobs for
the current `USER`.

To display the status of a specific job, use the `--jobs` flag:

    dstat --provider google-v2 --project my-cloud-project --jobs job-id

For a batch job, the output will list all *running* tasks.

Each job submitted by dsub is given a set of metadata values that can be
used for job identification and job control. The metadata associated with
each job includes:

*   `job-name`: defaults to the name of your script file or the first word of
    your script command; it can be explicitly set with the `--name` parameter.
*   `user-id`: the `USER` environment variable value.
*   `job-id`: identifier of the job, which can be used in calls to `dstat` and
    `ddel` for job monitoring and canceling respectively. See
    [Job Identifiers](https://github.com/DataBiosphere/dsub#job-identifiers) for more
    details on the `job-id` format.
*   `task-id`: if the job is submitted with the `--tasks` parameter, each task
    gets a sequential value of the form "task-*n*" where *n* is 1-based.

Note that the job metadata values will be modified to conform with the "Label
Restrictions" listed in the [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md)
guide.

Metadata can be used to cancel a job or individual tasks within a batch job.

For more details, see [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md)

#### Summarizing job status

By default, dstat outputs one line per task. If you're using a batch job with
many tasks then you may benefit from `--summary`.

```
$ dstat --provider google-v2 --project my-project --status '*' --summary

Job Name        Status         Task Count
-------------   -------------  -------------
my-job-name     RUNNING        2
my-job-name     SUCCESS        1
```

In this mode, dstat prints one line per (job name, task status) pair. You can
see at a glance how many tasks are finished, how many are still running, and
how many have failed or been canceled.

### Deleting a job

The `ddel` command will delete running jobs.

By default, only jobs submitted by the current user will be deleted.
Use the `--users` flag to specify other users, or `'*'` for all users.
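
For example, to delete a specific job submitted by another user (user name is
a placeholder):

    ddel --provider google-v2 --project my-cloud-project --users other-user --jobs job-id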

To delete a running job:

    ddel --provider google-v2 --project my-cloud-project --jobs job-id

If the job is a batch job, all running tasks will be deleted.

To delete specific tasks:

    ddel \
        --provider google-v2 \
        --project my-cloud-project \
        --jobs job-id \
        --tasks task-id1 task-id2

To delete all running jobs for the current user:

    ddel --provider google-v2 --project my-cloud-project --jobs '*'

## Service Accounts and Scope (Google providers only)

When you run the `dsub` command with the `google-v2` or `google-cls-v2`
provider, there are two different sets of credentials to consider:

- Account submitting the `pipelines.run()` request to run your command/script on a VM
- Account accessing Cloud resources (such as files in GCS) when executing your command/script

The `pipelines.run()` request is typically submitted with your end-user
credentials. You would have set this up by running:

    gcloud auth application-default login

The account used on the VM is a [service account](https://cloud.google.com/iam/docs/service-accounts).
The image below illustrates this:

![Pipelines Runner Architecture](./docs/images/pipelines_runner_architecture.png)

By default, `dsub` will use the [default Compute Engine service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account)
as the authorized service account on the VM instance. You can choose to specify
the email address of another service account using `--service-account`.

By default, `dsub` will grant the following access scopes to the service account:

- https://www.googleapis.com/auth/bigquery
- https://www.googleapis.com/auth/compute
- https://www.googleapis.com/auth/devstorage.full_control
- https://www.googleapis.com/auth/genomics
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring.write

In addition, [the API](https://cloud.google.com/life-sciences/docs/reference/rest/v2beta/projects.locations.pipelines/run#serviceaccount) will always add this scope:

- https://www.googleapis.com/auth/cloud-platform

You can choose to specify scopes using `--scopes`.
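
For example, a hedged sketch that narrows Cloud Storage access to read-only
(check `dsub --help` for the exact `--scopes` syntax before relying on this
form):

    dsub \
        ... \
        --scopes "https://www.googleapis.com/auth/devstorage.read_only" \
        --command 'echo "Hello World"'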

### Recommendations for service accounts

While it is straightforward to use the default service account, this account also
has broad privileges granted to it by default. Following the
[Principle of Least Privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege),
you may want to create and use a service account that has only the privileges
needed to run your `dsub` command/script.

To create a new service account, follow the steps below:

1. Execute the `gcloud iam service-accounts create` command. The email address
of the service account will be `sa-name@project-id.iam.gserviceaccount.com`.

        gcloud iam service-accounts create "sa-name"

2. Grant IAM access on buckets, etc. to the service account.

        gsutil iam ch serviceAccount:sa-name@project-id.iam.gserviceaccount.com:roles/storage.objectAdmin gs://bucket-name

3. Update your `dsub` command to include `--service-account`.

        dsub \
          --service-account sa-name@project-id.iam.gserviceaccount.com \
          ...

## What next?

*   See the examples:

    *   [Custom scripts](https://github.com/DataBiosphere/dsub/tree/main/examples/custom_scripts)
    *   [Decompress files](https://github.com/DataBiosphere/dsub/tree/main/examples/decompress)
    *   [FastQC](https://github.com/DataBiosphere/dsub/tree/main/examples/fastqc)
    *   [Samtools index](https://github.com/DataBiosphere/dsub/tree/main/examples/samtools)

*   See more documentation for:

    *   [Scripts, Commands, and Docker](https://github.com/DataBiosphere/dsub/blob/main/docs/code.md)
    *   [Input and Output File Handling](https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md)
    *   [Logging](https://github.com/DataBiosphere/dsub/blob/main/docs/logging.md)
    *   [Compute Resources](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_resources.md)
    *   [Compute Quotas](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_quotas.md)
    *   [Job Control](https://github.com/DataBiosphere/dsub/blob/main/docs/job_control.md)
    *   [Retries](https://github.com/DataBiosphere/dsub/blob/main/docs/retries.md)
    *   [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md)
    *   [Backend providers](https://github.com/DataBiosphere/dsub/blob/main/docs/providers/README.md)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/DataBiosphere/dsub",
    "name": "dsub",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "cloud bioinformatics",
    "author": "Verily",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/17/aa/fe7e4a707e5ad2279f2c0ad98ab335781c2db224942a842cb6d48184ada8/dsub-0.4.11.tar.gz",
    "platform": null,
    "description": "# dsub: simple batch jobs with Docker\n[![License](https://img.shields.io/badge/license-Apache%202.0-brightgreen.svg)](https://github.com/DataBiosphere/dsub/blob/main/LICENSE)\n\n## Overview\n\n`dsub` is a command-line tool that makes it easy to submit and run batch scripts\nin the cloud.\n\nThe `dsub` user experience is modeled after traditional high-performance\ncomputing job schedulers like Grid Engine and Slurm. You write a script and\nthen submit it to a job scheduler from a shell prompt on your local machine.\n\nToday `dsub` supports Google Cloud as the backend batch job runner, along with a\nlocal provider for development and testing. With help from the community, we'd\nlike to add other backends, such as a Grid Engine, Slurm, Amazon Batch,\nand Azure Batch.\n\n## Getting started\n\n`dsub` is written in Python and requires Python 3.7 or higher.\n\n* The last version to support Python 3.6 was `dsub` [0.4.7](https://github.com/DataBiosphere/dsub/releases/tag/v0.4.7).\n* For earlier versions of Python 3, use `dsub` [0.4.1](https://github.com/DataBiosphere/dsub/releases/tag/v0.4.1).\n* For Python 2, use `dsub` [0.3.10](https://github.com/DataBiosphere/dsub/releases/tag/v0.3.10).\n\n### Pre-installation steps\n\n#### Create a Python virtual environment\n\nThis is optional, but whether installing from PyPI or from github,\nyou are strongly encouraged to use a\n[Python virtual environment](https://docs.python.org/3/library/venv.html).\n\nYou can do this in a directory of your choosing.\n\n        python3 -m venv dsub_libs\n        source dsub_libs/bin/activate\n\nUsing a Python virtual environment isolates `dsub` library dependencies from\nother Python applications on your system.\n\nActivate this virtual environment in any shell session before running `dsub`.\nTo deactivate the virtual environment in your shell, run the command:\n\n        deactivate\n\nAlternatively, a set of convenience scripts are provided that activate the\nvirutalenv before calling `dsub`, `dstat`, and `ddel`. They are in the\n[bin](https://github.com/DataBiosphere/dsub/tree/main/bin) directory. You can\nuse these scripts if you don't want to activate the virtualenv explicitly in\nyour shell.\n\n#### Install the Google Cloud SDK\n\nWhile not used directly by `dsub` for the `google-v2` or `google-cls-v2` providers, you are likely to want to install the command line tools found in the [Google\nCloud SDK](https://cloud.google.com/sdk/).\n\nIf you will be using the `local` provider for faster job development,\nyou *will* need to install the Google Cloud SDK, which uses `gsutil` to ensure\nfile operation semantics consistent with the Google `dsub` providers.\n\n1. [Install the Google Cloud SDK](https://cloud.google.com/sdk/)\n2. Run\n\n        gcloud init\n\n\n    `gcloud` will prompt you to set your default project and to grant\n    credentials to the Google Cloud SDK.\n\n### Install `dsub`\n\nChoose **one** of the following:\n\n#### Install from PyPI\n\n1.  If necessary, [install pip](https://pip.pypa.io/en/stable/installing/).\n\n1.  Install `dsub`\n\n         pip install dsub\n\n#### Install from github\n\n1.  Be sure you have git installed\n\n    Instructions for your environment can be found on the\n    [git website](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).\n\n1.  Clone this repository.\n\n        git clone https://github.com/DataBiosphere/dsub\n        cd dsub\n\n1.  Install dsub (this will also install the dependencies)\n\n        python -m pip install .\n\n1.  
Set up Bash tab completion (optional).\n\n        source bash_tab_complete\n\n### Post-installation steps\n\n1.  Minimally verify the installation by running:\n\n        dsub --help\n\n1.  (Optional) [Install Docker](https://docs.docker.com/engine/installation/).\n\n    This is necessary only if you're going to create your own Docker images or\n    use the `local` provider.\n\n### Makefile\n\nAfter cloning the dsub repo, you can also use the\n[Makefile](https://github.com/DataBiosphere/dsub/blob/main/Makefile)\nby running:\n\n        make\n\nThis will create a Python virtual environment and install `dsub` into a\ndirectory named `dsub_libs`.\n\n### Getting started with the local provider\n\nWe think you'll find the `local` provider to be very helpful when building\nyour `dsub` tasks. Instead of submitting a request to run your command on a\ncloud VM, the `local` provider runs your `dsub` tasks on your local machine.\n\nThe `local` provider is not designed for running at scale. It is designed\nto emulate running on a cloud VM such that you can rapidly iterate.\nYou'll get quicker turnaround times and won't incur cloud charges using it.\n\n1. Run a `dsub` job and wait for completion.\n\n    Here is a very simple \"Hello World\" test:\n\n        dsub \\\n          --provider local \\\n          --logging \"${TMPDIR:-/tmp}/dsub-test/logging/\" \\\n          --output OUT=\"${TMPDIR:-/tmp}/dsub-test/output/out.txt\" \\\n          --command 'echo \"Hello World\" > \"${OUT}\"' \\\n          --wait\n\n    Note: `TMPDIR` is commonly set to `/tmp` by default on most Unix systems,\n    although it is also often left unset.\n    On some versions of MacOS TMPDIR is set to a location under `/var/folders`.\n\n    Note: The above syntax `${TMPDIR:-/tmp}` is known to be supported by Bash, zsh, ksh.\n    The shell will expand `TMPDIR`, but if it is unset, `/tmp` will be used.\n\n1. View the output file.\n\n        cat \"${TMPDIR:-/tmp}/dsub-test/output/out.txt\"\n\n### Getting started on Google Cloud\n\n`dsub` supports the use of two different APIs from Google Cloud for running\ntasks. Google Cloud is transitioning from `Genomics v2alpha1`\nto [Cloud Life Sciences v2beta](https://cloud.google.com/life-sciences/docs/reference/rest).\n\n`dsub` supports both APIs with the (old) `google-v2` and (new) `google-cls-v2`\nproviders respectively. `google-v2` is the current default provider. `dsub`\nwill be transitioning to make `google-cls-v2` the default in coming releases.\n\nThe steps for getting started differ slightly as indicated in the steps below:\n\n1.  Sign up for a Google account and\n    [create a project](https://console.cloud.google.com/project?).\n\n1.  Enable the APIs:\n\n    - For the `v2alpha1` API (provider: `google-v2`):\n\n     [Enable the Genomics, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=genomics,storage_component,compute_component&redirect=https://console.cloud.google.com).\n\n    - For the `v2beta` API (provider: `google-cls-v2`):\n\n     [Enable the Cloud Life Sciences, Storage, and Compute APIs](https://console.cloud.google.com/flows/enableapi?apiid=lifesciences.googleapis.com,storage_component,compute_component&redirect=https://console.cloud.google.com)\n\n1. Provide [credentials](https://developers.google.com/identity/protocols/application-default-credentials)\n    so `dsub` can call Google APIs:\n\n        gcloud auth application-default login\n\n1.  
Create a [Google Cloud Storage](https://cloud.google.com/storage) bucket.\n\n    The dsub logs and output files will be written to a bucket. Create a\n    bucket using the [storage browser](https://console.cloud.google.com/storage/browser?project=)\n    or run the command-line utility [gsutil](https://cloud.google.com/storage/docs/gsutil),\n    included in the Cloud SDK.\n\n        gsutil mb gs://my-bucket\n\n    Change `my-bucket` to a unique name that follows the\n    [bucket-naming conventions](https://cloud.google.com/storage/docs/bucket-naming).\n\n    (By default, the bucket will be in the US, but you can change or\n    refine the [location](https://cloud.google.com/storage/docs/bucket-locations)\n    setting with the `-l` option.)\n\n1.  Run a very simple \"Hello World\" `dsub` job and wait for completion.\n\n    - For the `v2alpha1` API (provider: `google-v2`):\n\n            dsub \\\n              --provider google-v2 \\\n              --project my-cloud-project \\\n              --regions us-central1 \\\n              --logging gs://my-bucket/logging/ \\\n              --output OUT=gs://my-bucket/output/out.txt \\\n              --command 'echo \"Hello World\" > \"${OUT}\"' \\\n              --wait\n\n    Change `my-cloud-project` to your Google Cloud project, and `my-bucket` to\n    the bucket you created above.\n\n    - For the `v2beta` API (provider: `google-cls-v2`):\n\n            dsub \\\n              --provider google-cls-v2 \\\n              --project my-cloud-project \\\n              --regions us-central1 \\\n              --logging gs://my-bucket/logging/ \\\n              --output OUT=gs://my-bucket/output/out.txt \\\n              --command 'echo \"Hello World\" > \"${OUT}\"' \\\n              --wait\n\n    Change `my-cloud-project` to your Google Cloud project, and `my-bucket` to\n    the bucket you created above.\n\n    The output of the script command will be written to the `OUT` file in Cloud\n    Storage that you specify.\n\n1. View the output file.\n\n        gsutil cat gs://my-bucket/output/out.txt\n\n## Backend providers\n\nWhere possible, `dsub` tries to support users being able to develop and test\nlocally (for faster iteration) and then progressing to running at scale.\n\nTo this end, `dsub` provides multiple \"backend providers\", each of which\nimplements a consistent runtime environment. The current providers are:\n\n- local\n- google-v2 (the default)\n- google-cls-v2 (*new*)\n\nMore details on the runtime environment implemented by the backend providers\ncan be found in [dsub backend providers](https://github.com/DataBiosphere/dsub/blob/main/docs/providers/README.md).\n\n### Differences between `google-v2` and `google-cls-v2`\n\nThe `google-cls-v2` provider is built on the Cloud Life Sciences `v2beta` API.\nThis API is very similar to its predecessor, the Genomics `v2alpha1` API.\nDetails of the differences can be found in the\n[Migration Guide](https://cloud.google.com/life-sciences/docs/how-tos/migration).\n\n`dsub` largely hides the differences between the two APIs, but there are a\nfew difference to note:\n\n- `v2beta` is a regional service, `v2alpha1` is a global service\n\nWhat this means is that with `v2alpha1`, the metadata about your tasks\n(called \"operations\"), is stored in a global database, while with `v2beta`, the\nmetadata about your tasks are stored in a regional database. 
If your operation\ninformation needs to stay in a particular region, use the `v2beta` API\n(the `google-cls-v2` provider), and specify the `--location` where your\noperation information should be stored.\n\n- The `--regions` and `--zones` flags can be omitted when using `google-cls-v2`\n\nThe `--regions` and `--zones` flags for `dsub` specify where the tasks should\nrun. More specifically, this specifies what Compute Engine Zones to use for\nthe VMs that run your tasks.\n\nWith the `google-v2` provider, there is no default region or zone, and thus\none of the `--regions` or `--zones` flags is required.\n\nWith `google-cls-v2`, the `--location` flag defaults to `us-central1`, and\nif the `--regions` and `--zones` flags are omitted, the `location` will be\nused as the default `regions` list.\n\n## `dsub` features\n\nThe following sections show how to run more complex jobs.\n\n### Defining what code to run\n\nYou can provide a shell command directly in the dsub command-line, as in the\nhello example above.\n\nYou can also save your script to a file, like `hello.sh`. Then you can run:\n\n    dsub \\\n        ... \\\n        --script hello.sh\n\nIf your script has dependencies that are not stored in your Docker image,\nyou can transfer them to the local disk. See the instructions below for\nworking with input and output files and folders.\n\n### Selecting a Docker image\n\nTo get started more easily, `dsub` uses a stock Ubuntu Docker image.\nThis default image may change at any time in future releases, so for\nreproducible production workflows, you should always specify the image\nexplicitly.\n\nYou can change the image by passing the `--image` flag.\n\n    dsub \\\n        ... \\\n        --image ubuntu:16.04 \\\n        --script hello.sh\n\nNote: your `--image` must include the\n[Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) shell interpreter.\n\nFor more information on using the\n`--image` flag, see the\n[image section in Scripts, Commands, and Docker](https://github.com/DataBiosphere/dsub/blob/main/docs/code.md#--image-docker-image)\n\n### Passing parameters to your script\n\nYou can pass environment variables to your script using the `--env` flag.\n\n    dsub \\\n        ... \\\n        --env MESSAGE=hello \\\n        --command 'echo ${MESSAGE}'\n\nThe environment variable `MESSAGE` will be assigned the value `hello` when\nyour Docker container runs.\n\nYour script or command can reference the variable like any other Linux\nenvironment variable, as `${MESSAGE}`.\n\n**Be sure to enclose your command string in single quotes and not double\nquotes. If you use double quotes, the command will be expanded in your local\nshell before being passed to dsub. For more information on using the\n`--command` flag, see [Scripts, Commands, and Docker](https://github.com/DataBiosphere/dsub/blob/main/docs/code.md)**\n\nTo set multiple environment variables, you can repeat the flag:\n\n    --env VAR1=value1 \\\n    --env VAR2=value2\n\nYou can also set multiple variables, space-delimited, with a single flag:\n\n    --env VAR1=value1 VAR2=value2\n\n### Working with input and output files and folders\n\ndsub mimics the behavior of a shared file system using cloud storage\nbucket paths for input and output files and folders. You specify\nthe cloud storage bucket path. 
Paths can be:\n\n* file paths like `gs://my-bucket/my-file`\n* folder paths like `gs://my-bucket/my-folder`\n* wildcard paths like `gs://my-bucket/my-folder/*`\n\nSee the [inputs and outputs](https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md)\ndocumentation for more details.\n\n### Transferring input files to a Google Cloud Storage bucket.\n\nIf your script expects to read local input files that are not already\ncontained within your Docker image, the files must be available in Google\nCloud Storage.\n\nIf your script has dependent files, you can make them available to your script\nby:\n\n * Building a private Docker image with the dependent files and publishing the\n   image to a public site, or privately to Google Container Registry or\n   Artifact Registry\n * Uploading the files to Google Cloud Storage\n\nTo upload the files to Google Cloud Storage, you can use the\n[storage browser](https://console.cloud.google.com/storage/browser?project=) or\n[gsutil](https://cloud.google.com/storage/docs/gsutil). You can also run on data\nthat\u2019s public or shared with your service account, an email address that you\ncan find in the [Google Cloud Console](https://console.cloud.google.com).\n\n#### Files\n\nTo specify input and output files, use the `--input` and `--output` flags:\n\n    dsub \\\n        ... \\\n        --input INPUT_FILE_1=gs://my-bucket/my-input-file-1 \\\n        --input INPUT_FILE_2=gs://my-bucket/my-input-file-2 \\\n        --output OUTPUT_FILE=gs://my-bucket/my-output-file \\\n        --command 'cat \"${INPUT_FILE_1}\" \"${INPUT_FILE_2}\" > \"${OUTPUT_FILE}\"'\n\nIn this example:\n\n- a file will be copied from `gs://my-bucket/my-input-file-1` to a path on the data disk\n- the path to the file on the data disk will be set in the environment variable `${INPUT_FILE_1}`\n- a file will be copied from `gs://my-bucket/my-input-file-2` to a path on the data disk\n- the path to the file on the data disk will be set in the environment variable `${INPUT_FILE_2}`\n\nThe `--command` can reference the file paths using the environment variables.\n\nAlso in this example:\n\n- a path on the data disk will be set in the environment variable `${OUTPUT_FILE}`\n- the output file will written to the data disk at the location given by `${OUTPUT_FILE}`\n\nAfter the `--command` completes, the output file will be copied to the bucket path `gs://my-bucket/my-output-file`\n\nMultiple `--input`, and `--output` parameters can be specified and\nthey can be specified in any order.\n\n#### Folders\n\nTo copy folders rather than files, use the `--input-recursive` and\n`output-recursive` flags:\n\n    dsub \\\n        ... 
\\\n        --input-recursive FOLDER=gs://my-bucket/my-folder \\\n        --command 'find ${FOLDER} -name \"foo*\"'\n\nMultiple `--input-recursive`, and `--output-recursive` parameters can be\nspecified and they can be specified in any order.\n\n#### Mounting \"resource data\"\n\nWhile explicitly specifying inputs improves tracking provenance of your data,\nthere are cases where you might not want to expliclty localize all inputs\nfrom Cloud Storage to your job VM.\n\nFor example, if you have:\n\n- a large set of resource files\n- your code only reads a subset of those files\n- runtime decisions of which files to read\n\nOR\n\n- a large input file over which your code makes a single read pass\n\nOR\n\n- a large input file that your code does not read in its entirety\n\nthen you may find it more efficient or convenient to access this data by\nmounting read-only:\n\n- a Google Cloud Storage bucket\n- a persistent disk that you pre-create and populate\n- a persistent disk that gets created from a\n[Compute Engine Image](https://cloud.google.com/compute/docs/images) that you\npre-create.\n\nThe `google-v2` and `google-cls-v2` providers support these methods of\nproviding access to resource data.\n\nThe `local` provider supports mounting a\nlocal directory in a similar fashion to support your local development.\n\n##### Mounting a Google Cloud Storage bucket\n\nTo have the `google-v2` or `google-cls-v2` provider mount a Cloud Storage bucket\nusing [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse),\nuse the `--mount` command line flag:\n\n    --mount RESOURCES=gs://mybucket\n\nThe bucket will be mounted into the Docker container running your `--script`\nor `--command` and the location made available via the environment variable\n`${RESOURCES}`. Inside your script, you can reference the mounted path using the\nenvironment variable. Please read\n[Key differences from a POSIX file system](https://cloud.google.com/storage/docs/gcs-fuse#notes)\nand [Semantics](https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md)\nbefore using Cloud Storage FUSE.\n\n##### Mounting an existing peristent disk\n\nTo have the `google-v2` or `google-cls-v2` provider mount a persistent disk that\nyou have pre-created and populated, use the `--mount` command line flag and the\nurl of the source disk:\n\n    --mount RESOURCES=\"https://www.googleapis.com/compute/v1/projects/your-project/zones/your_disk_zone/disks/your-disk\"\n\n##### Mounting a persistent disk, created from an image\n\nTo have the `google-v2` or `google-cls-v2` provider mount a persistent disk created from an image,\nuse the `--mount` command line flag and the url of the source image and the size\n(in GB) of the disk:\n\n    --mount RESOURCES=\"https://www.googleapis.com/compute/v1/projects/your-project/global/images/your-image 50\"\n\nThe image will be used to create a new persistent disk, which will be attached\nto a Compute Engine VM. The disk will mounted into the Docker container running\nyour `--script` or `--command` and the location made available by the\nenvironment variable `${RESOURCES}`. 
Inside your script, you can reference the\nmounted path using the environment variable.\n\nTo create an image, see [Creating a custom image](https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images).\n\n##### Mounting a local directory (`local` provider)\n\nTo have the `local` provider mount a directory read-only, use the `--mount`\ncommand line flag and a `file://` prefix:\n\n    --mount RESOURCES=file://path/to/my/dir\n\nThe local directory will be mounted into the Docker container running your\n`--script`or `--command` and the location made available via the environment\nvariable `${RESOURCES}`. Inside your script, you can reference the mounted\npath using the environment variable.\n\n### Setting resource requirements\n\n`dsub` tasks run using the `local` provider will use the resources available on\nyour local machine.\n\n`dsub` tasks run using the `google`, `google-v2`, or `google-cls-v2` providers can take advantage\nof a wide range of CPU, RAM, disk, and hardware accelerator (eg. GPU) options.\n\nSee the [Compute Resources](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_resources.md)\ndocumentation for details.\n\n### Job Identifiers\n\nBy default, `dsub` generates a `job-id` with the form\n`job-name--userid--timestamp` where the `job-name` is truncated at 10 characters\nand the `timestamp` is of the form `YYMMDD-HHMMSS-XX`, unique to hundredths of a\nsecond. If you are submitting multiple jobs concurrently, you may still run into\nsituations where the `job-id` is not unique. If you require a unique `job-id`\nfor this situation, you may use the `--unique-job-id` parameter.\n\nIf the `--unique-job-id` parameter is set, `job-id` will instead be a unique 32\ncharacter UUID created by https://docs.python.org/3/library/uuid.html. Because\nsome providers require that the `job-id` begin with a letter, `dsub` will\nreplace any starting digit with a letter in a manner that preserves uniqueness.\n\n### Submitting a batch job\n\nEach of the examples above has demonstrated submitting a single task with\na single set of variables, inputs, and outputs. If you have a batch of inputs\nand you want to run the same operation over them, `dsub` allows you\nto create a batch job.\n\nInstead of calling `dsub` repeatedly, you can create\na tab-separated values (TSV) file containing the variables,\ninputs, and outputs for each task, and then call `dsub` once.\nThe result will be a single `job-id` with multiple tasks. The tasks will\nbe scheduled and run independently, but can be\n[monitored](https://github.com/DataBiosphere/dsub#viewing-job-status) and\n[deleted](https://github.com/DataBiosphere/dsub#deleting-a-job) as a group.\n\n#### Tasks file format\n\nThe first line of the TSV file specifies the names and types of the\nparameters. For example:\n\n    --env SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH\n\nEach addition line in the file should provide the variable, input, and output\nvalues for each task. Each line beyond the header represents the values for a\nseparate task.\n\nMultiple `--env`, `--input`, and `--output` parameters can be specified and\nthey can be specified in any order. For example:\n\n    --env SAMPLE<tab>--input A<tab>--input B<tab>--env REFNAME<tab>--output O\n    S1<tab>gs://path/A1.txt<tab>gs://path/B1.txt<tab>R1<tab>gs://path/O1.txt\n    S2<tab>gs://path/A2.txt<tab>gs://path/B2.txt<tab>R2<tab>gs://path/O2.txt\n\n\n#### Tasks parameter\n\nPass the TSV file to dsub using the `--tasks` parameter. 
This parameter\naccepts both the file path and optionally a range of tasks to process.\nThe file may be read from the local filesystem (on the machine you're calling\n`dsub` from), or from a bucket in Google Cloud Storage (file name starts with\n\"gs://\").\n\nFor example, suppose `my-tasks.tsv` contains 101 lines: a one-line header and\n100 lines of parameters for tasks to run. Then:\n\n    dsub ... --tasks ./my-tasks.tsv\n\nwill create a job with 100 tasks, while:\n\n    dsub ... --tasks ./my-tasks.tsv 1-10\n\nwill create a job with 10 tasks, one for each of lines 2 through 11.\n\nThe task range values can take any of the following forms:\n\n*   `m` indicates to submit task `m` (line m+1)\n*   `m-` indicates to submit all tasks starting with task `m`\n*   `m-n` indicates to submit all tasks from `m` to `n` (inclusive).\n\n### Logging\n\nThe `--logging` flag points to a location for `dsub` task log files. For details\non how to specify your logging path, see [Logging](https://github.com/DataBiosphere/dsub/blob/main/docs/logging.md).\n\n### Job control\n\nIt's possible to wait for a job to complete before starting another.\nFor details, see [job control with dsub](https://github.com/DataBiosphere/dsub/blob/main/docs/job_control.md).\n\n### Retries\n\nIt is possible for `dsub` to automatically retry failed tasks.\nFor details, see [retries with dsub](https://github.com/DataBiosphere/dsub/blob/main/docs/retries.md).\n\n### Labeling jobs and tasks\n\nYou can add custom labels to jobs and tasks, which allows you to monitor and\ncancel tasks using your own identifiers. In addition, with the Google\nproviders, labeling a task will label associated compute resources such as\nvirtual machines and disks.\n\nFor more details, see [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md)\n\n### Viewing job status\n\nThe `dstat` command displays the status of jobs:\n\n    dstat --provider google-v2 --project my-cloud-project\n\nWith no additional arguments, dstat will display a list of *running* jobs for\nthe current `USER`.\n\nTo display the status of a specific job, use the `--jobs` flag:\n\n    dstat --provider google-v2 --project my-cloud-project --jobs job-id\n\nFor a batch job, the output will list all *running* tasks.\n\nEach job submitted by dsub is given a set of metadata values that can be\nused for job identification and job control. The metadata associated with\neach job includes:\n\n*   `job-name`: defaults to the name of your script file or the first word of\n    your script command; it can be explicitly set with the `--name` parameter.\n*   `user-id`: the `USER` environment variable value.\n*   `job-id`: identifier of the job, which can be used in calls to `dstat` and\n    `ddel` for job monitoring and canceling respectively. 
### Job Identifiers

By default, `dsub` generates a `job-id` with the form
`job-name--userid--timestamp` where the `job-name` is truncated at 10 characters
and the `timestamp` is of the form `YYMMDD-HHMMSS-XX`, unique to hundredths of a
second. If you are submitting multiple jobs concurrently, you may still run into
situations where the `job-id` is not unique. If you require a unique `job-id`
for this situation, you may use the `--unique-job-id` parameter.

If the `--unique-job-id` parameter is set, `job-id` will instead be a unique
32-character UUID created with the Python
[uuid module](https://docs.python.org/3/library/uuid.html). Because
some providers require that the `job-id` begin with a letter, `dsub` will
replace any starting digit with a letter in a manner that preserves uniqueness.

### Submitting a batch job

Each of the examples above has demonstrated submitting a single task with
a single set of variables, inputs, and outputs. If you have a batch of inputs
and you want to run the same operation over them, `dsub` allows you
to create a batch job.

Instead of calling `dsub` repeatedly, you can create
a tab-separated values (TSV) file containing the variables,
inputs, and outputs for each task, and then call `dsub` once.
The result will be a single `job-id` with multiple tasks. The tasks will
be scheduled and run independently, but can be
[monitored](https://github.com/DataBiosphere/dsub#viewing-job-status) and
[deleted](https://github.com/DataBiosphere/dsub#deleting-a-job) as a group.

#### Tasks file format

The first line of the TSV file specifies the names and types of the
parameters. For example:

    --env SAMPLE_ID<tab>--input VCF_FILE<tab>--output OUTPUT_PATH

Each line beyond the header provides the variable, input, and output values
for a separate task.

Multiple `--env`, `--input`, and `--output` parameters can be specified, and
they can appear in any order. For example:

    --env SAMPLE<tab>--input A<tab>--input B<tab>--env REFNAME<tab>--output O
    S1<tab>gs://path/A1.txt<tab>gs://path/B1.txt<tab>R1<tab>gs://path/O1.txt
    S2<tab>gs://path/A2.txt<tab>gs://path/B2.txt<tab>R2<tab>gs://path/O2.txt

#### Tasks parameter

Pass the TSV file to dsub using the `--tasks` parameter. This parameter
accepts the file path and, optionally, a range of tasks to process.
The file may be read from the local filesystem (on the machine you're calling
`dsub` from), or from a bucket in Google Cloud Storage (file name starts with
"gs://").

For example, suppose `my-tasks.tsv` contains 101 lines: a one-line header and
100 lines of parameters for tasks to run. Then:

    dsub ... --tasks ./my-tasks.tsv

will create a job with 100 tasks, while:

    dsub ... --tasks ./my-tasks.tsv 1-10

will create a job with 10 tasks, one for each of lines 2 through 11.

The task range values can take any of the following forms:

*   `m` indicates to submit task `m` (line m+1)
*   `m-` indicates to submit all tasks starting with task `m`
*   `m-n` indicates to submit all tasks from `m` to `n` (inclusive).

### Logging

The `--logging` flag points to a location for `dsub` task log files. For details
on how to specify your logging path, see [Logging](https://github.com/DataBiosphere/dsub/blob/main/docs/logging.md).

### Job control

It's possible to wait for a job to complete before starting another.
For details, see [job control with dsub](https://github.com/DataBiosphere/dsub/blob/main/docs/job_control.md).

### Retries

It is possible for `dsub` to automatically retry failed tasks.
For details, see [retries with dsub](https://github.com/DataBiosphere/dsub/blob/main/docs/retries.md).
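As a hedged sketch of how job control and retries can combine (the project,
region, and bucket names are placeholders, and the job-id capture assumes
`dsub` prints the job-id to stdout as described in the job control
documentation): the first command submits a job, and the second starts only
after the first completes, retrying failed tasks up to three times and waiting
for the result.

    # Submit a first job and capture its job-id from stdout.
    JOB_A=$(dsub \
        --provider google-v2 \
        --project my-cloud-project \
        --regions us-central1 \
        --logging gs://my-bucket/logs/ \
        --command 'echo "step one"')

    # Submit a second job that runs only after the first completes;
    # retry each failed task up to 3 times and block until the job finishes.
    dsub \
        --provider google-v2 \
        --project my-cloud-project \
        --regions us-central1 \
        --logging gs://my-bucket/logs/ \
        --after "${JOB_A}" \
        --retries 3 \
        --wait \
        --command 'echo "step two"'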
### Labeling jobs and tasks

You can add custom labels to jobs and tasks, which allows you to monitor and
cancel tasks using your own identifiers. In addition, with the Google
providers, labeling a task will label associated compute resources such as
virtual machines and disks.

For more details, see [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md).

### Viewing job status

The `dstat` command displays the status of jobs:

    dstat --provider google-v2 --project my-cloud-project

With no additional arguments, dstat will display a list of *running* jobs for
the current `USER`.

To display the status of a specific job, use the `--jobs` flag:

    dstat --provider google-v2 --project my-cloud-project --jobs job-id

For a batch job, the output will list all *running* tasks.

Each job submitted by dsub is given a set of metadata values that can be
used for job identification and job control. The metadata associated with
each job includes:

*   `job-name`: defaults to the name of your script file or the first word of
    your script command; it can be explicitly set with the `--name` parameter.
*   `user-id`: the `USER` environment variable value.
*   `job-id`: identifier of the job, which can be used in calls to `dstat` and
    `ddel` for job monitoring and canceling, respectively. See
    [Job Identifiers](https://github.com/DataBiosphere/dsub#job-identifiers) for more
    details on the `job-id` format.
*   `task-id`: if the job is submitted with the `--tasks` parameter, each task
    gets a sequential value of the form "task-*n*" where *n* is 1-based.

Note that the job metadata values will be modified to conform with the "Label
Restrictions" listed in the [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md)
guide.

Metadata can be used to cancel a job or individual tasks within a batch job.

For more details, see [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md).

#### Summarizing job status

By default, dstat outputs one line per task. If you're running a batch job with
many tasks, you may benefit from `--summary`.

```
$ dstat --provider google-v2 --project my-project --status '*' --summary

Job Name        Status         Task Count
-------------   -------------  -------------
my-job-name     RUNNING        2
my-job-name     SUCCESS        1
```

In this mode, dstat prints one line per (job name, task status) pair. You can
see at a glance how many tasks are finished, how many are still running, and
how many have failed or been canceled.

### Deleting a job

The `ddel` command will delete running jobs.

By default, only jobs submitted by the current user will be deleted.
Use the `--users` flag to specify other users, or `'*'` for all users.

To delete a running job:

    ddel --provider google-v2 --project my-cloud-project --jobs job-id

If the job is a batch job, all running tasks will be deleted.

To delete specific tasks:

    ddel \
        --provider google-v2 \
        --project my-cloud-project \
        --jobs job-id \
        --tasks task-id1 task-id2

To delete all running jobs for the current user:

    ddel --provider google-v2 --project my-cloud-project --jobs '*'

## Service Accounts and Scope (Google providers only)

When you run the `dsub` command with the `google-v2` or `google-cls-v2`
provider, there are two different sets of credentials to consider:

- Account submitting the `pipelines.run()` request to run your command/script on a VM
- Account accessing Cloud resources (such as files in GCS) when executing your command/script

The account used to submit the `pipelines.run()` request is typically your
end-user credentials. You would have set this up by running:

    gcloud auth application-default login

The account used on the VM is a [service account](https://cloud.google.com/iam/docs/service-accounts).
The image below illustrates this:

![Pipelines Runner Architecture](./docs/images/pipelines_runner_architecture.png)

By default, `dsub` will use the [default Compute Engine service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account)
as the authorized service account on the VM instance. You can choose to specify
the email address of another service account using `--service-account`.

By default, `dsub` will grant the following access scopes to the service account:

- https://www.googleapis.com/auth/bigquery
- https://www.googleapis.com/auth/compute
- https://www.googleapis.com/auth/devstorage.full_control
- https://www.googleapis.com/auth/genomics
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring.write

In addition, [the API](https://cloud.google.com/life-sciences/docs/reference/rest/v2beta/projects.locations.pipelines/run#serviceaccount) will always add this scope:

- https://www.googleapis.com/auth/cloud-platform

You can choose to specify scopes using `--scopes`.

### Recommendations for service accounts

While it is straightforward to use the default service account, this account also
has broad privileges granted to it by default. Following the
[Principle of Least Privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege),
you may want to create and use a service account that has only the privileges
needed to run your `dsub` command/script.

To create a new service account, follow the steps below:

1. Execute the `gcloud iam service-accounts create` command. The email address
of the service account will be `sa-name@project-id.iam.gserviceaccount.com`.

        gcloud iam service-accounts create "sa-name"

2. Grant IAM access on buckets, etc. to the service account.

        gsutil iam ch serviceAccount:sa-name@project-id.iam.gserviceaccount.com:roles/storage.objectAdmin gs://bucket-name

3. Update your `dsub` command to include `--service-account`:

        dsub \
          --service-account sa-name@project-id.iam.gserviceaccount.com \
          ...

## What next?

*   See the examples:

    *   [Custom scripts](https://github.com/DataBiosphere/dsub/tree/main/examples/custom_scripts)
    *   [Decompress files](https://github.com/DataBiosphere/dsub/tree/main/examples/decompress)
    *   [FastQC](https://github.com/DataBiosphere/dsub/tree/main/examples/fastqc)
    *   [Samtools index](https://github.com/DataBiosphere/dsub/tree/main/examples/samtools)

*   See more documentation for:

    *   [Scripts, Commands, and Docker](https://github.com/DataBiosphere/dsub/blob/main/docs/code.md)
    *   [Input and Output File Handling](https://github.com/DataBiosphere/dsub/blob/main/docs/input_output.md)
    *   [Logging](https://github.com/DataBiosphere/dsub/blob/main/docs/logging.md)
    *   [Compute Resources](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_resources.md)
    *   [Compute Quotas](https://github.com/DataBiosphere/dsub/blob/main/docs/compute_quotas.md)
    *   [Job Control](https://github.com/DataBiosphere/dsub/blob/main/docs/job_control.md)
    *   [Retries](https://github.com/DataBiosphere/dsub/blob/main/docs/retries.md)
    *   [Checking Status and Troubleshooting Jobs](https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md)
    *   [Backend providers](https://github.com/DataBiosphere/dsub/blob/main/docs/providers/README.md)
    "bugtrack_url": null,
    "license": "Apache",
    "summary": "A command-line tool that makes it easy to submit and run batch scripts in the cloud",
    "version": "0.4.11",
    "project_urls": {
        "Homepage": "https://github.com/DataBiosphere/dsub"
    },
    "split_keywords": [
        "cloud",
        "bioinformatics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b2cea2af3d8787d91406b06fe4b69b0842148702dd81deddb32cf703a13b86c5",
                "md5": "4be4bf4e420aef52d5efceee1a4c797c",
                "sha256": "9cbe499b366595b49dfef4161934b993edc97d414e0a1054d10f2d4e37310bc9"
            },
            "downloads": -1,
            "filename": "dsub-0.4.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4be4bf4e420aef52d5efceee1a4c797c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 186119,
            "upload_time": "2024-05-06T21:55:25",
            "upload_time_iso_8601": "2024-05-06T21:55:25.016642Z",
            "url": "https://files.pythonhosted.org/packages/b2/ce/a2af3d8787d91406b06fe4b69b0842148702dd81deddb32cf703a13b86c5/dsub-0.4.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "17aafe7e4a707e5ad2279f2c0ad98ab335781c2db224942a842cb6d48184ada8",
                "md5": "c799d6f365063ed84bc3ccb67dfaa5ef",
                "sha256": "8e343f89efff0680a419f236d8815b7d7045c5cdfc953e35de50e7dc60cd57db"
            },
            "downloads": -1,
            "filename": "dsub-0.4.11.tar.gz",
            "has_sig": false,
            "md5_digest": "c799d6f365063ed84bc3ccb67dfaa5ef",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 162083,
            "upload_time": "2024-05-06T21:55:26",
            "upload_time_iso_8601": "2024-05-06T21:55:26.788502Z",
            "url": "https://files.pythonhosted.org/packages/17/aa/fe7e4a707e5ad2279f2c0ad98ab335781c2db224942a842cb6d48184ada8/dsub-0.4.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-06 21:55:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "DataBiosphere",
    "github_project": "dsub",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "lcname": "dsub"
}
        