analysis-runner

- Name: analysis-runner
- Version: 3.0.0
- Homepage: https://github.com/populationgenomics/analysis-runner
- Summary: Analysis runner to help make analysis results reproducible
- Upload time: 2024-04-14 23:16:45
- License: MIT
- Keywords: bioinformatics
# Analysis runner

This tool helps to [improve analysis provenance](https://github.com/populationgenomics/team-docs/blob/main/reproducible_analyses.md) and theoretical reproducibility by automating the following aspects:

- Allowing quick iteration in an environment that resembles production.
- Restricting access to production datasets to code that has been reviewed.
- Linking the output data with the exact program invocation that generated it.

One of our main workflow pipeline systems at the CPG is [Hail Batch](https://hail.is/docs/batch/getting_started.html). By default, its pipelines are defined by running a Python program _locally_ and submitting the resulting DAG to the Hail Batch server. By specifying a repo, commit, and file, this tool will run your script inside a "driver" image on Hail Batch, with the correct permissions.

All invocations are logged to metamist, in the [analysis-runner page](https://sample-metadata.populationgenomics.org.au/analysis-runner/).

When using the analysis-runner, the jobs are run as a specific Hail Batch service account to give appropriate permissions based on the dataset, and access level ("test", "standard", or "full", as documented in the team docs [storage policies](https://github.com/populationgenomics/team-docs/tree/main/storage_policies#analysis-runner)). This helps with bucket permission management and billing budgets.
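As a rough illustration of how the access level relates to where a job may write, consider the `gs://cpg-<dataset>-<suffix>` bucket naming convention. The sketch below is hypothetical and for intuition only; the authoritative layout and permissions are defined by the storage policies linked above, not by this code:

```python
# Hypothetical sketch: map an access level to the bucket a job writes to,
# following the gs://cpg-<dataset>-<suffix> naming convention. This is NOT
# the actual analysis-runner permission logic.
ACCESS_LEVEL_TO_SUFFIX = {
    'test': 'test',      # test data only
    'standard': 'main',  # production data, reviewed code required
    'full': 'main',      # production data, with broader permissions
}

def output_bucket(dataset: str, access_level: str) -> str:
    """Return the illustrative output bucket for a dataset and access level."""
    return f"gs://cpg-{dataset}-{ACCESS_LEVEL_TO_SUFFIX[access_level]}"
```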

By default, we run your script in a driver image that contains a number of common tools, but you can run any container from the cpg-common artifact registry (or any container at all when running at the test access level).

The analysis-runner is also integrated with our Cromwell server to run WDL-based workflows.

## CLI

The analysis-runner CLI is used to start pipelines based on a GitHub repository, commit, and command to run.

First, make sure that your environment provides Python 3.10 or newer. We recommend using `pyenv` to manage your Python versions:

```sh
pyenv install 3.10.12
pyenv global 3.10.12
python3 --version
# Python 3.10.12
```

Then install the `analysis-runner` Python package using `pip`:

```bash
python3 -m pip install analysis-runner
```

Run `analysis-runner --help` to see usage information.

Make sure that you're logged into GCP with _application-default_ credentials:

```bash
gcloud auth application-default login
```

If you're in the directory of the project you want to run, you can omit the `--commit` and `--repository` parameters, which will use your current git remote and commit HEAD.
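That auto-detection boils down to asking git for the current remote URL and the HEAD commit; a minimal sketch of the idea (the helpers below are illustrative, not the actual CLI code):

```python
import subprocess

def current_commit() -> str:
    """SHA of HEAD in the current working directory (requires a git checkout)."""
    return subprocess.run(
        ['git', 'rev-parse', 'HEAD'],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def repo_name_from_remote(url: str) -> str:
    """Derive the repository name from a git remote URL."""
    name = url.rstrip('/').split('/')[-1]
    return name[:-4] if name.endswith('.git') else name
```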

For example:

```bash
analysis-runner \
    --dataset <dataset> \
    --description <description> \
    --access-level <level> \
    --output-dir <directory-within-bucket> \
    script_to_run.py with arguments
```

`<level>` corresponds to an [access level](https://github.com/populationgenomics/team-docs/tree/main/storage_policies#analysis-runner) as defined in the storage policies.

`<directory-within-bucket>` does _not_ contain a prefix like `gs://cpg-fewgenomes-main/`. For example, if you want your results to be stored in `gs://cpg-fewgenomes-main/1kg_pca/v2`, specify `--output-dir 1kg_pca/v2`.
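Put differently, the final destination is simply the dataset bucket joined with the relative output directory; a tiny sketch (the bucket name reuses the example above, and `full_output_path` is an illustrative helper, not part of the package):

```python
def full_output_path(bucket: str, output_dir: str) -> str:
    """Join a dataset bucket with a relative --output-dir value."""
    return f"{bucket.rstrip('/')}/{output_dir.lstrip('/')}"

full_output_path('gs://cpg-fewgenomes-main', '1kg_pca/v2')
# → 'gs://cpg-fewgenomes-main/1kg_pca/v2'
```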

If you provide a `--repository`, you must also supply a `--commit <SHA>`, e.g.:

```bash
analysis-runner \
    --repository my-approved-repo \
    --commit <commit-sha> \
    --dataset <dataset> \
    --description <description> \
    --access-level <level> \
    --output-dir <directory-within-bucket> \
    script_to_run.py with arguments
```

For more examples (including for running an R script and dataproc), see the
[examples](examples) directory.

### GitHub Authentication

If you are submitting an analysis-runner job that needs to clone a private repository owned by populationgenomics on GitHub (e.g. submitting a script to the analysis-runner from a private repository), the analysis-runner should insert the following items into your `config.toml`:

```toml
[infrastructure]
git_credentials_secret_name = '<ask_software_team_for_secret_name>'
git_credentials_secret_project = '<ask_software_team_for_secret_project>'
```

If you specify multiple configuration files, take care not to accidentally override these values.
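The risk exists because, for duplicate keys, later configuration files take precedence in the usual merge-by-key fashion; a minimal sketch of that behaviour using plain dictionaries standing in for parsed TOML (not the actual cpg-utils merge code):

```python
# Parsed contents of two hypothetical config.toml files.
base = {
    'infrastructure': {
        'git_credentials_secret_name': 'secret-name',
        'git_credentials_secret_project': 'secret-project',
    }
}
override = {'infrastructure': {'git_credentials_secret_name': 'something-else'}}

# Shallow merge of the [infrastructure] tables: keys from the later
# file win, so the credentials secret set earlier is silently replaced.
merged = {**base['infrastructure'], **override['infrastructure']}
```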

## Custom Docker images

The default driver image that's used to run scripts comes with Hail and some statistics libraries preinstalled (see the corresponding [Hail Dockerfile](driver/Dockerfile.hail)). It's possible to use any custom Docker image instead, using the `--image` parameter. Note that any such image needs to contain the critical dependencies as specified in the [`base` image](driver/Dockerfile.base).

For R scripts, we add the R tidyverse set of packages to the base image; see the corresponding [R Dockerfile](driver/Dockerfile.r) and the [R example](examples/r) for more details.

## Helper for Hail Batch

The analysis-runner package has a number of functions that make it easier to run reproducible analyses through Hail Batch.

These are installed in the analysis-runner driver image, i.e. you can access the `analysis_runner` module when running scripts through the analysis-runner.

### Checking out a git repository at the current commit

```python
from cpg_utils.hail_batch import get_batch
from analysis_runner.git import (
  prepare_git_job,
  get_repo_name_from_current_directory,
  get_git_commit_ref_of_current_repository,
)

b = get_batch(name='do-some-analysis')
j = b.new_job('checkout_repo')
prepare_git_job(
  job=j,
  # you could specify a name here, like 'analysis-runner'
  repo_name=get_repo_name_from_current_directory(),
  # you could specify the specific commit here, eg: '1be7bb44de6182d834d9bbac6036b841f459a11a'
  commit=get_git_commit_ref_of_current_repository(),
)

# Now, the working directory of j is the checked-out repository
j.command('examples/bash/hello.sh')
```

### Running a dataproc script

```python
from cpg_utils.hail_batch import get_batch
from analysis_runner.dataproc import setup_dataproc

b = get_batch(name='do-some-analysis')

# starts up a cluster, and submits a script to the cluster,
# see the definition for more information about how you can configure the cluster
# https://github.com/populationgenomics/analysis-runner/blob/main/analysis_runner/dataproc.py#L80
cluster = setup_dataproc(
    b,
    max_age='1h',
    packages=['click', 'selenium'],
    cluster_name='My Cluster with max-age=1h',
)
cluster.add_job('examples/dataproc/query.py', job_name='example')
```

## Development

You can ignore this section if you just want to run the tool.

To set up a development environment for the analysis runner using pip, run
the following:

```bash
pip install -r requirements-dev.txt
pip install --editable .
```

### Deployment

The server can be deployed by manually running the [`deploy_server.yaml`](https://github.com/populationgenomics/analysis-runner/actions/workflows/deploy_server.yaml) GitHub action. This will also deploy the driver image.

The CLI tool is shipped as a pip package; this happens automatically on pushes to `main`. To build a new version, add a [bump2version](https://pypi.org/project/bump2version/) commit to your branch.

            
