# Analysis runner
This tool helps [make analysis results reproducible](https://github.com/populationgenomics/team-docs/blob/main/reproducible_analyses.md)
by automating the following aspects:

- Allow quick iteration using an environment that resembles production.
- Only allow access to production datasets through code that has been reviewed.
- Link the output data to the exact program invocation that generated it.
One of our main workflow pipeline systems at the CPG is
[Hail Batch](https://hail.is/docs/batch/getting_started.html). By default, its
pipelines are defined by running a Python program
_locally_. This tool instead lets you run the "driver" on Hail Batch itself.
Furthermore, all invocations are logged together with the output data, and are also recorded in [Airtable](https://airtable.com/tblx9NarwtJwGqTPA/viwIomAHV49Stq5zr) and the sample-metadata server.
When using the analysis-runner, the batch jobs are not run under your standard
Hail Batch [service account user](https://hail.is/docs/batch/service.html#sign-up)
(`<USERNAME>-trial`). Instead, a separate Hail Batch account is
used to run the batch jobs on your behalf. There's a dedicated Batch service
account for each dataset (e.g. "tob-wgs", "fewgenomes") and access level
("test", "standard", or "full", as documented in the team docs
[storage policies](https://github.com/populationgenomics/team-docs/tree/main/storage_policies#analysis-runner)),
which helps with bucket permission management and billing budgets.
Note that you can use the analysis-runner to start arbitrary jobs, e.g. R scripts. They're just launched in the Hail Batch environment, but you can use any Docker image you like.
The analysis-runner is also integrated with our Cromwell server to run WDL-based workflows.
## CLI
The analysis-runner CLI can be used to start pipelines based on a GitHub repository,
commit, and command to run.
First, make sure that your environment provides Python 3.10 or newer:
```sh
> python3 --version
Python 3.10.7
```
If the installed version is too old, you can use `brew` on macOS to install a newer one, e.g.:
```sh
brew install python@3.10
```
Then install the `analysis-runner` Python package using `pip`:
```bash
python3 -m pip install analysis-runner
```
Run `analysis-runner --help` to see usage information.
Make sure that you're logged into GCP:
```bash
gcloud auth application-default login
```
If you're in the directory of the project you want to run, you can omit the
`--commit` and `--repository` parameters; your current git remote and HEAD commit
will be used instead.
For example:
```bash
analysis-runner \
  --dataset <dataset> \
  --description <description> \
  --access-level <level> \
  --output-dir <directory-within-bucket> \
  script_to_run.py with arguments
```
`<level>` corresponds to an [access level](https://github.com/populationgenomics/team-docs/tree/main/storage_policies#analysis-runner) as defined in the storage policies.
`<directory-within-bucket>` does _not_ contain a prefix like `gs://cpg-fewgenomes-main/`. For example, if you want your results to be stored in `gs://cpg-fewgenomes-main/1kg_pca/v2`, specify `--output-dir 1kg_pca/v2`.
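For instance, here is a concrete sketch (the script name and description are placeholders; per the storage policies linked above, the `standard` access level writes to the dataset's `-main` bucket):

```bash
# Illustrative invocation: outputs end up under gs://cpg-fewgenomes-main/1kg_pca/v2
analysis-runner \
  --dataset fewgenomes \
  --access-level standard \
  --description "1KG PCA, version 2" \
  --output-dir 1kg_pca/v2 \
  main.py
```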
If you provide a `--repository`, you MUST supply a `--commit <SHA>`, e.g.:
```bash
analysis-runner \
  --repository my-approved-repo \
  --commit <commit-sha> \
  --dataset <dataset> \
  --description <description> \
  --access-level <level> \
  --output-dir <directory-within-bucket> \
  script_to_run.py with arguments
```
For more examples (including for running an R script and dataproc), see the
[examples](examples) directory.
## Custom Docker images
The default driver image that's used to run scripts comes with Hail and some statistics libraries preinstalled (see the corresponding [Hail Dockerfile](driver/Dockerfile.hail)). It's possible to use any custom Docker image instead, using the `--image` parameter. Note that any such image needs to contain the critical dependencies as specified in the [`base` image](driver/Dockerfile.base).
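For example, here is a hypothetical invocation with a custom image (the image path is a placeholder, and the image is assumed to include the base-image dependencies mentioned above):

```bash
analysis-runner \
  --dataset <dataset> \
  --description "Run with a custom image" \
  --access-level test \
  --output-dir my-analysis/v1 \
  --image <registry>/<custom-image>:<tag> \
  script_to_run.py
```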
For R scripts, we add the R-tidyverse set of packages to the base image; see the corresponding [R Dockerfile](driver/Dockerfile.r) and the [R example](examples/r) for more details.
## Helper for Hail Batch
The analysis-runner package provides a number of functions that make it easier to run reproducible analyses through Hail Batch.
These are preinstalled in the analysis-runner driver image, i.e. you can access the `analysis_runner` module when running scripts through the analysis-runner.
### Checking out a git repository at the current commit
```python
import hailtop.batch as hb
from cpg_utils.git import (
    prepare_git_job,
    get_repo_name_from_current_directory,
    get_git_commit_ref_of_current_repository,
)

b = hb.Batch('do-some-analysis')
j = b.new_job('checkout_repo')
prepare_git_job(
    job=j,
    organisation='populationgenomics',
    # you could specify a repository name here, e.g. 'analysis-runner'
    repo_name=get_repo_name_from_current_directory(),
    # you could pin a specific commit here, e.g. '1be7bb44de6182d834d9bbac6036b841f459a11a'
    commit=get_git_commit_ref_of_current_repository(),
)
# The working directory of j is now the checked-out repository
j.command('examples/bash/hello.sh')

# Submit the batch
b.run(wait=False)
```
### Running a dataproc script
```python
import hailtop.batch as hb
from analysis_runner import dataproc
b = hb.Batch('do-some-analysis')
# starts up a cluster, and submits a script to the cluster,
# see the definition for more information about how you can configure the cluster
# https://github.com/populationgenomics/analysis-runner/blob/main/analysis_runner/dataproc.py#L80
cluster = dataproc.setup_dataproc(
    b,
    max_age='1h',
    packages=['click', 'selenium'],
    init=['gs://cpg-common-main/hail_dataproc/install_common.sh'],
    cluster_name='My Cluster with max-age=1h',
)
cluster.add_job('examples/dataproc/query.py', job_name='example')

# Submit the batch
b.run(wait=False)
```
## Development
You can ignore this section if you just want to run the tool.
To set up a development environment for the analysis runner using pip, run
the following:
```bash
pip install -r requirements-dev.txt
pre-commit install --install-hooks
pip install --editable .
```
### Deployment
1. Add a Hail Batch service account for all supported datasets.
1. [Copy the Hail tokens](tokens) to the Secret Manager.
1. Deploy the [server](server) by invoking the [`deploy_server` workflow](https://github.com/populationgenomics/analysis-runner/blob/main/.github/workflows/deploy_server.yaml) manually.
1. Deploy the [Airtable](airtable) publisher.
1. Publish the [CLI tool and library](analysis_runner) to PyPI.
The CLI tool is shipped as a pip package. To build a new version,
we use [bump2version](https://pypi.org/project/bump2version/).
For example, to increment the patch section of the version tag 1.0.0 and make
it 1.0.1, run:
```bash
git checkout -b add-new-version
bump2version patch
git push --set-upstream origin add-new-version
# Open pull request
open "https://github.com/populationgenomics/analysis-runner/pull/new/add-new-version"
```
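For reference, `bump2version` also supports the `minor` and `major` parts; the version numbers below are illustrative:

```bash
bump2version patch   # e.g. 1.0.0 -> 1.0.1
bump2version minor   # e.g. 1.0.1 -> 1.1.0
bump2version major   # e.g. 1.1.0 -> 2.0.0
```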
It's important that the pull request name starts with "Bump version:" (which should happen
by default). Once this is merged into `main`, a GitHub Actions workflow will build a
new package and upload it to PyPI, making it available to install with `pip install`.