cpg-flow

Name	cpg-flow JSON
Version	0.1.2 JSON
	download
home_page	None
Summary	CPG Flow API for Hail Batch
upload_time	2025-02-05 03:10:35
maintainer	None
docs_url	None
author	None
requires_python	<3.11,>=3.10
license	MIT License Copyright (c) 2022 Centre for Population Genomics Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	hail flow api bioinformatics genomics
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            <!-- markdownlint-disable MD033 MD024 -->
# 🐙 CPG Flow

<img src="/assets/DNA_CURIOUS_FLOYD_CROPPED.png" height="300" alt="CPG Flow logo" align="right"/>

![Python](https://img.shields.io/badge/-Python-black?style=for-the-badge&logoColor=white&logo=python&color=2F73BF)

[![⚙️ Test Workflow](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml)
[![🚀 Deploy To Production Workflow](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml)
[![GitHub Latest Main Release](https://img.shields.io/github/v/release/populationgenomics/cpg-flow?label=main%20release)](https://GitHub.com/populationgenomics/cpg-flow/releases/)
[![GitHub Release](https://img.shields.io/github/v/release/populationgenomics/cpg-flow?include_prereleases&label=latest)](https://GitHub.com/populationgenomics/cpg-flow/releases/)
[![semantic-release: conventional commits](https://img.shields.io/badge/semantic--release-conventional%20commits-Æ1A7DBD?logo=semantic-release&color=1E7FBF)](https://github.com/semantic-release/semantic-release)
[![GitHub license](https://img.shields.io/github/license/populationgenomics/cpg-flow.svg)](https://github.com/populationgenomics/cpg-flow/blob/main/LICENSE)

[![Technical Debt](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=sqale_index&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Duplicated Lines (%)](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=duplicated_lines_density&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Code Smells](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=code_smells&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)

<br />

## 📋 Table of Contents

1. 🐙 [What is this API ?](#what-is-this-api)
2. ✨ [Production and development links](#production-and-development-links)
3. 🔨 [Installation](#installation)
4. 🚀 [Build](#build)
5. 🤖 [Usage](#usage)
6. 😵‍💫 [Key Considerations and Limitations](#key-considerations-and-limitations)
7. 🐳 [Docker](#docker)
8. 💯 [Tests](#tests)
9. ☑️ [Code analysis and consistency](#code-analysis-and-consistency)
10. 📈 [Releases & Changelog](#versions)
11. 🎬 [GitHub Actions](#github-actions)
12. ©️ [License](#license)
13. ❤️ [Contributors](#contributors)

## <a name="what-is-this-api">🐙 What is this API ?</a>

Welcome to CPG Flow!

This API provides a set of tools and workflows for managing population genomics data pipelines, designed to streamline the processing, analysis, and storage of large-scale genomic datasets. It facilitates automated pipeline execution, enabling reproducible research while integrating with cloud-based resources for scalable computation.

CPG Flow supports various stages of genomic data processing, from raw data ingestion to final analysis outputs, making it easier for researchers to manage and scale their population genomics workflows.

The API constructs a DAG (Directed Acyclic Graph) structure from a set of chained stages. This DAG structure then forms the **pipeline**.

## <a name="documentation">✨ Documentation</a>

### 🌐 Production

The production version of this API is documented at **[populationgenomics.github.io/cpg-flow/](https://populationgenomics.github.io/cpg-flow/)**.

The documentation is updated automatically when a commit is pushed on the `alpha` (prerelease) or `main` (release) branch.

## <a name="installation">🔨 Installation</a>

The packages are hosted on:

![PyPI](https://img.shields.io/badge/-PyPI-black?style=for-the-badge&logoColor=white&logo=pypi&color=3776AB)

To install this project, you will need to have Python and `uv` installed on your machine:

![uv](https://img.shields.io/badge/-uv-black?style=for-the-badge&logoColor=white&logo=uv&color=3776AB&link=https://docs.astral.sh/uv/)
![Python](https://img.shields.io/badge/-Python-black?style=for-the-badge&logoColor=white&logo=python&color=3776AB)

Run the following commands, to create a virtual environment with `uv` and install the dependencies:

```bash
# Install the package using uv
uv sync

# Or equivalently use make (also installs pre-commit)
make init
```

### 🛠️ Development

To setup for development we recommend using the makefile setup

```bash
make init-dev # installs pre-commit as a hook
```

To install `cpg-flow` locally, run:

```bash
make install-local
```

To try out the pre-installed `cpg-flow` in a Docker image, find more information in the **[Docker](#docker)** section.

## <a name="build">🚀 Build</a>

To build the project, run the following command:

```bash
make build
```

To make sure that you're actually using the installed build we suggest calling the following to install the build wheel.

```bash
make install-build
```

## <a name="usage">🤖 Usage</a>

This project provides the framework to construct pipelines but does not offer hosting the logic of any pipelines themselves. This approach offers the benefit of making all components more modular, manageable and decoupled. Pipelines themselves are hosted in a separate repository.

The [test_workflows_shared repository](https://github.com/populationgenomics/test_workflows_shared) acts as a template and demonstrates how to structure a pipeline using CPG Flow.

The components required to build pipelines with CPG Flow:

### config `.toml` file

This file contains the configuration settings to your pipeline. This file allows the pipeline developer to define settings such as:

1. what stages will be run or skipped
2. what dataset to use
3. what access level to use
4. any input cohorts
5. sequencing type

```toml
[workflow]
dataset = 'fewgenomes'

# Note: for fewgenomes and sandbox mentioning datasets by name is not a security risk
# DO NOT DO THIS FOR OTHER DATASETS

input_cohorts = ['COH2142']
access_level = 'test'

# Force stage rerun
force_stages = [
    'GeneratePrimes', # the first stage
    'CumulativeCalc', # the second stage
    'FilterEvens', # the third stage
    'BuildAPrimePyramid', # the last stage
]

# Show a workflow graph locally or save to web bucket.
# Default is false, set to true to show the workflow graph.
show_workflow = true
# ...
```

For a full list of supported config options with documentation, see [defaults.toml](src/cpg_flow/defaults.toml)

This `.toml` file will be may be named anything, as long as it is correctly passed to the `analysis-runner` invocation. The `analysis-runner` supplies its own default settings, and combines it with the settings from this file, before submitting a job.

### `main.py` or equivalent entrypoint for the pipeline

This file would store the workflow definition as a list of stages, and then run said workflow:

```python
 import os
 from pathlib import Path
 from cpg_flow.workflow import run_workflow
 from cpg_utils.config import set_config_paths
 from stages import BuildAPrimePyramid, CumulativeCalc, FilterEvens, GeneratePrimes

 CONFIG_FILE = str(Path(__file__).parent / '<YOUR_CONFIG>.toml')

 def run_cpg_flow(dry_run=False):

    #See the 'Key Considerations and Limitations' section for notes on the definition of the `workflow` variable.

    # This represents the flow of the DAG
     workflow = [GeneratePrimes, CumulativeCalc, FilterEvens, BuildAPrimePyramid]

     config_paths = os.environ['CPG_CONFIG_PATH'].split(',')

     # Inserting after the "defaults" config, but before user configs:
     set_config_paths(config_paths[:1] + [CONFIG_FILE] + config_paths[1:])
     run_workflow(stages=workflow, dry_run=dry_run)

 if __name__ == '__main__':
   run_cpg_flow()
```

  The workflow definition here forms a DAG (Directed Acyclic Graph) structure.

  ![DAG](assets/newplot.png)

  > To generate a plot of the DAG, `show_workflow = True` should be included in the config. The DAG plot generated from the pipeline definition is available in the logs via the job URL. To find the link to the plot, search the *Logs* section for the string: "**INFO - Link to the graph:**".

  There are some key considerations and limitations to take into account when designing the DAG:

  - [No Forward Discovery](#no-forward-discovery)
  - [Workflow Definition](#workflow-definition)

### `stages.py` or equivalent file(s) for the `Stage` definitions

A `Stage` represents a node in the DAG. The stages can be abstracted from either a `DatasetStage`, `CohortStage`, `MultiCohortStage`, or a `SequencingGroupStage`.

The stage definition should use the `@stage` decorator to ***optionally*** set:

- dependent stages (this is used to build the DAG)
- analysis keys (this determines what outputs should be written to metamist)
- the analysis type (this determines the analysis-type to be written to metamist)

All stages require an `expected_outputs` class method definition, that sets the expected output path location for a given `Target` such as a `SequencingGroup`, `Dataset`, `Cohort`, or `MultiCohort`.

Also required, is a `queue_jobs` class method definition that calls pipeline jobs, and stores the results of these jobs to the paths defined in `expected_outputs`.

It is good practice to separate the `Stage` definitions into their own files, to keep the code compact, and manageable.

```python
from cpg_flow.stage import SequencingGroupStage, StageInput, StageOutput, stage
from cpg_flow.targets.sequencing_group import SequencingGroup
from jobs import cumulative_calc

WORKFLOW_FOLDER = 'prime_pyramid'

# ...
# This stage depends on the `GeneratePrimes` stage, and requires outputs from that stage.
@stage(required_stages=[GeneratePrimes], analysis_keys=['cumulative'], analysis_type='custom')
class CumulativeCalc(SequencingGroupStage):
 def expected_outputs(self, sequencing_group: SequencingGroup):
     return {
         'cumulative': sequencing_group.dataset.prefix() / WORKFLOW_FOLDER / f'{sequencing_group.id}_cumulative.txt',
     }

 def queue_jobs(self, sequencing_group: SequencingGroup, inputs: StageInput) -> StageOutput | None:
     input_txt = inputs.as_path(sequencing_group, GeneratePrimes, 'primes')
     b = get_batch()

     cumulative_calc_output_path = str(self.expected_outputs(sequencing_group).get('cumulative', ''))

     # We define a job instance from the `cumulative_calc` job definition.
     job_cumulative_calc = cumulative_calc(b, sequencing_group, input_txt, cumulative_calc_output_path)

     jobs = [job_cumulative_calc]

     return self.make_outputs(
         sequencing_group,
         data=self.expected_outputs(sequencing_group),
         jobs=jobs,
     )
# ...
```

There is a key consideration to take into account when writing the stages:

- [No Forward Discovery](#no-forward-discovery)

### `jobs.py` or equivalent file for `Job` definitions

Every `Stage` requires a collection of jobs that will be executed within. It is good practice to store these jobs in their own files, as the definitions can often get long.

```python
from cpg_flow.targets.sequencing_group import SequencingGroup
from hailtop.batch import Batch
from hailtop.batch.job import Job


def cumulative_calc(
    b: Batch,
    sequencing_group: SequencingGroup,
    input_file_path: str,
    output_file_path: str,
) -> list[Job]:
    title = f'Cumulative Calc: {sequencing_group.id}'
    job = b.new_job(name=title)
    primes_path = b.read_input(input_file_path)

    cmd = f"""
    primes=($(cat {primes_path}))
    csum=0
    cumulative=()
    for prime in "${{primes[@]}}"; do
        ((csum += prime))
        cumulative+=("$csum")
    done
    echo "${{cumulative[@]}}" > {job.cumulative}
    """

    job.command(cmd)

    print('-----PRINT CUMULATIVE-----')
    print(output_file_path)
    b.write_output(job.cumulative, output_file_path)

    return job
```

Once these required components are written, the pipeline is ready to be executed against this framework.

### Running the pipeline

All pipelines can only be exclusively run using the [`analysis-runner` package](https://pypi.org/project/analysis-runner/) which grants the user appropriate permissions based on the dataset and access level defined above. `analysis-runner` requires a repo, commit and the entrypoint file, and then runs the code inside a "driver" image on Hail Batch, logging the invocation to `metamist` for future audit and reproducibility.

Therefore, the pipeline code needs to be pushed to a remote version control system, for `analysis-runner` to be able to pull it for execution. A job can then be submitted:

```shell
analysis-runner \
  --image "australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:1.0.0" \
  --dataset "fewgenomes" \
  --description "cpg-flow_test" \
  --access-level "test" \
  --output-dir "cpg-flow_test" \
  --config "<YOUR_CONFIG>.toml" \
  workflow.py
```

If the job is successfully created, the analysis-runner output will include a job URL. This driver job will trigger additional jobs, which can be monitored via the `/batches` page on Hail. Monitoring these jobs helps verify that the workflow ran successfully. When all expected jobs complete without errors, this confirms the successful execution of the workflow and indicates that the `cpg_flow` package is functioning as intended.

See the [Docker](#docker) section for instruction on pulling valid images releases.

## <a name="key-considerations-and-limitations">😵‍💫 Key Considerations and Limitations</a>

### No Forward Discovery

 The framework exclusively relies on backward traversal. If a stage is not explicitly or indirectly linked to one of the final stages through the `required_stages` parameter of the `@stage` decorator, it will not be included in the workflow. In other words, stages that are not reachable from a final stage are effectively ignored. This backward discovery approach ensures that only the stages directly required for the specified final stages are included, optimizing the workflow by excluding irrelevant or unused stages.

### Workflow Definition

The workflow definition serves as a lookup table for the final stages. If a final stage is not listed in this definition, it will not be part of the workflow, as there is no mechanism for forward discovery to identify it.

```python
workflow = [GeneratePrimes, CumulativeCalc, FilterEvens, BuildAPrimePyramid]
```

### Config Settings for `expected_outputs`

The `expected_outputs` method is called for every stage in the workflow, even if the `config.toml` configures the stage to be skipped. This ensures that the workflow can validate or reference the expected outputs of all stages.

Since this method may depend on workflow-specific configuration settings, these settings must be present in the workflow configuration, regardless of whether the stage will run. To avoid issues, it is common practice to include dummy values for such settings in the default configuration. This is not the intended behaviour and is marked as an area of improvement in a future release.

### Verifying results of `expected_outputs`

The API uses the results of the `expected_outputs` method to determine whether a stage needs to run. A stage is scheduled for execution only if one or more Path objects returned by `expected_outputs` do not exist in Google Cloud Platform (GCP). If a returned Path object exists, the stage is considered to have already run successfully, and is therefore skipped.

For outputs such as Matrix Tables (.mt), Hail Tables (.ht), or Variant Datasets (.vds), which are complex structures of thousands of files, the check is performed on the `object/_SUCCESS` file to verify that the output was written completely. However, it has been observed that the `object/_SUCCESS` file may be written multiple times during processing, contrary to the expectation that it should only be written once after all associated files have been fully processed.

### `String` outputs from `expected_outputs`

String outputs from the `expected_outputs` method are not checked by the API. This is because string outputs cannot reliably be assumed to represent valid file paths and may instead correspond to other forms of outputs.

### Behavior of `queue_jobs` in relation to `expected_outputs`

When the `expected_outputs` check determines that one or more required files do not exist, and the stage is not configured to be skipped, the `queue_jobs` method is invoked to define the specific work that needs to be scheduled in the workflow.

The `queue_jobs` method runs within the driver image, before any jobs in the workflow are executed. Because of this, it cannot access or read files generated by earlier stages, as those outputs have not yet been created. The actual outputs from earlier jobs only become available as the jobs are executed during runtime.

### Explicit dependency between all jobs from `queue_jobs`

When the `queue_jobs` method schedules a collection of jobs to Hail Batch, one or more jobs are returned from the method, and the framework sets an explicit dependency between *these* jobs, and all jobs from the `Stages` set in the `required_stages` parameter. Therefore, all jobs that run in a Stage must be returned within `queue_jobs` to ensure no jobs start out of sequence. As an example:

```python
# test_workflows_shared/cpg_flow_test/jobs/filter_evens.py
def filter_evens(
    b: Batch,
    inputs: StageInput,
    previous_stage: Stage,
    sequencing_groups: list[SequencingGroup],
    input_files: dict[str, dict[str, Any]],
    sg_outputs: dict[str, dict[str, Any]],
    output_file_path: str,
) -> list[Job]:
    title = 'Filter Evens'

    # Compute the no evens list for each sequencing group
    sg_jobs = []
    sg_output_files = []
    for sg in sequencing_groups:  # type: ignore
        job = b.new_job(name=title + ': ' + sg.id)
        ...

        cmd = f"""
        ...
        """

        job.command(cmd)
        b.write_output(job.sg_no_evens_file, no_evens_output_file_path)
        sg_jobs.append(job)

    # Merge the no evens lists for all sequencing groups into a single file
    job = b.new_job(name=title)
    job.depends_on(*sg_jobs)
    inputs = ' '.join([b.read_input(f) for f in sg_output_files])
    job.command(f'cat {inputs} >> {job.no_evens_file}')
    b.write_output(job.no_evens_file, output_file_path)

    # ALL jobs are returned back to `queue_jobs`
    # including new jobs created within this job.
    all_jobs = [job, *sg_jobs]
    return all_jobs
```

## <a name="docker">🐳 Docker</a>


## Docker Image Usage for cpg-flow Python Package

### Pulling and Using the Docker Image

These steps are restricted to CPG members only. Anyone will have access to the code in this public repositry and can build a version of cpg-flow themselves. The following requires authentication with the CPG's GCP.

To pull and use the Docker image for the `cpg-flow` Python package, follow these steps:

1. **Authenticate with Google Cloud Registry**:

    ```sh
    gcloud auth configure-docker australia-southeast1-docker.pkg.dev
    ```

2. **Pull the Docker Image**:
    - For alpha releases:

      ```sh
      docker pull australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:0.1.0-alpha.11
      ```

    - For main releases:

      ```sh
      docker pull australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:1.0.0
      ```

3. **Run the Docker Container**:

    ```sh
    docker run -it australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:<tag>
    ```

### Temporary Images for Development

Temporary images are created for each commit and expire in 30 days. These images are useful for development and testing purposes.

- Example of pulling a temporary image:

  ```sh
  docker pull australia-southeast1-docker.pkg.dev/cpg-common/images-tmp/cpg_flow:991cf5783d7d35dee56a7ab0452d54e69c695c4e
  ```

### Accessing Build Images for CPG Members

Members of the CPG can find the build images in the Google Cloud Registry under the following paths:

- Alpha and main releases: `australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow`
- Temporary images: `australia-southeast1-docker.pkg.dev/cpg-common/images-tmp/cpg_flow`

Ensure you have the necessary permissions and are authenticated with Google Cloud to access these images.

### <a name="tests">🧪 Unit and E2E tests</a>

#### Unit Tests

Unit tests are run in the [Test CI workflow](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml) for each branch.

#### E2E Test

We recommend frequently running the manual test workflow found in [test_workflows_shared](https://github.com/populationgenomics/test_workflows_shared)  specifically the `cpg_flow_test` workflow during development to ensure updates work with the CPG production environment.

Docummentation for running the tests are found in the repository readme.


### ▶️ Commands

Before testing, you must follow the **[installation steps](#installation)**.

## <a name="code-analysis-and-consistency">☑️ Code analysis and consistency</a>

### 🔍 Code linting & formatting

![Precommit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)

In order to keep the code clean, consistent and free of bad python practices, more than **Over 10 pre-commit hooks are enabled** !

Complete list of all enabled rules is available in the **[.pre-commit-config.yaml file](https://github.com/populationgenomics/cpg-flow/blob/main/.pre-commit-config.yaml)**.

### ▶️ Commands

Before linting, you must follow the [installation steps](#installation).

Then, run the following command

```bash
# Lint
pre-commit run --all-files
```

When setting up local linting for development you can also run the following once:

```bash
# Install the pre-commit hook
pre-commit install

# Or equivalently
make init || make init-dev
```

### 🥇 Project quality scanner

Multiple tools are set up to maintain the best code quality and to prevent vulnerabilities:

![SonarQube](https://img.shields.io/badge/-SonarQube-black?style=for-the-badge&logoColor=white&logo=sonarqube&color=4E9BCD)

SonarQube summary is available **[here](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)**.

[![Coverage](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=coverage&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Duplicated Lines (%)](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=duplicated_lines_density&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Quality Gate Status](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=alert_status&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)

[![Technical Debt](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=sqale_index&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Vulnerabilities](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=vulnerabilities&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Code Smells](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=code_smells&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)

[![Reliability Rating](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=reliability_rating&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Security Rating](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=security_rating&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)
[![Bugs](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=bugs&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)


## <a name="versions">📈 Releases & Changelog</a>

Releases on **main** branch are generated and published automatically,
pre-releases on the **alpha** branch are also generated and published by:

![Semantic Release](https://img.shields.io/badge/-Semantic%20Release-black?style=for-the-badge&logoColor=white&logo=semantic-release&color=000000)

It uses the **[conventional commit](https://www.conventionalcommits.org/en/v1.0.0/)** strategy.

This is enforced using the **[commitlint](https://github.com/opensource-nepal/commitlint)** pre-commit hook that checks commit messages conform to the conventional commit standard.

We recommend installing and using the tool **[commitizen](https://commitizen-tools.github.io/commitizen/) in order to create commit messages. Once installed, you can use either `cz commit` or `git cz` to create a commitizen generated commit message.

Each change when a new release comes up is listed in the **<a href="https://github.com/populationgenomics/cpg-flow/blob/main/CHANGELOG.md" target="_blank">CHANGELOG.md file</a>**.

Also, you can keep up with changes by watching releases via the **Watch GitHub button** at the top of this page.

#### 🏷️ <a href="https://github.com/populationgenomics/cpg-flow/releases" target="_blank">All releases for this project are available here</a>.

## <a name="github-actions">🎬 GitHub Actions</a>

This project uses **GitHub Actions** to automate some boring tasks.

You can find all the workflows in the **[.github/workflows directory](https://github.com/populationgenomics/cpg-flow/tree/main/.github/workflows).**

### 🎢 Workflows

|                                                   Name                                                   |                                                                                                                        Description & Status                                                                                                                         |                                    Triggered on                                     |
| :------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------: |
|        **[Docker](https://github.com/populationgenomics/cpg-flow/actions/workflows/docker.yaml)**        |             Builds and pushes Docker images for the project.<br/><br/>[![Docker](https://github.com/populationgenomics/cpg-flow/actions/workflows/docker.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/docker.yaml)             | `pull_request` on `main, alpha` and `push` on `main, alpha` and `workflow_dispatch` |
|          **[Lint](https://github.com/populationgenomics/cpg-flow/actions/workflows/lint.yaml)**          |                  Runs linting checks using pre-commit hooks.<br/><br/>[![Lint](https://github.com/populationgenomics/cpg-flow/actions/workflows/lint.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/lint.yaml)                   |                                       `push`                                        |
|       **[Package](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml)**       |  Packages the project and publishes it to PyPI and GitHub Releases.<br/><br/>[![Package](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml)   |                               `push` on `main, alpha`                               |
|      **[Renovate](https://github.com/populationgenomics/cpg-flow/actions/workflows/renovate.yaml)**      |               Runs Renovate to update dependencies.<br/><br/>[![Renovate](https://github.com/populationgenomics/cpg-flow/actions/workflows/renovate.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/renovate.yaml)                |                         `schedule` and `workflow_dispatch`                          |
|  **[Security Checks](https://github.com/populationgenomics/cpg-flow/actions/workflows/security.yaml)**   |          Performs security checks using pip-audit.<br/><br/>[![Security Checks](https://github.com/populationgenomics/cpg-flow/actions/workflows/security.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/security.yaml)          |                           `workflow_dispatch` and `push`                            |
|          **[Test](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml)**          |                Runs unit tests and generates coverage reports.<br/><br/>[![Test](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml)                 |                                       `push`                                        |
| **[Update Badges](https://github.com/populationgenomics/cpg-flow/actions/workflows/update-badges.yaml)** | Updates badges.yaml with test results and coverage.<br/><br/>[![Update Badges](https://github.com/populationgenomics/cpg-flow/actions/workflows/update-badges.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/update-badges.yaml) |                             `workflow_run` (completed)                              |
|       **[mkdocs](https://github.com/populationgenomics/cpg-flow/actions/workflows/web-docs.yaml)**       |              Deploys API documentation to GitHub Pages.<br/><br/>[![mkdocs](https://github.com/populationgenomics/cpg-flow/actions/workflows/web-docs.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/web-docs.yaml)              |                                  `push` on `alpha`                                  |


## <a name="license">©️ License</a>

This project is licensed under the [MIT License](http://opensource.org/licenses/MIT).

## <a name="contributors">❤️ Contributors</a>

There is no contributor yet. Want to be the first ?

If you want to contribute to this project, please read the [**contribution guide**](https://github.com/populationgenomics/cpg-flow/blob/master/CONTRIBUTING.md).

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "cpg-flow",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.11,>=3.10",
    "maintainer_email": null,
    "keywords": "hail, flow, api, bioinformatics, genomics",
    "author": null,
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/59/f8/1a78642c9dfd920b3629c96501c5b69b418cbcf0b0fe3ea7ab46b5b90cbc/cpg_flow-0.1.2.tar.gz",
    "platform": null,
    "description": "<!-- markdownlint-disable MD033 MD024 -->\n# \ud83d\udc19 CPG Flow\n\n<img src=\"/assets/DNA_CURIOUS_FLOYD_CROPPED.png\" height=\"300\" alt=\"CPG Flow logo\" align=\"right\"/>\n\n![Python](https://img.shields.io/badge/-Python-black?style=for-the-badge&logoColor=white&logo=python&color=2F73BF)\n\n[![\u2699\ufe0f Test Workflow](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml)\n[![\ud83d\ude80 Deploy To Production Workflow](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml)\n[![GitHub Latest Main Release](https://img.shields.io/github/v/release/populationgenomics/cpg-flow?label=main%20release)](https://GitHub.com/populationgenomics/cpg-flow/releases/)\n[![GitHub Release](https://img.shields.io/github/v/release/populationgenomics/cpg-flow?include_prereleases&label=latest)](https://GitHub.com/populationgenomics/cpg-flow/releases/)\n[![semantic-release: conventional commits](https://img.shields.io/badge/semantic--release-conventional%20commits-\u00c61A7DBD?logo=semantic-release&color=1E7FBF)](https://github.com/semantic-release/semantic-release)\n[![GitHub license](https://img.shields.io/github/license/populationgenomics/cpg-flow.svg)](https://github.com/populationgenomics/cpg-flow/blob/main/LICENSE)\n\n[![Technical Debt](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=sqale_index&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Duplicated Lines (%)](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=duplicated_lines_density&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Code Smells](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=code_smells&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n\n<br />\n\n## \ud83d\udccb Table of Contents\n\n1. \ud83d\udc19 [What is this API ?](#what-is-this-api)\n2. \u2728 [Production and development links](#production-and-development-links)\n3. \ud83d\udd28 [Installation](#installation)\n4. \ud83d\ude80 [Build](#build)\n5. \ud83e\udd16 [Usage](#usage)\n6. \ud83d\ude35\u200d\ud83d\udcab [Key Considerations and Limitations](#key-considerations-and-limitations)\n7. \ud83d\udc33 [Docker](#docker)\n8. \ud83d\udcaf [Tests](#tests)\n9. \u2611\ufe0f [Code analysis and consistency](#code-analysis-and-consistency)\n10. \ud83d\udcc8 [Releases & Changelog](#versions)\n11. \ud83c\udfac [GitHub Actions](#github-actions)\n12. \u00a9\ufe0f [License](#license)\n13. \u2764\ufe0f [Contributors](#contributors)\n\n## <a name=\"what-is-this-api\">\ud83d\udc19 What is this API ?</a>\n\nWelcome to CPG Flow!\n\nThis API provides a set of tools and workflows for managing population genomics data pipelines, designed to streamline the processing, analysis, and storage of large-scale genomic datasets. It facilitates automated pipeline execution, enabling reproducible research while integrating with cloud-based resources for scalable computation.\n\nCPG Flow supports various stages of genomic data processing, from raw data ingestion to final analysis outputs, making it easier for researchers to manage and scale their population genomics workflows.\n\nThe API constructs a DAG (Directed Acyclic Graph) structure from a set of chained stages. This DAG structure then forms the **pipeline**.\n\n## <a name=\"documentation\">\u2728 Documentation</a>\n\n### \ud83c\udf10 Production\n\nThe production version of this API is documented at **[populationgenomics.github.io/cpg-flow/](https://populationgenomics.github.io/cpg-flow/)**.\n\nThe documentation is updated automatically when a commit is pushed on the `alpha` (prerelease) or `main` (release) branch.\n\n## <a name=\"installation\">\ud83d\udd28 Installation</a>\n\nThe packages are hosted on:\n\n![PyPI](https://img.shields.io/badge/-PyPI-black?style=for-the-badge&logoColor=white&logo=pypi&color=3776AB)\n\nTo install this project, you will need to have Python and `uv` installed on your machine:\n\n![uv](https://img.shields.io/badge/-uv-black?style=for-the-badge&logoColor=white&logo=uv&color=3776AB&link=https://docs.astral.sh/uv/)\n![Python](https://img.shields.io/badge/-Python-black?style=for-the-badge&logoColor=white&logo=python&color=3776AB)\n\nRun the following commands, to create a virtual environment with `uv` and install the dependencies:\n\n```bash\n# Install the package using uv\nuv sync\n\n# Or equivalently use make (also installs pre-commit)\nmake init\n```\n\n### \ud83d\udee0\ufe0f Development\n\nTo setup for development we recommend using the makefile setup\n\n```bash\nmake init-dev # installs pre-commit as a hook\n```\n\nTo install `cpg-flow` locally, run:\n\n```bash\nmake install-local\n```\n\nTo try out the pre-installed `cpg-flow` in a Docker image, find more information in the **[Docker](#docker)** section.\n\n## <a name=\"build\">\ud83d\ude80 Build</a>\n\nTo build the project, run the following command:\n\n```bash\nmake build\n```\n\nTo make sure that you're actually using the installed build we suggest calling the following to install the build wheel.\n\n```bash\nmake install-build\n```\n\n## <a name=\"usage\">\ud83e\udd16 Usage</a>\n\nThis project provides the framework to construct pipelines but does not offer hosting the logic of any pipelines themselves. This approach offers the benefit of making all components more modular, manageable and decoupled. Pipelines themselves are hosted in a separate repository.\n\nThe [test_workflows_shared repository](https://github.com/populationgenomics/test_workflows_shared) acts as a template and demonstrates how to structure a pipeline using CPG Flow.\n\nThe components required to build pipelines with CPG Flow:\n\n### config `.toml` file\n\nThis file contains the configuration settings to your pipeline. This file allows the pipeline developer to define settings such as:\n\n1. what stages will be run or skipped\n2. what dataset to use\n3. what access level to use\n4. any input cohorts\n5. sequencing type\n\n```toml\n[workflow]\ndataset = 'fewgenomes'\n\n# Note: for fewgenomes and sandbox mentioning datasets by name is not a security risk\n# DO NOT DO THIS FOR OTHER DATASETS\n\ninput_cohorts = ['COH2142']\naccess_level = 'test'\n\n# Force stage rerun\nforce_stages = [\n    'GeneratePrimes', # the first stage\n    'CumulativeCalc', # the second stage\n    'FilterEvens', # the third stage\n    'BuildAPrimePyramid', # the last stage\n]\n\n# Show a workflow graph locally or save to web bucket.\n# Default is false, set to true to show the workflow graph.\nshow_workflow = true\n# ...\n```\n\nFor a full list of supported config options with documentation, see [defaults.toml](src/cpg_flow/defaults.toml)\n\nThis `.toml` file will be may be named anything, as long as it is correctly passed to the `analysis-runner` invocation. The `analysis-runner` supplies its own default settings, and combines it with the settings from this file, before submitting a job.\n\n### `main.py` or equivalent entrypoint for the pipeline\n\nThis file would store the workflow definition as a list of stages, and then run said workflow:\n\n```python\n import os\n from pathlib import Path\n from cpg_flow.workflow import run_workflow\n from cpg_utils.config import set_config_paths\n from stages import BuildAPrimePyramid, CumulativeCalc, FilterEvens, GeneratePrimes\n\n CONFIG_FILE = str(Path(__file__).parent / '<YOUR_CONFIG>.toml')\n\n def run_cpg_flow(dry_run=False):\n\n    #See the 'Key Considerations and Limitations' section for notes on the definition of the `workflow` variable.\n\n    # This represents the flow of the DAG\n     workflow = [GeneratePrimes, CumulativeCalc, FilterEvens, BuildAPrimePyramid]\n\n     config_paths = os.environ['CPG_CONFIG_PATH'].split(',')\n\n     # Inserting after the \"defaults\" config, but before user configs:\n     set_config_paths(config_paths[:1] + [CONFIG_FILE] + config_paths[1:])\n     run_workflow(stages=workflow, dry_run=dry_run)\n\n if __name__ == '__main__':\n   run_cpg_flow()\n```\n\n  The workflow definition here forms a DAG (Directed Acyclic Graph) structure.\n\n  ![DAG](assets/newplot.png)\n\n  > To generate a plot of the DAG, `show_workflow = True` should be included in the config. The DAG plot generated from the pipeline definition is available in the logs via the job URL. To find the link to the plot, search the *Logs* section for the string: \"**INFO - Link to the graph:**\".\n\n  There are some key considerations and limitations to take into account when designing the DAG:\n\n  - [No Forward Discovery](#no-forward-discovery)\n  - [Workflow Definition](#workflow-definition)\n\n### `stages.py` or equivalent file(s) for the `Stage` definitions\n\nA `Stage` represents a node in the DAG. The stages can be abstracted from either a `DatasetStage`, `CohortStage`, `MultiCohortStage`, or a `SequencingGroupStage`.\n\nThe stage definition should use the `@stage` decorator to ***optionally*** set:\n\n- dependent stages (this is used to build the DAG)\n- analysis keys (this determines what outputs should be written to metamist)\n- the analysis type (this determines the analysis-type to be written to metamist)\n\nAll stages require an `expected_outputs` class method definition, that sets the expected output path location for a given `Target` such as a `SequencingGroup`, `Dataset`, `Cohort`, or `MultiCohort`.\n\nAlso required, is a `queue_jobs` class method definition that calls pipeline jobs, and stores the results of these jobs to the paths defined in `expected_outputs`.\n\nIt is good practice to separate the `Stage` definitions into their own files, to keep the code compact, and manageable.\n\n```python\nfrom cpg_flow.stage import SequencingGroupStage, StageInput, StageOutput, stage\nfrom cpg_flow.targets.sequencing_group import SequencingGroup\nfrom jobs import cumulative_calc\n\nWORKFLOW_FOLDER = 'prime_pyramid'\n\n# ...\n# This stage depends on the `GeneratePrimes` stage, and requires outputs from that stage.\n@stage(required_stages=[GeneratePrimes], analysis_keys=['cumulative'], analysis_type='custom')\nclass CumulativeCalc(SequencingGroupStage):\n def expected_outputs(self, sequencing_group: SequencingGroup):\n     return {\n         'cumulative': sequencing_group.dataset.prefix() / WORKFLOW_FOLDER / f'{sequencing_group.id}_cumulative.txt',\n     }\n\n def queue_jobs(self, sequencing_group: SequencingGroup, inputs: StageInput) -> StageOutput | None:\n     input_txt = inputs.as_path(sequencing_group, GeneratePrimes, 'primes')\n     b = get_batch()\n\n     cumulative_calc_output_path = str(self.expected_outputs(sequencing_group).get('cumulative', ''))\n\n     # We define a job instance from the `cumulative_calc` job definition.\n     job_cumulative_calc = cumulative_calc(b, sequencing_group, input_txt, cumulative_calc_output_path)\n\n     jobs = [job_cumulative_calc]\n\n     return self.make_outputs(\n         sequencing_group,\n         data=self.expected_outputs(sequencing_group),\n         jobs=jobs,\n     )\n# ...\n```\n\nThere is a key consideration to take into account when writing the stages:\n\n- [No Forward Discovery](#no-forward-discovery)\n\n### `jobs.py` or equivalent file for `Job` definitions\n\nEvery `Stage` requires a collection of jobs that will be executed within. It is good practice to store these jobs in their own files, as the definitions can often get long.\n\n```python\nfrom cpg_flow.targets.sequencing_group import SequencingGroup\nfrom hailtop.batch import Batch\nfrom hailtop.batch.job import Job\n\n\ndef cumulative_calc(\n    b: Batch,\n    sequencing_group: SequencingGroup,\n    input_file_path: str,\n    output_file_path: str,\n) -> list[Job]:\n    title = f'Cumulative Calc: {sequencing_group.id}'\n    job = b.new_job(name=title)\n    primes_path = b.read_input(input_file_path)\n\n    cmd = f\"\"\"\n    primes=($(cat {primes_path}))\n    csum=0\n    cumulative=()\n    for prime in \"${{primes[@]}}\"; do\n        ((csum += prime))\n        cumulative+=(\"$csum\")\n    done\n    echo \"${{cumulative[@]}}\" > {job.cumulative}\n    \"\"\"\n\n    job.command(cmd)\n\n    print('-----PRINT CUMULATIVE-----')\n    print(output_file_path)\n    b.write_output(job.cumulative, output_file_path)\n\n    return job\n```\n\nOnce these required components are written, the pipeline is ready to be executed against this framework.\n\n### Running the pipeline\n\nAll pipelines can only be exclusively run using the [`analysis-runner` package](https://pypi.org/project/analysis-runner/) which grants the user appropriate permissions based on the dataset and access level defined above. `analysis-runner` requires a repo, commit and the entrypoint file, and then runs the code inside a \"driver\" image on Hail Batch, logging the invocation to `metamist` for future audit and reproducibility.\n\nTherefore, the pipeline code needs to be pushed to a remote version control system, for `analysis-runner` to be able to pull it for execution. A job can then be submitted:\n\n```shell\nanalysis-runner \\\n  --image \"australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:1.0.0\" \\\n  --dataset \"fewgenomes\" \\\n  --description \"cpg-flow_test\" \\\n  --access-level \"test\" \\\n  --output-dir \"cpg-flow_test\" \\\n  --config \"<YOUR_CONFIG>.toml\" \\\n  workflow.py\n```\n\nIf the job is successfully created, the analysis-runner output will include a job URL. This driver job will trigger additional jobs, which can be monitored via the `/batches` page on Hail. Monitoring these jobs helps verify that the workflow ran successfully. When all expected jobs complete without errors, this confirms the successful execution of the workflow and indicates that the `cpg_flow` package is functioning as intended.\n\nSee the [Docker](#docker) section for instruction on pulling valid images releases.\n\n## <a name=\"key-considerations-and-limitations\">\ud83d\ude35\u200d\ud83d\udcab Key Considerations and Limitations</a>\n\n### No Forward Discovery\n\n The framework exclusively relies on backward traversal. If a stage is not explicitly or indirectly linked to one of the final stages through the `required_stages` parameter of the `@stage` decorator, it will not be included in the workflow. In other words, stages that are not reachable from a final stage are effectively ignored. This backward discovery approach ensures that only the stages directly required for the specified final stages are included, optimizing the workflow by excluding irrelevant or unused stages.\n\n### Workflow Definition\n\nThe workflow definition serves as a lookup table for the final stages. If a final stage is not listed in this definition, it will not be part of the workflow, as there is no mechanism for forward discovery to identify it.\n\n```python\nworkflow = [GeneratePrimes, CumulativeCalc, FilterEvens, BuildAPrimePyramid]\n```\n\n### Config Settings for `expected_outputs`\n\nThe `expected_outputs` method is called for every stage in the workflow, even if the `config.toml` configures the stage to be skipped. This ensures that the workflow can validate or reference the expected outputs of all stages.\n\nSince this method may depend on workflow-specific configuration settings, these settings must be present in the workflow configuration, regardless of whether the stage will run. To avoid issues, it is common practice to include dummy values for such settings in the default configuration. This is not the intended behaviour and is marked as an area of improvement in a future release.\n\n### Verifying results of `expected_outputs`\n\nThe API uses the results of the `expected_outputs` method to determine whether a stage needs to run. A stage is scheduled for execution only if one or more Path objects returned by `expected_outputs` do not exist in Google Cloud Platform (GCP). If a returned Path object exists, the stage is considered to have already run successfully, and is therefore skipped.\n\nFor outputs such as Matrix Tables (.mt), Hail Tables (.ht), or Variant Datasets (.vds), which are complex structures of thousands of files, the check is performed on the `object/_SUCCESS` file to verify that the output was written completely. However, it has been observed that the `object/_SUCCESS` file may be written multiple times during processing, contrary to the expectation that it should only be written once after all associated files have been fully processed.\n\n### `String` outputs from `expected_outputs`\n\nString outputs from the `expected_outputs` method are not checked by the API. This is because string outputs cannot reliably be assumed to represent valid file paths and may instead correspond to other forms of outputs.\n\n### Behavior of `queue_jobs` in relation to `expected_outputs`\n\nWhen the `expected_outputs` check determines that one or more required files do not exist, and the stage is not configured to be skipped, the `queue_jobs` method is invoked to define the specific work that needs to be scheduled in the workflow.\n\nThe `queue_jobs` method runs within the driver image, before any jobs in the workflow are executed. Because of this, it cannot access or read files generated by earlier stages, as those outputs have not yet been created. The actual outputs from earlier jobs only become available as the jobs are executed during runtime.\n\n### Explicit dependency between all jobs from `queue_jobs`\n\nWhen the `queue_jobs` method schedules a collection of jobs to Hail Batch, one or more jobs are returned from the method, and the framework sets an explicit dependency between *these* jobs, and all jobs from the `Stages` set in the `required_stages` parameter. Therefore, all jobs that run in a Stage must be returned within `queue_jobs` to ensure no jobs start out of sequence. As an example:\n\n```python\n# test_workflows_shared/cpg_flow_test/jobs/filter_evens.py\ndef filter_evens(\n    b: Batch,\n    inputs: StageInput,\n    previous_stage: Stage,\n    sequencing_groups: list[SequencingGroup],\n    input_files: dict[str, dict[str, Any]],\n    sg_outputs: dict[str, dict[str, Any]],\n    output_file_path: str,\n) -> list[Job]:\n    title = 'Filter Evens'\n\n    # Compute the no evens list for each sequencing group\n    sg_jobs = []\n    sg_output_files = []\n    for sg in sequencing_groups:  # type: ignore\n        job = b.new_job(name=title + ': ' + sg.id)\n        ...\n\n        cmd = f\"\"\"\n        ...\n        \"\"\"\n\n        job.command(cmd)\n        b.write_output(job.sg_no_evens_file, no_evens_output_file_path)\n        sg_jobs.append(job)\n\n    # Merge the no evens lists for all sequencing groups into a single file\n    job = b.new_job(name=title)\n    job.depends_on(*sg_jobs)\n    inputs = ' '.join([b.read_input(f) for f in sg_output_files])\n    job.command(f'cat {inputs} >> {job.no_evens_file}')\n    b.write_output(job.no_evens_file, output_file_path)\n\n    # ALL jobs are returned back to `queue_jobs`\n    # including new jobs created within this job.\n    all_jobs = [job, *sg_jobs]\n    return all_jobs\n```\n\n## <a name=\"docker\">\ud83d\udc33 Docker</a>\n\n\n## Docker Image Usage for cpg-flow Python Package\n\n### Pulling and Using the Docker Image\n\nThese steps are restricted to CPG members only. Anyone will have access to the code in this public repositry and can build a version of cpg-flow themselves. The following requires authentication with the CPG's GCP.\n\nTo pull and use the Docker image for the `cpg-flow` Python package, follow these steps:\n\n1. **Authenticate with Google Cloud Registry**:\n\n    ```sh\n    gcloud auth configure-docker australia-southeast1-docker.pkg.dev\n    ```\n\n2. **Pull the Docker Image**:\n    - For alpha releases:\n\n      ```sh\n      docker pull australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:0.1.0-alpha.11\n      ```\n\n    - For main releases:\n\n      ```sh\n      docker pull australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:1.0.0\n      ```\n\n3. **Run the Docker Container**:\n\n    ```sh\n    docker run -it australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow:<tag>\n    ```\n\n### Temporary Images for Development\n\nTemporary images are created for each commit and expire in 30 days. These images are useful for development and testing purposes.\n\n- Example of pulling a temporary image:\n\n  ```sh\n  docker pull australia-southeast1-docker.pkg.dev/cpg-common/images-tmp/cpg_flow:991cf5783d7d35dee56a7ab0452d54e69c695c4e\n  ```\n\n### Accessing Build Images for CPG Members\n\nMembers of the CPG can find the build images in the Google Cloud Registry under the following paths:\n\n- Alpha and main releases: `australia-southeast1-docker.pkg.dev/cpg-common/images/cpg_flow`\n- Temporary images: `australia-southeast1-docker.pkg.dev/cpg-common/images-tmp/cpg_flow`\n\nEnsure you have the necessary permissions and are authenticated with Google Cloud to access these images.\n\n### <a name=\"tests\">\ud83e\uddea Unit and E2E tests</a>\n\n#### Unit Tests\n\nUnit tests are run in the [Test CI workflow](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml) for each branch.\n\n#### E2E Test\n\nWe recommend frequently running the manual test workflow found in [test_workflows_shared](https://github.com/populationgenomics/test_workflows_shared)  specifically the `cpg_flow_test` workflow during development to ensure updates work with the CPG production environment.\n\nDocummentation for running the tests are found in the repository readme.\n\n\n### \u25b6\ufe0f Commands\n\nBefore testing, you must follow the **[installation steps](#installation)**.\n\n## <a name=\"code-analysis-and-consistency\">\u2611\ufe0f Code analysis and consistency</a>\n\n### \ud83d\udd0d Code linting & formatting\n\n![Precommit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)\n\nIn order to keep the code clean, consistent and free of bad python practices, more than **Over 10 pre-commit hooks are enabled** !\n\nComplete list of all enabled rules is available in the **[.pre-commit-config.yaml file](https://github.com/populationgenomics/cpg-flow/blob/main/.pre-commit-config.yaml)**.\n\n### \u25b6\ufe0f Commands\n\nBefore linting, you must follow the [installation steps](#installation).\n\nThen, run the following command\n\n```bash\n# Lint\npre-commit run --all-files\n```\n\nWhen setting up local linting for development you can also run the following once:\n\n```bash\n# Install the pre-commit hook\npre-commit install\n\n# Or equivalently\nmake init || make init-dev\n```\n\n### \ud83e\udd47 Project quality scanner\n\nMultiple tools are set up to maintain the best code quality and to prevent vulnerabilities:\n\n![SonarQube](https://img.shields.io/badge/-SonarQube-black?style=for-the-badge&logoColor=white&logo=sonarqube&color=4E9BCD)\n\nSonarQube summary is available **[here](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)**.\n\n[![Coverage](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=coverage&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Duplicated Lines (%)](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=duplicated_lines_density&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Quality Gate Status](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=alert_status&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n\n[![Technical Debt](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=sqale_index&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Vulnerabilities](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=vulnerabilities&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Code Smells](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=code_smells&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n\n[![Reliability Rating](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=reliability_rating&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Security Rating](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=security_rating&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n[![Bugs](https://sonarqube.populationgenomics.org.au/api/project_badges/measure?project=populationgenomics_cpg-flow&metric=bugs&token=sqb_bd2c5ce00628492c0af714f727ef6f8e939d235c)](https://sonarqube.populationgenomics.org.au/dashboard?id=populationgenomics_cpg-flow)\n\n\n## <a name=\"versions\">\ud83d\udcc8 Releases & Changelog</a>\n\nReleases on **main** branch are generated and published automatically,\npre-releases on the **alpha** branch are also generated and published by:\n\n![Semantic Release](https://img.shields.io/badge/-Semantic%20Release-black?style=for-the-badge&logoColor=white&logo=semantic-release&color=000000)\n\nIt uses the **[conventional commit](https://www.conventionalcommits.org/en/v1.0.0/)** strategy.\n\nThis is enforced using the **[commitlint](https://github.com/opensource-nepal/commitlint)** pre-commit hook that checks commit messages conform to the conventional commit standard.\n\nWe recommend installing and using the tool **[commitizen](https://commitizen-tools.github.io/commitizen/) in order to create commit messages. Once installed, you can use either `cz commit` or `git cz` to create a commitizen generated commit message.\n\nEach change when a new release comes up is listed in the **<a href=\"https://github.com/populationgenomics/cpg-flow/blob/main/CHANGELOG.md\" target=\"_blank\">CHANGELOG.md file</a>**.\n\nAlso, you can keep up with changes by watching releases via the **Watch GitHub button** at the top of this page.\n\n#### \ud83c\udff7\ufe0f <a href=\"https://github.com/populationgenomics/cpg-flow/releases\" target=\"_blank\">All releases for this project are available here</a>.\n\n## <a name=\"github-actions\">\ud83c\udfac GitHub Actions</a>\n\nThis project uses **GitHub Actions** to automate some boring tasks.\n\nYou can find all the workflows in the **[.github/workflows directory](https://github.com/populationgenomics/cpg-flow/tree/main/.github/workflows).**\n\n### \ud83c\udfa2 Workflows\n\n|                                                   Name                                                   |                                                                                                                        Description & Status                                                                                                                         |                                    Triggered on                                     |\n| :------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------: |\n|        **[Docker](https://github.com/populationgenomics/cpg-flow/actions/workflows/docker.yaml)**        |             Builds and pushes Docker images for the project.<br/><br/>[![Docker](https://github.com/populationgenomics/cpg-flow/actions/workflows/docker.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/docker.yaml)             | `pull_request` on `main, alpha` and `push` on `main, alpha` and `workflow_dispatch` |\n|          **[Lint](https://github.com/populationgenomics/cpg-flow/actions/workflows/lint.yaml)**          |                  Runs linting checks using pre-commit hooks.<br/><br/>[![Lint](https://github.com/populationgenomics/cpg-flow/actions/workflows/lint.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/lint.yaml)                   |                                       `push`                                        |\n|       **[Package](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml)**       |  Packages the project and publishes it to PyPI and GitHub Releases.<br/><br/>[![Package](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/package.yaml)   |                               `push` on `main, alpha`                               |\n|      **[Renovate](https://github.com/populationgenomics/cpg-flow/actions/workflows/renovate.yaml)**      |               Runs Renovate to update dependencies.<br/><br/>[![Renovate](https://github.com/populationgenomics/cpg-flow/actions/workflows/renovate.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/renovate.yaml)                |                         `schedule` and `workflow_dispatch`                          |\n|  **[Security Checks](https://github.com/populationgenomics/cpg-flow/actions/workflows/security.yaml)**   |          Performs security checks using pip-audit.<br/><br/>[![Security Checks](https://github.com/populationgenomics/cpg-flow/actions/workflows/security.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/security.yaml)          |                           `workflow_dispatch` and `push`                            |\n|          **[Test](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml)**          |                Runs unit tests and generates coverage reports.<br/><br/>[![Test](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/test.yaml)                 |                                       `push`                                        |\n| **[Update Badges](https://github.com/populationgenomics/cpg-flow/actions/workflows/update-badges.yaml)** | Updates badges.yaml with test results and coverage.<br/><br/>[![Update Badges](https://github.com/populationgenomics/cpg-flow/actions/workflows/update-badges.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/update-badges.yaml) |                             `workflow_run` (completed)                              |\n|       **[mkdocs](https://github.com/populationgenomics/cpg-flow/actions/workflows/web-docs.yaml)**       |              Deploys API documentation to GitHub Pages.<br/><br/>[![mkdocs](https://github.com/populationgenomics/cpg-flow/actions/workflows/web-docs.yaml/badge.svg)](https://github.com/populationgenomics/cpg-flow/actions/workflows/web-docs.yaml)              |                                  `push` on `alpha`                                  |\n\n\n## <a name=\"license\">\u00a9\ufe0f License</a>\n\nThis project is licensed under the [MIT License](http://opensource.org/licenses/MIT).\n\n## <a name=\"contributors\">\u2764\ufe0f Contributors</a>\n\nThere is no contributor yet. Want to be the first ?\n\nIf you want to contribute to this project, please read the [**contribution guide**](https://github.com/populationgenomics/cpg-flow/blob/master/CONTRIBUTING.md).\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2022 Centre for Population Genomics  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "CPG Flow API for Hail Batch",
    "version": "0.1.2",
    "project_urls": {
        "Repository": "https://github.com/populationgenomics/cpg-flow"
    },
    "split_keywords": [
        "hail",
        " flow",
        " api",
        " bioinformatics",
        " genomics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "88996b2e8c2a13007cbf556656b10a3995663c6bb3b75d12351c5278b5ace841",
                "md5": "4d890af29afe36e8b5ace1fc59adc22e",
                "sha256": "9cb29663379ef145506c011363395d3208059aeb05ec0a7e08c3a21bb9dfbbc3"
            },
            "downloads": -1,
            "filename": "cpg_flow-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4d890af29afe36e8b5ace1fc59adc22e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.11,>=3.10",
            "size": 69928,
            "upload_time": "2025-02-05T03:10:32",
            "upload_time_iso_8601": "2025-02-05T03:10:32.779538Z",
            "url": "https://files.pythonhosted.org/packages/88/99/6b2e8c2a13007cbf556656b10a3995663c6bb3b75d12351c5278b5ace841/cpg_flow-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "59f81a78642c9dfd920b3629c96501c5b69b418cbcf0b0fe3ea7ab46b5b90cbc",
                "md5": "312e6e81b8d6bc3694a8cf67fb00b828",
                "sha256": "ffec1fe18b0315dabd800f85b7dce7bef5fec37d131859f39d712ecc332d5201"
            },
            "downloads": -1,
            "filename": "cpg_flow-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "312e6e81b8d6bc3694a8cf67fb00b828",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.11,>=3.10",
            "size": 1878967,
            "upload_time": "2025-02-05T03:10:35",
            "upload_time_iso_8601": "2025-02-05T03:10:35.047275Z",
            "url": "https://files.pythonhosted.org/packages/59/f8/1a78642c9dfd920b3629c96501c5b69b418cbcf0b0fe3ea7ab46b5b90cbc/cpg_flow-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-05 03:10:35",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "populationgenomics",
    "github_project": "cpg-flow",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "cpg-flow"
}

None