hybracter

- Name: hybracter
- Version: 0.11.1
- Home page: https://github.com/gbouras13/hybracter
- Summary: An automated long-read first bacterial genome assembly pipeline.
- Upload time: 2025-01-21 00:10:54
- Author: George Bouras
- Requires Python: >=3.9

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gbouras13/hybracter/blob/main/run_hybracter.ipynb)

[![Paper](https://img.shields.io/badge/paper-Microbial_Genomics-green.svg?style=flat-square&maxAge=3600)](https://doi.org/10.1099/mgen.0.001244)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![](https://img.shields.io/static/v1?label=CLI&message=Snaketool&color=blueviolet)](https://github.com/beardymcjohnface/Snaketool)
![GitHub last commit (branch)](https://img.shields.io/github/last-commit/gbouras13/hybracter/dev?color=8a35da)
[![Code DOI](https://zenodo.org/badge/574521745.svg)](https://zenodo.org/badge/latestdoi/574521745)

[![Anaconda-Server Badge](https://anaconda.org/bioconda/hybracter/badges/version.svg)](https://anaconda.org/bioconda/hybracter)
[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/hybracter)](https://img.shields.io/conda/dn/bioconda/hybracter)
[![PyPI version](https://badge.fury.io/py/hybracter.svg)](https://badge.fury.io/py/hybracter)
[![Downloads](https://static.pepy.tech/badge/hybracter)](https://pepy.tech/project/hybracter)

# Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies

`hybracter` is an automated long-read first bacterial genome assembly tool implemented in Snakemake using [Snaketool](https://github.com/beardymcjohnface/Snaketool). 

## Table of Contents

- [Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies](#hybracter-enabling-scalable-automated-complete-and-accurate-bacterial-genome-assemblies)
  - [Table of Contents](#table-of-contents)
  - [Quick Start](#quick-start)
    - [Conda](#conda)
    - [Container](#container)
    - [Google Colab Notebooks](#google-colab-notebooks)
  - [Documentation](#documentation)
  - [Manuscript](#manuscript)
  - [Description](#description)
  - [Pipeline](#pipeline)
  - [Benchmarking](#benchmarking)
  - [Recent Updates](#recent-updates)
    - [v0.11.0 Updates (4 December 2024)](#v0110-updates-4-december-2024)
    - [v0.10.0 Updates (17 October 2024)](#v0100-updates-17-october-2024)
    - [v0.9.0 Updates (18 September 2024)](#v090-updates-18-september-2024)
  - [Why Would You Run Hybracter?](#why-would-you-run-hybracter)
  - [Other Options](#other-options)
      - [Trycycler](#trycycler)
      - [Dragonflye](#dragonflye)
  - [Installation](#installation)
    - [Conda](#conda-1)
    - [Pip](#pip)
    - [Source](#source)
  - [Main Commands](#main-commands)
  - [Input csv](#input-csv)
      - [`hybracter hybrid`](#hybracter-hybrid)
      - [`hybracter long`](#hybracter-long)
  - [Usage](#usage)
      - [`hybracter install`](#hybracter-install)
      - [Installing Dependencies](#installing-dependencies)
      - [`hybracter hybrid`](#hybracter-hybrid-1)
      - [`hybracter hybrid-single`](#hybracter-hybrid-single)
      - [`hybracter long`](#hybracter-long-1)
      - [`hybracter long-single`](#hybracter-long-single)
  - [Outputs](#outputs)
    - [Main Output Files](#main-output-files)
  - [Snakemake Profiles](#snakemake-profiles)
  - [Advanced Configuration](#advanced-configuration)
  - [Older Updates](#older-updates)
    - [v0.7.0 Updates (04 March 2024)](#v070-updates-04-march-2024)
    - [v0.5.0 Updates (08 January 2024)](#v050-updates-08-january-2024)
    - [v0.4.0 Updates (14 November 2023)](#v040-updates-14-november-2023)
    - [v0.2.0 Updates 26 October 2023 - Medaka, Polishing and `--no_medaka`](#v020-updates-26-october-2023---medaka-polishing-and---no_medaka)
  - [Version Log](#version-log)
  - [System](#system)
  - [Bugs and Suggestions](#bugs-and-suggestions)
- [Citation](#citation)


## Quick Start

### Conda

`hybracter` is available to install with `pip` or `conda`.

You will need conda or mamba available so `hybracter` can install all the required dependencies. 

Therefore, it is recommended to install `hybracter` into a conda environment as follows.

```bash
conda create -n hybracterENV -c bioconda -c conda-forge  hybracter
conda activate hybracterENV
hybracter --help
hybracter install
```

Miniforge is **highly highly** recommended. Please see the [documentation](https://hybracter.readthedocs.io/en/latest/install/) for more details on how to install Miniforge.

When you run `hybracter` for the first time, all the required dependencies will be installed as needed, so it will take longer than usual (usually a few minutes). Every run after that will be a lot faster, as the dependencies will already be installed.

If you intend to run hybracter offline (e.g. on HPC nodes with no access to the internet), I highly recommend running `hybracter test-hybrid` and/or `hybracter test-long` on a node with internet access so hybracter can download the required dependencies. It should take 5-10 minutes. If your computer/node has internet access, please skip this step.

```
hybracter test-hybrid --threads 8
hybracter test-long --threads 8
```

* Note: if you are installing Hybracter on a mac, please use `--mac` - this will install Medaka v1.8 (not v2, which is not available for MacOS). Alternatively, if you want Medaka v2, you should try the container option with Docker.

### Container

Alternatively, a Docker/Singularity Linux container image is available for Hybracter (starting from v0.7.1) [here](https://quay.io/repository/gbouras13/hybracter). This will likely be useful for running Hybracter in HPC environments.

* **Note** the container image comes with the database and all environments installed - there is no need to run `hybracter install` or `hybracter test-hybrid`/`hybracter test-long` or to specify a database directory with `-d`.

To install and run v0.11.0 with singularity

```bash

IMAGE_DIR="<the directory you want the .sif file to be in>"
singularity pull --dir "$IMAGE_DIR" docker://quay.io/gbouras13/hybracter:0.11.0

containerImage="$IMAGE_DIR/hybracter_0.11.0.sif"

# example command with test fastqs
singularity exec "$containerImage" hybracter hybrid-single -l test_data/Fastqs/test_long_reads.fastq.gz \
  -1 test_data/Fastqs/test_short_reads_R1.fastq.gz -2 test_data/Fastqs/test_short_reads_R2.fastq.gz \
  -o output_test_singularity -t 4 --auto
```

To install and run v0.11.0 with Docker (recommended if you have a Mac as it has Medaka v2)

```
docker pull quay.io/gbouras13/hybracter:0.11.0
docker run quay.io/gbouras13/hybracter:0.11.0  hybracter -h
# -v mounts directories from your local filesystem to the docker container
docker run --rm -v /path/to/my/test/fastqs:/data -v /path/to/where/i/want/the/output:/output quay.io/gbouras13/hybracter:0.11.0 hybracter hybrid-single \
  -l /data/test_long_reads.fastq.gz \
  -1 /data/test_short_reads_R1.fastq.gz \
  -2 /data/test_short_reads_R2.fastq.gz \
  -o /output/output_test_docker -t 4 --auto
```


### Google Colab Notebooks

If you don't want to install `hybracter` locally, you can run it without any code using the colab notebook [https://colab.research.google.com/github/gbouras13/hybracter/blob/main/run_hybracter.ipynb](https://colab.research.google.com/github/gbouras13/hybracter/blob/main/run_hybracter.ipynb)

This is only recommended if you have one or a few samples to assemble (it takes a while per sample due to the limited nature of Google Colab resources - probably an hour or two a sample). If you have more than this, a local install as described below is suggested.

## Documentation

Documentation for `hybracter` is available [here](https://hybracter.readthedocs.io/en/latest/).

## Manuscript 

`hybracter` has recently been published in _Microbial Genomics_

* George Bouras, Ghais Houtak, Ryan R Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M Judd, Anna E Sheppard, Robert A Edwards, Sarah Vreugde - Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies. (2024) _Microbial Genomics_ doi: https://doi.org/10.1099/mgen.0.001244.

## Description

`hybracter` is designed for assembling bacterial isolate genomes using a long read first assembly approach. 
It scales massively using the embarrassingly parallel power of HPC and Snakemake profiles. It is designed for applications where you have isolates with Oxford Nanopore Technologies (ONT) long reads and optionally matched paired-end short reads for polishing.

`hybracter` is designed to straddle the fine line between being as fully feature-rich as possible with as much information as you need to decide upon the best assembly, while also being a one-line automated program. In other words, as awesome as Unicycler, but updated for 2023. Perfect for lazy people like myself.

`hybracter` is largely based off Ryan Wick's [magnificent tutorial](https://github.com/rrwick/Perfect-bacterial-genome-tutorial) and associated [paper](https://doi.org/10.1371/journal.pcbi.1010905). `hybracter` differs in that it adds some additional steps regarding targeted plasmid assembly with [plassembler](https://github.com/gbouras13/plassembler), contig reorientation with [dnaapler](https://github.com/gbouras13/dnaapler) and extra polishing and statistical summaries.

Note: if you have PacBio reads, as of 2023, you can run `hybracter long` with `--no_medaka` to turn off polishing, and `--flyeModel pacbio-hifi`. You can also probably just run [Flye](https://github.com/fenderglass/Flye) or [Dragonflye](https://github.com/rpetit3/dragonflye) (or of course [Trycycler](https://github.com/rrwick/Trycycler)) and reorient the contigs with [dnaapler](https://github.com/gbouras13/dnaapler) without polishing. See Ryan Wick's [blogpost](https://doi.org/10.5281/zenodo.7703461) for more details.

## Pipeline

<p align="center">
  <img src="img/hybracter.png" alt="Hybracter" height=600>
</p>

- A. Reads are quality controlled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami).
- B. Long-read assembly is conducted with [Flye](https://github.com/fenderglass/Flye). Each sample is classified according to whether the chromosome(s) were assembled (marked as 'complete') or not (marked as 'incomplete'), based on the given minimum chromosome length.
- C. For complete isolates, plasmid recovery with [Plassembler](https://github.com/gbouras13/plassembler).
- D. For all isolates, long read polishing with [Medaka](https://github.com/nanoporetech/medaka).
- E. For complete isolates, the chromosome is reorientated to begin with the dnaA gene with [dnaapler](https://github.com/gbouras13/dnaapler).
- F. For all isolates, if short reads are provided, short read polishing with [Polypolish](https://github.com/rrwick/Polypolish) and [pypolca](https://github.com/gbouras13/pypolca).
- G. For all isolates, assessment of all assemblies with [ALE](https://github.com/sc932/ALE) for `hybracter hybrid` or [Pyrodigal](https://github.com/althonos/pyrodigal) for `hybracter long`.
- H. The best assembly is selected and output along with final assembly statistics.

## Benchmarking

`hybracter` was benchmarked in both hybrid and long modes (specifically using the `hybrid-single` and `long-single` commands) against [Unicycler](https://github.com/rrwick/Unicycler) v0.5.0 and [Dragonflye](https://github.com/rpetit3/dragonflye) v1.1.2.

30 samples from 5 studies with available reference genomes were benchmarked. You can see the full explanation and results [here](https://hybracter.readthedocs.io/en/latest/benchmarking/). You can find all the output [here](https://doi.org/10.5281/zenodo.10906937).

To summarise the conclusions:

* `Hybracter hybrid` was superior to Unicycler in terms of accuracy, time taken and (slightly) in terms of plasmid recovery. It should be preferred to Unicycler.
* You should use `hybracter long` if you care about plasmids and have only long reads. It performs similarly to hybrid methods and its inclusion of [Plassembler](https://github.com/gbouras13/plassembler) largely seems to solve the [problem of long read assemblers recovering small plasmids](https://doi.org/10.1099/mgen.0.001024).
* `Hybracter` in both modes is inferior to Dragonflye in terms of time though better in terms of chromosome accuracy. 
* If you want the fastest possible chromosome assemblies for applications like species ID or sequence typing that retain a high level of accuracy, Dragonflye is a good option.
* Dragonflye should not be used if you care about recovering plasmids.

## Recent Updates

### v0.11.0 Updates (4 December 2024)

* Replaces [kmc](https://github.com/refresh-bio/KMC) with [lrge](https://github.com/mbhall88/lrge) when using `--auto`, a much faster tool designed for the purpose of estimating genome size from long reads. It is very very fast and robust. 
    * If your input has more than 5000 long reads (it should!), [lrge](https://github.com/mbhall88/lrge) will run with default settings. If it has fewer, lrge will run a (slightly) more computationally expensive all-vs-all mode with all input reads. In practice, if you have such low read counts, you should treat all downstream analysis (including lrge and hybracter) with a lot of caution anyway.
    * According to the [preprint](https://www.biorxiv.org/content/10.1101/2024.11.27.625777v1) (and my less exhaustive testing), lrge is more accurate and much faster than kmc, but I would still be careful using it on data with quality lower than Q15.
* Nothing else changes: the estimated chromosome size used by Hybracter will still be 80% of the estimate, as it needs to account for plasmids.
* Adds `r1041_e82_400bps_bacterial_methylation` as an option for `--medakaModel` thanks to [this issue](https://github.com/gbouras13/hybracter/issues/108).
* Note this won't work if you run `hybracter` on a Mac (as medaka v2 is not available)

### v0.10.0 Updates (17 October 2024)

* Updates Medaka to v2.0.1, implementing the `--bacteria` option by default.
* This is based on the recommendations of Ryan Wick [here](https://rrwick.github.io/2024/10/17/medaka-v2.html) who found it improved assemblies due to (likely) enhanced methylation error correction.
* If you still want to specify a Medaka model, the flag `--medaka_override` has been added. You need to include this along with your model via `--medakaModel`. This is most likely useful for older R9 data.
* Adds `--extra_params_flye` parameter if you want to specify extra commands for the Flye assembly step.

### v0.9.0 Updates (18 September 2024)

**`--auto` for automatic estimation of chromosome size**

**Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), `--auto` is _not_ recommended. Users have reported that it can tend to overestimate the chromosome size as more erroneous 21-mers will be counted by kmc than expected. Please specify a chromosome size for this type of data.**

* Thanks to an [issue](https://github.com/gbouras13/hybracter/issues/90) and code from @[richardstoeckl](https://github.com/richardstoeckl), Hybracter can now estimate the chromosome size for each sample when you pass `--auto`.
* The implementation uses [kmc](https://github.com/refresh-bio/KMC). Specifically, Hybracter uses kmc to count the number of unique 21-mers that appear at least 10 times in your long-read FASTQ file. This works because an assembly of length L contains at most (L - k) + 1 unique k-mers of size k, so when L >> k the k-mer count is a good estimate of the total assembly size.
* The estimated chromosome size used by Hybracter will actually be 80% of the number of 21-mers found at least 10 times, as it needs to account for plasmids
* If you aren't sure whether you have enough data for assembly (i.e. coverage lower than 20x), be careful using `--auto`, because the actual assembly size will tend to be larger than the number of unique 21mers found at least 10 times. Therefore, the estimated chromosome size will almost certainly be an underestimate and may lead to Hybracter considering your assembly "complete" when in fact it isn't.

* If you use `--auto`, you do not need to specify the chromosome length in the input. This means you don't need to pass `-c` with `long-single` or `hybrid-single`, and the input csv sample sheet does not need a column with the chromosome length.
  
e.g. for `hybracter long` you only need 2 columns with sample name and long-read FASTQ file path:

```bash
s_aureus_sample1,sample1_long_read.fastq.gz
p_aeruginosa_sample2,sample2_long_read.fastq.gz
```

and for `hybracter hybrid` you only need 4 columns with sample name, long-read FASTQ, and R1 and R2 short-read FASTQ file paths:

```bash
s_aureus_sample1,sample1_long_read.fastq.gz,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz
p_aeruginosa_sample2,sample2_long_read.fastq.gz,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz
```
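
To make the `--auto` arithmetic from the v0.9.0 notes concrete, here is a minimal sketch of the two calculations involved: the (L - k) + 1 bound on unique k-mers, and the 80% scaling that leaves room for plasmids. This is an illustration only, not Hybracter's own code; the function names and the 2.8 Mbp example are hypothetical.

```bash
#!/usr/bin/env bash
# Illustration of the --auto arithmetic; NOT hybracter's implementation.
# Function names and example numbers are hypothetical.

# Maximum number of unique k-mers in a sequence of length L: (L - k) + 1.
max_unique_kmers() {
  local length=$1 k=$2
  echo $(( length - k + 1 ))
}

# Chromosome size used downstream: 80% of the estimate, to account for plasmids.
auto_chromosome_size() {
  local estimate=$1
  echo $(( estimate * 80 / 100 ))
}

# For a 2.8 Mbp genome and k = 21, the k-mer count is almost exactly L.
kmers=$(max_unique_kmers 2800000 21)
echo "unique 21-mers:  ${kmers}"
echo "chromosome size: $(auto_chromosome_size "${kmers}")"
```

Because k is tiny compared with L, the k-mer count and the genome length differ by only 20 bp here, which is why the count can stand in for the assembly size.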

**Other changes**

* Hybracter v0.9.0 will automatically support the reorientation of archaeal chromosomes (thanks @[richardstoeckl](https://github.com/richardstoeckl)) to begin with the cog1474 Orc1/cdc6 gene.
* `--datadir` can now also accept 2 paths separated by a comma, if you have long reads and short reads in separate directories e.g. `--datadir "long_read_dir,short_read_dir"` (https://github.com/gbouras13/hybracter/issues/76).
* `--min_depth` parameter added. Hybracter will error out if your QC'd long reads have a coverage lower than `min_depth` for a sample (https://github.com/gbouras13/hybracter/issues/89).

## Why Would You Run Hybracter?

* If you want the best possible _automated_ long read only or hybrid bacterial isolate genome assembly.
* In other words, if you love Unicycler like I do, but want something faster and more accurate.
* If you need to assemble many (e.g. 10+) bacterial isolates as efficiently as possible.
* If you want as much information as possible from the assembly pipeline, such as whether your polishing probably improved the genome, whether your assembly was likely complete, and how many plasmids you probably assembled.

## Other Options

#### Trycycler

If you are looking for the best possible (manual) bacterial assembly for a single isolate, please definitely use [Trycycler](https://github.com/rrwick/Trycycler).

  * `hybracter` will almost certainly not give you better assemblies than Trycycler. Trycycler is the gold standard for a reason.
  * `hybracter` is automated, scalable, faster and requires less bioinformatics/microbial genomics expertise to run. 
  * If you use Trycycler, I would also highly recommend using (disclaimer: my own program) [plassembler](https://github.com/gbouras13/plassembler) (which is built into hybracter) alongside Trycycler to assemble small plasmids if you are especially interested in those, because long read only assemblies often [miss small plasmids](https://doi.org/10.1099/mgen.0.001024).

#### Dragonflye

[Dragonflye](https://github.com/rpetit3/dragonflye) by the awesome @[rpetit3](https://github.com/rpetit3) is a good alternative for automated assembly if `hybracter` doesn't fit your needs, particularly if you are familiar with [Shovill](https://github.com/tseemann/shovill). Some pros and cons between `hybracter` and `dragonflye` are listed below.

  * `dragonflye` allows for more options with regards to assemblers (it supports [Miniasm](https://github.com/lh3/miniasm) or [Raven](https://github.com/lbcb-sci/raven) as well as Flye).
  * On a single isolate, `dragonflye` should be faster.
  * `hybracter` should be more accurate, due to the extra round of polishing following reorientation, and integration of Plassembler.
  * `hybracter` has the advantage of scalability across multiple samples due to its Snakemake and Snaketool implementation. 
  * So if you have access to a cluster, `hybracter` is for you and likely faster.
  * `hybracter` gives more accurate plasmid assemblies because it uses [plassembler](https://github.com/gbouras13/plassembler)
  * `hybracter` will suggest automatically whether an assembly is 'complete' or 'incomplete'
  * `hybracter` will assess each polishing step and choose the genome most likely to be the best quality.

## Installation

You will need conda  (**highly recommended** through miniforge) to run `hybracter`, because it is required for the installation of each compartmentalised environment (e.g. Flye will have its own environment). Please see the [documentation](https://hybracter.readthedocs.io/en/latest/install/) for more details on how to install miniforge.

### Conda

`hybracter` is available to install with `conda`. To install `hybracter` into a conda environment called `hybracterENV`:

```bash
conda create -n hybracterENV hybracter
conda activate hybracterENV
hybracter --help
hybracter install
```

### Pip

`hybracter` is available to install with `pip`.

You will also need conda available so `hybracter` can install all the required dependencies. Therefore, it is recommended to install `hybracter` into a conda environment as follows.


```bash
conda create -n hybracterENV pip
conda activate hybracterENV
pip install hybracter
hybracter --help
hybracter install
```

### Source

Alternatively, the development version of `hybracter` (which may include new, untested features) can be installed manually via GitHub.

```bash
git clone https://github.com/gbouras13/hybracter.git
cd hybracter
pip install -e .
hybracter --help
```

## Main Commands

* `hybracter hybrid`: Assemble multiple genomes from isolates that have long-reads and paired-end short reads.
* `hybracter hybrid-single`: Assembles a single genome from an isolate with long-reads and paired-end short reads. It takes similar parameters to [Unicycler](https://github.com/rrwick/Unicycler).
* `hybracter long`: Assemble multiple genomes from isolates that have long-reads only.
* `hybracter long-single`: Assembles a single genome from an isolate with long-reads only.
* `hybracter install`: Downloads and installs the required `plassembler` database.

```bash
 _           _                    _            
| |__  _   _| |__  _ __ __ _  ___| |_ ___ _ __ 
| '_ \| | | | '_ \| '__/ _` |/ __| __/ _ \ '__|
| | | | |_| | |_) | | | (_| | (__| ||  __/ |   
|_| |_|\__, |_.__/|_|  \__,_|\___|\__\___|_|   
       |___/


Usage: hybracter [OPTIONS] COMMAND [ARGS]...

  For more options, run: hybracter command --help

Options:
  -h, --help  Show this message and exit.

Commands:
  install        Downloads and installs the plassembler database
  hybrid         Run hybracter with hybrid long and paired end short reads
  hybrid-single  Run hybracter hybrid on 1 isolate
  long           Run hybracter with only long reads
  long-single    Run hybracter long on 1 isolate
  test-hybrid    Test hybracter hybrid
  test-long      Test hybracter long
  config         Copy the system default config file
  citation       Print the citation(s) for hybracter
  version        Print the version for hybracter
```

## Input csv

`hybracter hybrid` and `hybracter long` require an input csv file to be specified with `--input`. No other inputs are required.

* This file requires no headers.
* Other than the reads, `hybracter` requires a lower bound on the chromosome length (the minimum chromosome length) for each isolate, in base pairs. It must be an integer.
* `hybracter` will denote contigs above this value as chromosome(s) and, if it can recover a chromosome, it will denote the isolate as complete.
* In practice, I suggest choosing 90% of the estimated chromosome size for this value.
* e.g. for _S. aureus_, I'd choose 2500000, _E. coli_, 4000000, _P. aeruginosa_ 5500000.
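
As a small worked example of the 90% rule of thumb above, the sketch below turns an estimated chromosome size into an integer minimum chromosome length. The helper name and the example estimates are hypothetical; substitute values appropriate for your own species.

```bash
#!/usr/bin/env bash
# Hypothetical helper: 90% of an estimated chromosome size, as an integer (bp).
min_chromosome_length() {
  local estimated_size=$1
  echo $(( estimated_size * 90 / 100 ))
}

# Example estimates (in bp) chosen for illustration only.
for estimate in 2800000 4600000 6300000; do
  echo "${estimate} bp estimate -> minimum chromosome length $(min_chromosome_length "${estimate}")"
done
```

Integer arithmetic keeps the result free of decimals, which matters because the value in the sample sheet must be an integer.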

#### `hybracter hybrid`

* `hybracter hybrid` requires an input csv file with 5 columns. 
* Each row is a sample.
* Column 1 is the sample name you want for this isolate. 
* Column 2 is the long read fastq file.
* Column 3 is the minimum chromosome length for that sample.
* Column 4 is the R1 short read fastq file
* Column 5 is the R2 short read fastq file.

e.g.

```bash
s_aureus_sample1,sample1_long_read.fastq.gz,2500000,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz
p_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz
```

**Using `--auto`**

* If you use `--auto`, you can remove the column with the chromosome length

e.g.

```bash
s_aureus_sample1,sample1_long_read.fastq.gz,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz
p_aeruginosa_sample2,sample2_long_read.fastq.gz,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz
```

#### `hybracter long`

`hybracter long` also requires an input csv with no headers, but with only 3 columns.

* Each row is a sample.
* Column 1 is the sample name you want for this isolate. 
* Column 2 is the long read fastq file.
* Column 3 is the minimum chromosome length for that sample.

e.g.

```bash
s_aureus_sample1,sample1_long_read.fastq.gz,2500000
p_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000
```

**Using `--auto`**

* If you use `--auto`, you can remove the column with the chromosome length

```bash
s_aureus_sample1,sample1_long_read.fastq.gz
p_aeruginosa_sample2,sample2_long_read.fastq.gz
```
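
Since the expected column count depends on the command and on `--auto`, it can be worth checking a sample sheet before launching a run. The sketch below is one way to do that with `awk`; `check_sheet` is a hypothetical helper, not part of `hybracter`, and it treats blank lines as errors.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of hybracter.
# Expected columns per row of the headerless csv:
#   hybracter hybrid: 5 (4 with --auto)
#   hybracter long:   3 (2 with --auto)
check_sheet() {
  local sheet=$1 expected=$2
  awk -F',' -v n="$expected" '
    # Report every row whose column count differs from the expected one.
    NF != n { printf "line %d: expected %d columns, found %d\n", NR, n, NF; bad = 1 }
    END { exit bad }
  ' "$sheet"
}

# Usage (returns non-zero and lists the offending lines on a mismatch):
#   check_sheet samples.csv 5 && hybracter hybrid -i samples.csv -o <output_dir> -t <threads>
```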

## Usage

#### `hybracter install`

You will first need to install the `hybracter` databases.

```bash
hybracter install
```

Alternatively, you can specify a particular directory to store them. You will then need to pass this directory with `-d <databases directory>` whenever you run `hybracter`.

```bash
hybracter install -d  <databases directory>
```

#### Installing Dependencies

**If you have internet access on the machine or node where you are running hybracter, you can skip this step.**

When you run `hybracter` for the first time, all the required dependencies will be installed as needed, so it will take longer than usual (usually a few minutes). Every run after that will be a lot faster, as the dependencies will already be installed.

If you intend to run hybracter offline (e.g. on HPC nodes with no access to the internet), I highly recommend running `hybracter test-hybrid` and/or `hybracter test-long` on a node with internet access so hybracter can download the required dependencies. It should take 5-10 minutes.

```bash
hybracter test-hybrid 
hybracter test-long
hybracter --help
```

Once that is done, run `hybracter hybrid` or `hybracter long` as follows.

#### `hybracter hybrid`

```bash
hybracter hybrid -i <input.csv> -o <output_dir> -t <threads> 
```

* `hybracter hybrid` requires only a CSV file specified with `-i` or `--input`
* `--no_pypolca` will turn off pypolca polishing.
* Use `--min_length` to specify the minimum long-read length for Filtlong.
* Use `--min_quality` to specify the minimum long-read quality for Filtlong.
* You can specify a FASTA file containing contaminants with `--contaminants`. All long reads that map to contaminants will be filtered out.
  * You can specify Escherichia phage lambda (a common contaminant in Nanopore library preparation) using `--contaminants lambda`.
* `--skip_qc` will skip all read QC steps.
* You can change the `--medakaModel` (all available options are listed in `hybracter hybrid -h`)
* You can change the `--flyeModel` (all available options are listed in `hybracter hybrid -h`)
* You can turn off Medaka polishing using `--no_medaka`
* You can turn off pypolca polishing using `--no_pypolca`
* Since v0.7.0, `hybracter hybrid` defaults to picking the last polishing round, i.e. `--logic last`. You can make it pick the best round according to ALE with `--logic best`, although this is no longer suggested.

#### `hybracter hybrid-single`

```bash
hybracter hybrid-single -l <longread FASTQ> -1 <R1 short reads FASTQ> -2 <R2 short reads FASTQ> -s <sample name> -c <chromosome size> -o <output_dir> -t <threads>  [other arguments]
```

#### `hybracter long`

```bash
hybracter long -i <input.csv> -o <output_dir> -t <threads> [other arguments]
```

* `hybracter long` requires only a CSV file specified with `-i` or `--input`
* Use `--min_length` to specify the minimum long-read length for Filtlong.
* Use `--min_quality` to specify the minimum long-read quality for Filtlong.
* You can specify a FASTA file containing contaminants with `--contaminants`. All long reads that map to contaminants will be filtered.
  * You can specify Escherichia phage lambda (a common contaminant in Nanopore library preparation) using `--contaminants lambda`.
* `--skip_qc` will skip all read QC steps.
* You can change the `--medakaModel` (all available options are listed in `hybracter long -h`)
* You can change the `--flyeModel` (all available options are listed in `hybracter long -h`)
* You can turn off Medaka polishing using `--no_medaka`
* You can force `hybracter` to pick the last polishing round (not the best according to pyrodigal mean CDS length) with `--logic last`. `hybracter` defaults to picking the best i.e. `--logic best`.


#### `hybracter long-single`

```bash
hybracter long-single -l <longread FASTQ> -s <sample name> -c <chromosome size>  -o <output_dir> -t <threads>  [other arguments]
```

## Outputs 

`hybracter` creates a number of output files in different formats. 

For more information about all possible file outputs, please see the [documentation](https://hybracter.readthedocs.io/en/latest/).

### Main Output Files

The main outputs are in the `FINAL_OUTPUT` directory.

This directory will include:

1. `hybracter_summary.tsv` file. This gives the summary statistics for your assemblies with the following columns:


| Sample | Complete (True or False) | Total_assembly_length | Number_of_contigs | Most_accurate_polishing_round | Longest_contig_length | Longest_contig_coverage | Number_circular_plasmids |
|--------|--------------------------|-----------------------|-------------------|-------------------------------|-----------------------|-------------------------|--------------------------|


2. `complete` and `incomplete` directories.

All samples that are denoted by hybracter to be complete will have 5 outputs in the `complete` directory:

   * `sample`_summary.tsv containing the summary statistics for that sample.
   * `sample`_per_contig_stats.tsv containing the contig names, lengths, GC% and whether the contig is circular.
   * `sample`_final.fasta containing the final assembly for that sample.
   * `sample`_chromosome.fasta containing only the final chromosome(s) assembly for that sample.
   * `sample`_plasmid.fasta containing only the final plasmid(s) assembly for that sample. Note this may be empty. If this is empty, then that sample had no plasmids. 

All samples that are denoted by hybracter to be incomplete will have 3 outputs in the `incomplete` directory:

   * `sample`_summary.tsv containing the summary statistics for that sample.
   * `sample`_per_contig_stats.tsv containing the contig names, lengths, GC% and whether the contig is circular.
   * `sample`_final.fasta containing the final assembly for that sample.


## Snakemake Profiles

I would highly highly recommend running hybracter using a Snakemake profile. Please see this blog [post](https://fame.flinders.edu.au/blog/2021/08/02/snakemake-profiles-updated) for more details. I have included an example slurm profile in the profile directory, but check out this [link](https://github.com/Snakemake-Profiles) for more detail on other HPC job scheduler profiles.

```bash
hybracter hybrid --input <input.csv> --output <output_dir> --threads <threads> --profile profiles/hybracter
```

## Advanced Configuration

Thanks to its Snakemake backend, you can modify resource requirements for each job contained within `hybracter` using the configuration file. A default can be created using the `hybracter config` command. This can make it even more efficient in a server environment, as many jobs can be parallelised more efficiently than with the default settings. For more information, please see the [documentation](https://hybracter.readthedocs.io/en/latest/configuration/).

## Older Updates 

### v0.7.0 Updates (04 March 2024)

**Changes to short read polishing**

* Logic added to run `polypolish` v0.6.0 with `--careful` and skip pypolca if the SR coverage estimate is below 5x (note: FASTA files for pypolca will be generated in the processing directory to play nice with Snakemake, but these will be identical to the polypolish output). 
* For 5-25x coverage, `polypolish --careful` and `pypolca` with `--careful` will be run. 
* For >25x coverage, `polypolish` default and `pypolca` with `--careful` will be run. 
* A preprint justifying these changes will be available soon.
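
The coverage thresholds above can be sketched as a small shell function. This is only an illustration of the decision rule, not hybracter's actual code: the function name `polish_strategy` is hypothetical, and treating exactly 5x and exactly 25x as part of the middle band is an assumption based on the wording of the bullets.

```bash
# Print the short-read polishing steps for a given coverage estimate.
# Usage: polish_strategy <estimated_short_read_coverage>
polish_strategy() {
    # awk is used so that fractional coverage values (e.g. 4.8) compare correctly.
    awk -v cov="$1" 'BEGIN {
        cov = cov + 0
        if (cov < 5)        print "polypolish --careful only (pypolca skipped)"
        else if (cov <= 25) print "polypolish --careful, then pypolca --careful"
        else                print "polypolish (default), then pypolca --careful"
    }'
}
```

For instance, `polish_strategy 4.8` reports that only `polypolish --careful` would be run, while `polish_strategy 60` reports default `polypolish` followed by `pypolca --careful`.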

**`--logic` changes**

* `--logic` now defaults to `last` for `hybracter hybrid`, as we have found that the polishing strategy implemented above never makes the assembly worse. We suggest never using `--logic best` with `hybracter hybrid`.

**Changes for chromosome contigs and circularity**

* If hybracter assembles a contig that is longer than the minimum chromosome length but is not marked as circular by Flye, it will now be denoted as a chromosome, but not as circular. The genome will still be marked as complete.
    * These will usually be assemblies with some issue (e.g. prophages, circularisation issues, heterogeneity) and probably require some more attention.
    * For example, with the _Vibrio cholerae_ larger chromosome described [here](https://rrwick.github.io/2024/02/15/misassemblies.html), the genome will be marked as 'complete' but the contig will not be marked as 'circular' in the `hybracter` output.
    * Such contigs will be polished and be in the final `_chromosome.fasta` output, but they will not be rotated by `dnaapler`. 
    * Such contigs were previously excluded, which meant missing assemblies with structural heterogeneity (causing the chromosome not to completely circularise) or even bacteria with linear chromosomes like [_Borrelia_](https://www.nature.com/articles/37551).
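
The v0.7.0 behaviour described above can be summarised with a short sketch. It is not taken from the hybracter source: the function name `classify_contig` and the output labels are hypothetical, and the `Y`/`N` values are assumed to come from the `circ.` column of Flye's `assembly_info.txt`.

```bash
# Classify one contig using its length and Flye's circularity flag.
# Usage: classify_contig <length_bp> <min_chromosome_length_bp> <Y|N>
classify_contig() {
    local length="$1" min_chrom_len="$2" flye_circular="$3"
    if [ "$length" -gt "$min_chrom_len" ]; then
        if [ "$flye_circular" = "Y" ]; then
            # Circular chromosome: polished and rotated by dnaapler.
            echo "chromosome circular rotated"
        else
            # Since v0.7.0: still a chromosome, polished, but not rotated.
            echo "chromosome not_circular not_rotated"
        fi
    else
        echo "not_chromosome"
    fi
}
```

For example, `classify_contig 2900000 2500000 N` prints `chromosome not_circular not_rotated`, which corresponds to a sample marked 'complete' whose chromosome is not marked 'circular'.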

**Adds `--depth_filter`** 

* This is passed to [Plassembler](https://github.com/gbouras13/plassembler) and will filter out all putative plasmid contigs that are lower than this depth fraction compared to the chromosome.
* Defaults to 0.25 like Unicycler's implementation.
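
To make the depth fraction concrete, here is a minimal sketch of the comparison. It is an illustration only and not Plassembler's implementation; the function name `keep_plasmid_contig` and its arguments are hypothetical.

```bash
# Succeed (exit 0) if a putative plasmid contig passes the depth filter.
# Usage: keep_plasmid_contig <plasmid_depth> <chromosome_depth> [depth_filter]
keep_plasmid_contig() {
    local plasmid_depth="$1" chrom_depth="$2" depth_filter="${3:-0.25}"
    # Keep the contig unless its depth is below depth_filter x chromosome depth.
    awk -v p="$plasmid_depth" -v c="$chrom_depth" -v f="$depth_filter" \
        'BEGIN { exit !((p + 0) >= (f + 0) * (c + 0)) }'
}
```

With a chromosome at 100x and the default of 0.25, a contig at 30x would be kept and a contig at 10x would be filtered out; lowering the fraction to 0.05 would keep the 10x contig.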

### v0.5.0 Updates (08 January 2024)

Ryan Wick recently ran `hybracter long` on the latest Dorado v0.5.0 basecalled Nanopore reads (his [blog post](https://rrwick.github.io/2023/12/18/ont-only-accuracy-update.html)). You can read a write-up of the results [here](https://hybracter.readthedocs.io/en/latest/dorado_ryan_louise_0_5_0/). As a result, subsampling has been added to Hybracter. 

* Adds subsampling with Filtlong via `--subsample_depth`, based on some benchmarking of Dorado v0.5.0. Defaults to 100x of the estimated chromosome size `-c`.
* Also adds stricter criteria for complete assemblies (i.e. identified chromosomes must be circularised according to Flye).
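
The arithmetic behind the subsampling depth is simple enough to sketch. The helper name `subsample_target_bases` is hypothetical, and using the result with Filtlong's `--target_bases` option is an assumption about how such a target could be applied, not a description of hybracter's internals.

```bash
# Number of bases to keep for a given chromosome size and subsampling depth.
# Usage: subsample_target_bases <chromosome_size_bp> [depth]
subsample_target_bases() {
    local chrom_size="$1" depth="${2:-100}"
    echo $(( chrom_size * depth ))
}

# Hypothetical usage with Filtlong (file names are placeholders):
#   filtlong --target_bases "$(subsample_target_bases 2500000)" reads.fastq.gz \
#       | gzip > subsampled.fastq.gz
```

For a 2.5 Mbp chromosome estimate and the default depth of 100, the target is 250000000 bases.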
  
### v0.4.0 Updates (14 November 2023)

* Adds `--logic` parameter. You have 2 choices: `--logic best` (the default) or `--logic last`.
* `--logic best` will run `hybracter` as normal and the best assembly (by ALE or pyrodigal mean length) will be selected as the final assembly.
* `--logic last` will force `hybracter` to pick the last polished round as the final assembly even if it is not the best as per ALE/pyrodigal. So for `hybracter hybrid` this will be the pypolca polished round, and for `hybracter long` it will be Medaka round 2. You may wish to use this if you want all your isolates to be consistently assembled.
* Adds reorientation of the pre-polished chromosome in case it is selected as the best assembly.
* Adds fixes to the chromosome comparisons - now it is much easier to interpret any changes between polishing rounds.
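
The difference between the two `--logic` choices can be sketched as follows. This is a simplified illustration rather than hybracter's code: the function name `select_assembly` and the round names are hypothetical, and it assumes one score per round (ALE score or pyrodigal mean length) where a higher value is better.

```bash
# Pick the final assembly from "round<TAB>score" lines given in pipeline order.
# Usage: select_assembly <best|last> < rounds.tsv
select_assembly() {
    local logic="$1"
    if [ "$logic" = "last" ]; then
        # Always take the final polishing round, whatever its score.
        tail -n 1 | cut -f 1
    else
        # Take the highest-scoring round; on ties the earliest round is kept.
        awk -F '\t' 'NR == 1 || $2 + 0 > best { best = $2 + 0; name = $1 }
                     END { print name }'
    fi
}
```

Given three rounds scored -100, -90 and -95 in that order, `best` would select the second round and `last` would select the third.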


### v0.2.0 Updates 26 October 2023 - Medaka, Polishing and `--no_medaka`

Ryan Wick's [blogpost](https://rrwick.github.io/2023/10/24/ont-only-accuracy-update.html) on 24 October 2023 suggests that if you have new 5Hz SUP or Res (bacterial model specific) ONT reads, Medaka polishing often makes things worse! It also implies that Nanopore reads are almost good enough to assemble perfect bacterial genomes (at least with Trycycler) which is pretty awesome.

Combined with the difficulty and randomness in installing Medaka from Nanopore, I have therefore decided to add a `--no_medaka` flag into v0.2.0. 

I have also set Medaka to be v1.8.0 and I do not intend to upgrade this going forward, as this is the most recent stable bioconda version that doesn't seem to cause too much grief. 

If you have trouble with Medaka installation, I'd suggest using `--no_medaka`.

`hybracter` should still handle cases where Medaka makes assemblies worse. If Medaka makes your assembly appreciably worse, `hybracter` should choose the unpolished assembly as the best (most accurate) one in long mode.


## Version Log

A brief description of what is new in each update of `hybracter` can be found in the HISTORY.md file.

## System

`hybracter` is tested on Linux and macOS.

## Bugs and Suggestions

If you come across bugs with `hybracter`, or would like to make any suggestions to improve the program, please open an issue or email george.bouras@adelaide.edu.au.

# Citation

If you use Hybracter, please cite the manuscript along with core dependencies (they are also our tools!):

Hybracter Manuscript
* George Bouras, Ghais Houtak, Ryan R Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M Judd, Anna E Sheppard, Robert A Edwards, Sarah Vreugde - Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies. (2024) _Microbial Genomics_ doi: https://doi.org/10.1099/mgen.0.001244.

Plassembler:
* Bouras G., Sheppard A.E., Mallawaarachchi V., Vreugde S., Plassembler: an automated bacterial plasmid assembly tool, Bioinformatics, Volume 39, Issue 7, July 2023, btad409, https://doi.org/10.1093/bioinformatics/btad409. 

Dnaapler:
* George Bouras, Susanna R. Grigson, Bhavya Papudeshi, Vijini Mallawaarachchi, Michael J. Roach (2024). Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93), 5968, https://doi.org/10.21105/joss.05968.

Ryan Wick et al.'s "Assembling the perfect bacterial genome" paper, which provided the intellectual framework for hybracter:
* Wick RR, Judd LM, Holt KE (2023) Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing. PLoS Comput Biol 19(3): e1010905. https://doi.org/10.1371/journal.pcbi.1010905

I would also recommend citing Hybracter's other dependencies, where they are used, if you can:

Flye:
* Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540–546 (2019). https://doi.org/10.1038/s41587-019-0072-8

Snaketool:
* Roach MJ, Pierce-Ward NT, Suchecki R, Mallawaarachchi V, Papudeshi B, Handley SA, et al. (2022) Ten simple rules and a template for creating workflows-as-applications. PLoS Comput Biol 18(12): e1010705. https://doi.org/10.1371/journal.pcbi.1010705

Trimnami:
* Roach MJ. (2023) Trimnami. https://github.com/beardymcjohnface/Trimnami.

Filtlong:
* Wick RR (2018) Filtlong. https://github.com/rrwick/Filtlong.

Porechop and Porechop_abi:
* Quentin Bonenfant, Laurent Noé, Hélène Touzet, Porechop_ABI: discovering unknown adapters in Oxford Nanopore Technology sequencing reads for downstream trimming, Bioinformatics Advances, Volume 3, Issue 1, 2023, vbac085, https://doi.org/10.1093/bioadv/vbac085
* Wick RR (2017) https://github.com/rrwick/Porechop.

fastp:
* Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560. 

ALE:
* Scott C. Clark, Rob Egan, Peter I. Frazier, Zhong Wang, ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies, Bioinformatics, Volume 29, Issue 4, February 2013, Pages 435–443, https://doi.org/10.1093/bioinformatics/bts723

Medaka:
* Oxford Nanopore Technologies, Medaka. https://github.com/nanoporetech/medaka.

Pyrodigal:
* Larralde, M., (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296, https://doi.org/10.21105/joss.04296.

Polypolish:
* Wick RR, Holt KE (2022) Polypolish: Short-read polishing of long-read bacterial genome assemblies. PLoS Comput Biol 18(1): e1009802. https://doi.org/10.1371/journal.pcbi.1009802.

Pypolca:
* Bouras G, Judd LM, Edwards RA, Vreugde S, Stinear TP, Wick RR (2024) How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies. bioRxiv 2024.03.07.584013; doi: [https://doi.org/10.1101/2024.03.07.584013](https://doi.org/10.1101/2024.03.07.584013).
* Zimin AV, Salzberg SL (2020) The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput Biol 16(6): e1007981. https://doi.org/10.1371/journal.pcbi.1007981. 

Snakemake:
* Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:33 (https://doi.org/10.12688/f1000research.29032.1).

KMC:
* Marek Kokot, Maciej Długosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2759–2761, (https://doi.org/10.1093/bioinformatics/btx304).

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/gbouras13/hybracter",
    "name": "hybracter",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "George Bouras",
    "author_email": "george.bouras@adelaide.edu.au",
    "download_url": "https://files.pythonhosted.org/packages/ba/92/c7cb23da78df7ab9026da77f6d41cb2673b224da73cc38b6633174530dc1/hybracter-0.11.1.tar.gz",
    "platform": null,
    "description": "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gbouras13/hybracter/blob/main/run_hybracter.ipynb)\n\n[![Paper](https://img.shields.io/badge/paper-Microbial_Genomics-green.svg?style=flat-square&maxAge=3600)](https://doi.org/10.1099/mgen.0.001244)\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![](https://img.shields.io/static/v1?label=CLI&message=Snaketool&color=blueviolet)](https://github.com/beardymcjohnface/Snaketool)\n![GitHub last commit (branch)](https://img.shields.io/github/last-commit/gbouras13/hybracter/dev?color=8a35da)\n[![Code DOI](https://zenodo.org/badge/574521745.svg)](https://zenodo.org/badge/latestdoi/574521745)\n\n[![Anaconda-Server Badge](https://anaconda.org/bioconda/hybracter/badges/version.svg)](https://anaconda.org/bioconda/hybracter)\n[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/hybracter)](https://img.shields.io/conda/dn/bioconda/hybracter)\n[![PyPI version](https://badge.fury.io/py/hybracter.svg)](https://badge.fury.io/py/hybracter)\n[![Downloads](https://static.pepy.tech/badge/hybracter)](https://pepy.tech/project/hybracter)\n\n# Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies\n\n`hybracter` is an automated long-read first bacterial genome assembly tool implemented in Snakemake using [Snaketool](https://github.com/beardymcjohnface/Snaketool). 
\n\n## Table of Contents\n\n- [Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies](#hybracter-enabling-scalable-automated-complete-and-accurate-bacterial-genome-assemblies)\n  - [Table of Contents](#table-of-contents)\n  - [Quick Start](#quick-start)\n    - [Conda](#conda)\n    - [Container](#container)\n    - [Google Colab Notebooks](#google-colab-notebooks)\n  - [Documentation](#documentation)\n  - [Manuscript](#manuscript)\n  - [Description](#description)\n  - [Pipeline](#pipeline)\n  - [Benchmarking](#benchmarking)\n  - [Recent Updates](#recent-updates)\n    - [v0.11.0 Updates (4 December 2024)](#v0110-updates-4-december-2024)\n    - [v0.10.0 Updates (17 October 2024)](#v0100-updates-17-october-2024)\n    - [v0.9.0 Updates (18 September 2024)](#v090-updates-18-september-2024)\n  - [Why Would You Run Hybracter?](#why-would-you-run-hybracter)\n  - [Other Options](#other-options)\n      - [Trycycler](#trycycler)\n      - [Dragonflye](#dragonflye)\n  - [Installation](#installation)\n    - [Conda](#conda-1)\n    - [Pip](#pip)\n    - [Source](#source)\n  - [Main Commands](#main-commands)\n  - [Input csv](#input-csv)\n      - [`hybracter hybrid`](#hybracter-hybrid)\n      - [`hybracter long`](#hybracter-long)\n  - [Usage](#usage)\n      - [`hybracter install`](#hybracter-install)\n      - [Installing Dependencies](#installing-dependencies)\n      - [`hybracter hybrid`](#hybracter-hybrid-1)\n      - [`hybracter hybrid-single`](#hybracter-hybrid-single)\n      - [`hybracter long`](#hybracter-long-1)\n      - [`hybracter long-single`](#hybracter-long-single)\n  - [Outputs](#outputs)\n    - [Main Output Files](#main-output-files)\n  - [Snakemake Profiles](#snakemake-profiles)\n  - [Advanced Configuration](#advanced-configuration)\n  - [Older Updates](#older-updates)\n    - [v0.7.0 Updates (04 March 2024)](#v070-updates-04-march-2024)\n    - [v0.5.0 Updates (08 January 2024)](#v050-updates-08-january-2024)\n    - [v0.4.0 Updates (14 
November 2023)](#v040-updates-14-november-2023)\n    - [v0.2.0 Updates 26 October 2023 - Medaka, Polishing and `--no_medaka`](#v020-updates-26-october-2023---medaka-polishing-and---no_medaka)\n  - [Version Log](#version-log)\n  - [System](#system)\n  - [Bugs and Suggestions](#bugs-and-suggestions)\n- [Citation](#citation)\n\n\n## Quick Start\n\n### Conda\n\n`hybracter` is available to install with `pip` or `conda`.\n\nYou will need conda or mamba available so `hybracter` can install all the required dependencies. \n\nTherefore, it is recommended to install `hybracter` into a conda environment as follows.\n\n```bash\nconda create -n hybracterENV -c bioconda -c conda-forge  hybracter\nconda activate hybracterENV\nhybracter --help\nhybracter install\n```\n\nMiniforge is **highly highly** recommended. Please see the [documentation](https://hybracter.readthedocs.io/en/latest/install/) for more details on how to install Miniforge.\n\nWhen you run `hybracter` for the first time, all the required dependencies will be installed as required, so it will take longer than usual (usually a few minutes). Every time you run it afterwards, it will be a lot faster as the dependencies will be installed.\n\nIf you intend to run hybracter offline (e.g. on HPC nodes with no access to the internet), I highly recommend running `hybracter test-hybrid` and/or `hybracter test-long` on a node with internet access so hybracter can download the required dependencies. It should take 5-10 minutes. If your computer/node has internet access, please skip this step.\n\n```\nhybracter test-hybrid --threads 8\nhybracter test-long --threads 8\n```\n\n* Note: if you are installing Hybracter on a mac, please use `--mac` - this will install Medaka v1.8 (not v2, which is not available for MacOS). 
Alternatively, if you want Medaka v2, you should try the container option with Docker.\n\n### Container\n\nAlternatively, a Docker/Singularity Linux container image is available for Hybracter (starting from v0.7.1) [here](https://quay.io/repository/gbouras13/hybracter). This will likely be useful for running Hybracter in HPC environments.\n\n* **Note** the container image comes with the database and all environments installed - there is no need to run `hybracter install` or `hybracter test-hybrid`/`hybracter test-long` or to specify a database directory with `-d`.\n\nTo install and run v0.11.0 with singularity\n\n```bash\n\nIMAGE_DIR=\"<the directory you want the .sif file to be in >\"\nsingularity pull --dir $IMAGE_DIR docker://quay.io/gbouras13/hybracter:0.11.0\n\ncontainerImage=\"$IMAGE_DIR/hybracter_0.11.0.sif\"\n\n# example command with test fastqs\n singularity exec $containerImage    hybracter hybrid-single -l test_data/Fastqs/test_long_reads.fastq.gz \\\n -1 test_data/Fastqs/test_short_reads_R1.fastq.gz  -2 test_data/Fastqs/test_short_reads_R2.fastq.gz \\\n -o output_test_singularity -t 4 --auto\n```\n\nTo install and run v0.11.0 with Docker (recommended if you have a Mac as it has Medaka v2)\n\n```\ndocker pull quay.io/gbouras13/hybracter:0.11.0\ndocker run quay.io/gbouras13/hybracter:0.11.0  hybracter -h\n# -v mounts directories from your local filesystem to the docker contaier\ndocker run --rm -v /path/to/my/test/fastqs:/data -v /path/to/where/i/want/the/output:/output quay.io/gbouras13/hybracter:0.11.0 hybracter hybrid-single \\\n  -l /data/test_long_reads.fastq.gz \\\n  -1 /data/test_short_reads_R1.fastq.gz \\\n  -2 /data/test_short_reads_R2.fastq.gz \\\n  -o /output/output_test_docker -t 4 \u2013auto \n```\n\n\n### Google Colab Notebooks\n\nIf you don't want to install `hybracter` locally, you can run it without any code using the colab notebook 
[https://colab.research.google.com/github/gbouras13/hybracter/blob/main/run_hybracter.ipynb](https://colab.research.google.com/github/gbouras13/hybracter/blob/main/run_hybracter.ipynb)\n\nThis is only recommended if you have one or a few samples to assemble (it takes a while per sample due to the limited nature of Google Colab resources - probably an hour or two a sample). If you have more than this, a local install as described below is suggested.\n\n## Documentation\n\nDocumentation for `hybracter` is available [here](https://hybracter.readthedocs.io/en/latest/).\n\n## Manuscript \n\n`hybracter` has recently been published in _Microbial Genomics_\n\n* George Bouras, Ghais Houtak, Ryan R Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M Judd, Anna E Sheppard, Robert A Edwards, Sarah Vreugde - Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies. (2024) _Microbial Genomics_ doi: https://doi.org/10.1099/mgen.0.001244.\n\n## Description\n\n`hybracter` is designed for assembling bacterial isolate genomes using a long read first assembly approach. \nIt scales massively using the embarrassingly parallel power of HPC and Snakemake profiles. It is designed for applications where you have isolates with Oxford Nanopore Technologies (ONT) long reads and optionally matched paired-end short reads for polishing.\n\n`hybracter` is designed to straddle the fine line between being as fully feature-rich as possible with as much information as you need to decide upon the best assembly, while also being a one-line automated program. In other words, as awesome as Unicycler, but updated for 2023. Perfect for lazy people like myself.\n\n`hybracter` is largely based off Ryan Wick's [magnificent tutorial](https://github.com/rrwick/Perfect-bacterial-genome-tutorial) and associated [paper](https://doi.org/10.1371/journal.pcbi.1010905). 
`hybracter` differs in that it adds some additional steps regarding targeted plasmid assembly with [plassembler](https://github.com/gbouras13/plassembler), contig reorientation with [dnaapler](https://github.com/gbouras13/dnaapler) and extra polishing and statistical summaries.\n\nNote: if you have Pacbio reads, as of 2023, you can run  `hybracter long` with `--no_medaka` to turn off polishing, and  `--flyeModel pacbio-hifi`. You can also probably just run [Flye](https://github.com/fenderglass/Flye) or [Dragonflye](https://github.com/rpetit3/dragonflye) (or of course [Trycyler](https://github.com/rrwick/Trycycler) ) and reorient the contigs with [dnaapler](https://github.com/gbouras13/dnaapler) without polishing. See Ryan Wick's [blogpost](https://doi.org/10.5281/zenodo.7703461) for more details. \n\n## Pipeline\n\n<p align=\"center\">\n  <img src=\"img/hybracter.png\" alt=\"Hybracter\" height=600>\n</p>\n\n- A. Reads are quality controlled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami).\n- B. Long-read assembly is conducted with [Flye](https://github.com/fenderglass/Flye). Each sample is classified if the chromosome(s) were assembled (marked as 'complete') or not (marked as 'incomplete') based on the given minimum chromosome length.\n- C. For complete isolates, plasmid recovery with [Plassembler](https://github.com/gbouras13/plassembler).\n- D. For all isolates, long read polishing with [Medaka](https://github.com/nanoporetech/medaka).\n- E. For complete isolates, the chromosome is reorientated to begin with the dnaA gene with [dnaapler](https://github.com/gbouras13/dnaapler).\n- F. 
For all isolates, if short reads are provided, short read polishing with [Polypolish](https://github.com/rrwick/Polypolish) and [pypolca](https://github.com/gbouras13/pypolca).\n- G. For all isolates, assessment of all assemblies with [ALE](https://github.com/sc932/ALE) for `hybracter hybrid` or [Pyrodigal](https://github.com/althonos/pyrodigal) for `hybracter long`.\n- H. The best assembly is selected and output along with final assembly statistics.\n\n## Benchmarking\n\n`hybracter` was benchmarked in both hybrid and long modes (specifically using the `hybrid-single` and `long-single` commands) against [Unicycler](https://github.com/rrwick/Unicycler) v0.5.0 and [Dragonflye](https://github.com/rpetit3/dragonflye) v1.1.2.\n\n30 samples from 5 studies with available reference genomes were benchmarked. You can see the full explanation and results [here](https://hybracter.readthedocs.io/en/latest/benchmarking/). You can find all the output [here](https://doi.org/10.5281/zenodo.10906937).\n\nTo summarise the conclusions:\n\n* `Hybracter hybrid` was superior to Unicycler in terms of accuracy, time taken and (slightly) in terms of plasmid recovery. It should be preferred to Unicycler.\n* You should use `hybracter long` if you care about plasmids and have only long reads. It performs similarly to hybrid methods and its inclusion of [Plassembler](https://github.com/gbouras13/plassembler) largely seems to solve the [problem of long read assemblers recovering small plasmids](https://doi.org/10.1099/mgen.0.001024).\n* `Hybracter` in both modes is inferior to Dragonflye in terms of time though better in terms of chromosome accuracy. 
\n* If you want the fastest possible chromosome assemblies for applications like species ID or sequence typing that retain a high level of accuracy, Dragonflye is a good option.\n* Dragonflye should not be used if you care about recovering plasmids.\n\n## Recent Updates\n\n### v0.11.0 Updates (4 December 2024)\n\n* Replaces [kmc](https://github.com/refresh-bio/KMC) with [lrge](https://github.com/mbhall88/lrge) when using `--auto`, a much faster tool designed for the purpose of estimating genome size from long reads. It is very very fast and robust. \n    * If your input has more than 5000 long reads (it should!), [lrge](https://github.com/mbhall88/lrge) will run in default settings. If it has under this, then it will run a (slightly) more computationally expensive all-vs-all mode with all input reads. In practice, if you have such low read counts, you should take all downstream analysis (inclduing lrge and hybracter) with a lot of caution anyway. \n    * According to the [preprint](https://www.biorxiv.org/content/10.1101/2024.11.27.625777v1) (and my less exhaustive testing), lrge is more accurate and much faster than kmc, but I would still be careful using it on data that has lower quality than < Q15. 
\n* Nothing else changes - the estimated chromosome size used by Hybracter will still be 80% of the estimate, as it needs to account for plasmids\n* Adds `r1041_e82_400bps_bacterial_methylation` as an option for `--medakaModel` thanks to [this issue](https://github.com/gbouras13/hybracter/issues/108).\n* Note this won't work if you run `hybracter` on a Mac (as medaka v2 is not available)\n\n### v0.10.0 Updates (17 October 2024)\n\n* Updates Medaka to v2.0.1, implementing the `--bacteria` option by default.\n* This is based on the recommendations of Ryan Wick [here](https://rrwick.github.io/2024/10/17/medaka-v2.html) who found it improved assemblies due to (likely) enhanced methylation error correction.\n* If you still want to specify a Medaka model, the flag `--medaka_override` has been added. You need to include this along with your model via `--medakaModel`. This is most likely useful for older R9 data.\n* * Adds `--extra_params_flye` parameter if you want to specify extra commands for the Flye assembly step.\n\n### v0.9.0 Updates (18 September 2024)\n\n**`--auto` for automatic estimation of chromosome size**\n\n**Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), `--auto` is _not_ recommended. Users have reported that it can tend to overestimate the chromosome size as more erroneous 21-mers will be counted by kmc than expected. Please specify a chromosome size for this type of data.**\n\n* Thanks to an [issue](https://github.com/gbouras13/hybracter/issues/90) and code from @[richardstoeckl](https://github.com/richardstoeckl), Hybracter can now estimate the estimated chromosome size for each sample by passing `--auto`. \n* The implementation uses [kmc](https://github.com/refresh-bio/KMC). Specifically, Hybracter uses kmc to count the number of unique 21mers that appear at least 10 times in your long-read FASTQ file. 
This is because, for a given assembly of length L,  and a k-mer size of k, the total number of unique possible k-mers  will be given by ( L \u2013 k ) + 1, and if L >> k, then it suffices as an estimate of total assembly size\n* The estimated chromosome size used by Hybracter will actually be 80% of the number of 21-mers found at least 10 times, as it needs to account for plasmids\n* If you aren't sure whether you have enough data for assembly (i.e. coverage lower than 20x), be careful using `--auto`, because the actual assembly size will tend to be larger than the number of unique 21mers found at least 10 times. Therefore, the estimated chromosome size will almost certainly be an underestimate and may lead to Hybracter considering your assembly \"complete\" when in fact it isn't.\n\n* If you use `--auto`, you do not need to specify the chromosome length in the input. This means you don't need to `-c` with `long-single` or `hybrid-single` and in the input csv sample sheet, you do not need a column with chromosome length.\n  \ne.g. for `hybracter long` you only need 2 columns with sample name and long-read FASTQ file path:\n\n```bash\ns_aureus_sample1,sample1_long_read.fastq.gz\np_aeruginosa_sample2,sample2_long_read.fastq.gz\n```\n\nand for `hybracter hybrid` you only need 4 columns with sample name, long-read FASTQ, and R1 and R2 short-read FASTQ file paths:\n\n```bash\ns_aureus_sample1,sample1_long_read.fastq.gz,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz\np_aeruginosa_sample2,sample2_long_read.fastq.gz,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz\n```\n\n**Other changes**\n\n* Hybracter v0.9.0 will automatically support the reorientation of archaeal chromosomes (thanks @[richardstoeckl](https://github.com/richardstoeckl)) to begin with the cog1474 Orc1/cdc6 gene.\n* `--datadir` can now also accept 2 paths separated by a comma, if you have long reads and short reads in separate directories e.g. 
`--datadir \"long_read_dir,short_read_dir\"` (https://github.com/gbouras13/hybracter/issues/76).\n* `--min_depth` parameter added. Hybracter will error out if your QC'd long reads have a coverage lower than `min_depth` for a sample (https://github.com/gbouras13/hybracter/issues/89).\n\n## Why Would You Run Hybracter?\n\n* If you want the best possible _automated_ long read only or hybrid bacterial isolate genome assembly.\n* In other words, if you love Unicycler like I do, but want something faster and more accurate.\n* If you need to assemble many (e.g. 10+) bacterial isolates as efficiently as possible.\n* If you want all information about from assembly pipeline such as whether your polishing probably improved the genome, whether your assembly was likely complete, and how many plasmids you probably assembled.\n\n## Other Options\n\n#### Trycycler\n\nIf you are looking for the best possible (manual) bacterial assembly for a single isolate, please definitely use [Trycyler](https://github.com/rrwick/Trycycler). \n\n  * `hybracter` will almost certainly not give you better assemblies than Trycycler. Trycycler is the gold standard for a reason.\n  * `hybracter` is automated, scalable, faster and requires less bioinformatics/microbial genomics expertise to run. \n  * If you use Trycycler, I would also highly recommend using (disclaimer: my own program) [plassembler](https://github.com/gbouras13/plassembler) (which is built into hybracter) alongside Trycycler to assemble small plasmids if you are especially interested in those, because long read only assemblies often [miss small plasmids](https://doi.org/10.1099/mgen.0.001024).\n\n#### Dragonflye\n\n[Dragonflye](https://github.com/rpetit3/dragonflye) by the awesome @[rpetit3](https://github.com/rpetit3) is a good alternative for automated assembly if `hybracter` doesn't fit your needs, particularly if you are familiar with [Shovill](https://github.com/tseemann/shovill). 
Some pros and cons between `hybracter` and `dragonflye` are listed below.\n\n  * `dragonflye` allows for more options with regards to assemblers (it supports [Miniasm](https://github.com/lh3/miniasm) or [Raven](https://github.com/lbcb-sci/raven) as well as Flye).\n  * On a single isolate, `dragonflye` should be faster.\n  * `hybracter` should be more accurate, due to the extra round of polishing following reorientation, and integration of Plassembler.\n  * `hybracter` has the advantage of scalability across multiple samples due to its Snakemake and Snaketool implementation. \n  * So if you have access to a cluster, `hybracter` is for you and likely faster.\n  * `hybracter` gives more accurate plasmid assemblies because it uses [plassembler](https://github.com/gbouras13/plassembler)\n  * `hybracter` will suggest automatically whether an assembly is 'complete' or 'incomplete'\n  * `hybracter` will assess each polishing step and choose the genome most likely to be the best quality.\n\n## Installation\n\nYou will need conda  (**highly recommended** through miniforge) to run `hybracter`, because it is required for the installation of each compartmentalised environment (e.g. Flye will have its own environment). Please see the [documentation](https://hybracter.readthedocs.io/en/latest/install/) for more details on how to install miniforge.\n\n### Conda\n\n`hybracter` is available to install with `conda`. To install `hybracter` into a conda environment called `hybracterENV`:\n\n```bash\nconda create -n hybracterENV hybracter\nconda activate hybracterENV\nhybracter --help\nhybracter install\n```\n\n### Pip\n\n`hybracter` is available to install with `pip` . \n\nYou will also need conda available so `hybracter` can install all the required dependencies. 
Therefore, it is recommended to install `hybracter` into a conda environment as follows.\n\n\n```bash\nconda create -n hybracterENV pip\nconda activate hybracterENV\npip install hybracter\nhybracter --help\nhybracter install\n```\n\n### Source\n\nAlternatively, the development version of `hybracter` (which may include new, untested features) can be installed manually via github. \n\n```bash\ngit clone https://github.com/gbouras13/hybracter.git\ncd hybracter\npip install -e .\nhybracter --help\n```\n\n## Main Commands\n\n* `hybracter hybrid`: Assemble multiple genomes from isolates that have long-reads and paired-end short reads.\n* `hybracter hybrid-single`: Assembles a single genome from an isolate with long-reads and paired-end short reads. It takes similar parameters to [Unicycler](https://github.com/rrwick/Unicycler).\n* `hybracter long`: Assemble multiple genomes from isolates that have long-reads only.\n* `hybracter long-single`: Assembles a single genome from an isolate with long-reads only.\n* `hybracter install`: Downloads and installs the required `plassembler` database.\n\n```bash\n _           _                    _            \n| |__  _   _| |__  _ __ __ _  ___| |_ ___ _ __ \n| '_ \\| | | | '_ \\| '__/ _` |/ __| __/ _ \\ '__|\n| | | | |_| | |_) | | | (_| | (__| ||  __/ |   \n|_| |_|\\__, |_.__/|_|  \\__,_|\\___|\\__\\___|_|   \n       |___/\n\n\nUsage: hybracter [OPTIONS] COMMAND [ARGS]...\n\n  For more options, run: hybracter command --help\n\nOptions:\n  -h, --help  Show this message and exit.\n\nCommands:\n  install        Downloads and installs the plassembler database\n  hybrid         Run hybracter with hybrid long and paired end short reads\n  hybrid-single  Run hybracter hybrid on 1 isolate\n  long           Run hybracter with only long reads\n  long-single    Run hybracter long on 1 isolate\n  test-hybrid    Test hybracter hybrid\n  test-long      Test hybracter long\n  config         Copy the system default config file\n  citation       Print 
the citation(s) for hybracter\n  version        Print the version for hybracter\n```\n\n## Input csv\n\n`hybracter hybrid` and `hybracter long` require an input csv file to be specified with `--input`. No other inputs are required.\n\n* This file requires no headers.\n* Other than the reads, `hybracter` requires a lower bound on the chromosome length for each isolate, in base pairs. It must be an integer.\n* `hybracter` will denote contigs above this value as chromosome(s) and, if it can recover a chromosome, it will denote the isolate as complete.\n* In practice, I suggest choosing 90% of the estimated chromosome size for this value.\n* e.g. for _S. aureus_, I'd choose 2500000, _E. coli_, 4000000, _P. aeruginosa_ 5500000.\n\n#### `hybracter hybrid`\n\n* `hybracter hybrid` requires an input csv file with 5 columns. \n* Each row is a sample.\n* Column 1 is the sample name you want for this isolate. \n* Column 2 is the long read fastq file.\n* Column 3 is the minimum chromosome length for that sample.\n* Column 4 is the R1 short read fastq file.\n* Column 5 is the R2 short read fastq file.\n\ne.g.\n\n```bash\ns_aureus_sample1,sample1_long_read.fastq.gz,2500000,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz\np_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz\n```\n\n**Using `--auto`**\n\n* If you use `--auto`, you can remove the column with the chromosome length, leaving 4 columns.\n\ne.g.\n\n```bash\ns_aureus_sample1,sample1_long_read.fastq.gz,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz\np_aeruginosa_sample2,sample2_long_read.fastq.gz,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz\n```\n\n#### `hybracter long`\n\n`hybracter long` also requires an input csv with no headers, but only 3 columns.\n\n* Each row is a sample.\n* Column 1 is the sample name you want for this isolate. 
\n* Column 2 is the long read fastq file.\n* Column 3 is the minimum chromosome length for that sample.\n\ne.g.\n\n```bash\ns_aureus_sample1,sample1_long_read.fastq.gz,2500000\np_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000\n```\n\n**Using `--auto`**\n\n* If you use `--auto`, you can remove the column with the chromosome length, leaving 2 columns.\n\n```bash\ns_aureus_sample1,sample1_long_read.fastq.gz\np_aeruginosa_sample2,sample2_long_read.fastq.gz\n```\n\n## Usage\n\n#### `hybracter install`\n\nYou will first need to install the `hybracter` databases.\n\n```bash\nhybracter install\n```\n\nAlternatively, you can specify a particular directory to store them. You will then need to pass the same directory with `-d <databases directory>` when you run `hybracter`.\n\n```bash\nhybracter install -d <databases directory>\n```\n\n#### Installing Dependencies\n\n**If you have internet access on the machine or node where you are running hybracter, you can skip this step.**\n\nWhen you run `hybracter` for the first time, all the required dependencies will be installed, so it will take longer than usual (usually a few minutes). Every time you run it afterwards, it will be a lot faster as the dependencies will already be installed.\n\nIf you intend to run hybracter offline (e.g. on HPC nodes with no access to the internet), I highly recommend running `hybracter test-hybrid` and/or `hybracter test-long` on a node with internet access so hybracter can download the required dependencies. 
It should take 5-10 minutes.\n\n```bash\nhybracter test-hybrid\nhybracter test-long\nhybracter --help\n```\n\nOnce that is done, run `hybracter hybrid` or `hybracter long` as follows.\n\n#### `hybracter hybrid`\n\n```bash\nhybracter hybrid -i <input.csv> -o <output_dir> -t <threads>\n```\n\n* `hybracter hybrid` requires only a CSV file specified with `-i` or `--input`.\n* Use `--min_length` to specify the minimum long-read length for Filtlong.\n* Use `--min_quality` to specify the minimum long-read quality for Filtlong.\n* You can specify a FASTA file containing contaminants with `--contaminants`. All long reads that map to contaminants will be filtered out.\n  * You can specify Escherichia phage lambda (a common contaminant in Nanopore library preparation) using `--contaminants lambda`.\n* `--skip_qc` will skip all read QC steps.\n* You can change the `--medakaModel` (all available options are listed in `hybracter hybrid -h`).\n* You can change the `--flyeModel` (all available options are listed in `hybracter hybrid -h`).\n* You can turn off Medaka polishing using `--no_medaka`.\n* You can turn off pypolca polishing using `--no_pypolca`.\n* You can force `hybracter` to pick the last polishing round (not the best according to ALE) with `--logic last`. `hybracter` defaults to picking the best (according to ALE) i.e. 
`--logic best`.\n\n#### `hybracter hybrid-single`\n\n```bash\nhybracter hybrid-single -l <longread FASTQ> -1 <R1 short reads FASTQ> -2 <R2 short reads FASTQ> -s <sample name> -c <chromosome size> -o <output_dir> -t <threads>  [other arguments]\n```\n\n#### `hybracter long`\n\n```bash\nhybracter long -i <input.csv> -o <output_dir> -t <threads> [other arguments]\n```\n\n* `hybracter long` requires only a CSV file specified with `-i` or `--input`\n* Use `--min_length` to specify the minimum long-read length for Filtlong.\n* Use `--min_quality` to specify the minimum long-read quality for Filtlong.\n* You can specify a FASTA file containing contaminants with `--contaminants`. All long reads that map to contaminants will be filtered.\n  * You can specify Escherichia phage lambda (a common contaminant in Nanopore library preparation) using `--contaminants lambda`.\n* `--skip_qc` will skip all read QC steps.\n* You can change the `--medakaModel` (all available options are listed in `hybracter long -h`)\n* You can change the `--flyeModel` (all available options are listed in `hybracter long -h`)\n* You can turn off Medaka polishing using `--no_medaka`\n* You can force `hybracter` to pick the last polishing round (not the best according to pyrodigal mean CDS length) with `--logic last`. `hybracter` defaults to picking the best i.e. `--logic best`.\n\n\n#### `hybracter long-single`\n\n```bash\nhybracter long-single -l <longread FASTQ> -s <sample name> -c <chromosome size>  -o <output_dir> -t <threads>  [other arguments]\n```\n\n## Outputs \n\n`hybracter` creates a number of output files in different formats. \n\nFor more information about all possible file outputs, please see the documentation here.\n\n### Main Output Files\n\nThe main outputs are in the `FINAL_OUTPUT` directory.\n\nThis directory will include:\n\n1. `hybracter_summary.tsv` file. 
This gives the summary statistics for your assemblies with the following columns:\n\n\n| Sample | Complete (True or False) | Total_assembly_length | Number_of_contigs | Most_accurate_polishing_round | Longest_contig_length | Longest_contig_coverage | Number_circular_plasmids |\n|--------|-----------------------|-------------------------|-------------------|--------|--|--|--|\n\n\n2. `complete` and `incomplete` directories.\n\nAll samples that are denoted by hybracter to be complete will have 5 outputs in the `complete` directory:\n\n   * `sample`_summary.tsv containing the summary statistics for that sample.\n   * `sample`_per_contig_stats.tsv containing the contig names, lengths, GC% and whether the contig is circular.\n   * `sample`_final.fasta containing the final assembly for that sample.\n   * `sample`_chromosome.fasta containing only the final chromosome(s) assembly for that sample.\n   * `sample`_plasmid.fasta containing only the final plasmid(s) assembly for that sample. Note that this file may be empty, in which case the sample had no plasmids. \n\nAll samples that are denoted by hybracter to be incomplete will have 3 outputs in the `incomplete` directory:\n\n   * `sample`_summary.tsv containing the summary statistics for that sample.\n   * `sample`_per_contig_stats.tsv containing the contig names, lengths, GC% and whether the contig is circular.\n   * `sample`_final.fasta containing the final assembly for that sample.\n\n\n## Snakemake Profiles\n\nI would highly recommend running hybracter using a Snakemake profile. Please see this blog [post](https://fame.flinders.edu.au/blog/2021/08/02/snakemake-profiles-updated) for more details. 
I have included an example slurm profile in the profile directory, but check out this [link](https://github.com/Snakemake-Profiles) for more detail on other HPC job scheduler profiles.\n\n```bash\nhybracter hybrid --input <input.csv> --output <output_dir> --threads <threads> --profile profiles/hybracter\n```\n\n## Advanced Configuration\n\nThanks to its Snakemake backend, you can modify resource requirements for each job contained within `hybracter` using the configuration file. A default can be created using the `hybracter config` command. This can make it even more efficient in a server environment, as many jobs can be parallelised more efficiently than with the default settings. For more information, please see the [documentation](https://hybracter.readthedocs.io/en/latest/configuration/). \n\n## Older Updates \n\n### v0.7.0 Updates (04 March 2024)\n\n**Changes to short read polishing**\n\n* Logic added to run `polypolish` v0.6.0 with `--careful` and skip pypolca if the SR coverage estimate is below 5x (note: FASTA files for pypolca will be generated in the processing directory to play nice with Snakemake, but these will be identical to the polypolish output). \n* For 5-25x coverage, `polypolish --careful` and `pypolca` with `--careful` will be run. \n* For >25x coverage, `polypolish` default and `pypolca` with `--careful` will be run. \n* A preprint justifying these changes will be available soon.\n\n**`--logic` changes**\n\n* `--logic` now defaults to `last` for `hybracter hybrid`, as we have found that the polishing strategy implemented above never makes the assembly worse. We suggest never using `--logic best` with `hybracter hybrid`.\n\n**Changes for chromosome contigs and circularity**\n\n* If hybracter assembles a contig that is greater than the minimum chromosome length but not marked as circular by Flye, this will now be denoted as a chromosome, but not circular. The genome will also be marked as complete. 
\n    * These will usually be assemblies with some issue (e.g. prophages, circularisation issues, heterogeneity) and probably require some more attention.\n    * For example, with the _Vibrio cholerae_ larger chromosome described [here](https://rrwick.github.io/2024/02/15/misassemblies.html), the genome will be marked as 'complete' but the contig will not be marked as 'circular' in the `hybracter` output.\n    * Such contigs will be polished and be in the final `_chromosome.fasta` output, but they will not be rotated by `dnaapler`. \n    * These were previously being excluded, which was missing assemblies with structural heterogeneity (causing the chromosome not to completely circularise) or even bacteria with linear chromosomes like [_Borrelia_](https://www.nature.com/articles/37551). \n\n**Adds `--depth_filter`** \n\n* This is passed to [Plassembler](https://github.com/gbouras13/plassembler) and will filter out all putative plasmid contigs that are lower than this depth fraction compared to the chromosome.\n* Defaults to 0.25 like Unicycler's implementation.\n\n### v0.5.0 Updates (08 January 2024)\n\nRyan Wick recently ran `hybracter long` on the latest Dorado v0.5.0 basecalled Nanopore reads (his [blog post](https://rrwick.github.io/2023/12/18/ont-only-accuracy-update.html)). You can read a write-up of the results [here](https://hybracter.readthedocs.io/en/latest/dorado_ryan_louise_0_5_0/). As a result, subsampling has been added to Hybracter. \n\n* Adds subsampling using `--subsample_depth` using Filtlong, based on some benchmarking of Dorado v0.5.0. Defaults to 100x of the estimated chromosome size `-c`.\n* Also adds stricter criteria for complete assemblies (aka ensures that identified chromosomes must be circularised according to Flye).\n  \n### v0.4.0 Updates (14 November 2023)\n\n* Adds `--logic` parameter. 
You have 2 choices: `--logic best` (the default) or `--logic last`.\n* `--logic best` will run `hybracter` as normal and the best assembly (by ALE or pyrodigal mean length) will be selected as the final assembly.\n* `--logic last` will force hybracter to pick the last polished round as the final assembly even if it is not the best as per ALE/pyrodigal. So for `hybracter hybrid` this will be the pypolca polished round, and for `hybracter long` it will be Medaka round 2. You may wish to use this if you want all your isolates to be consistently assembled.\n* Adds reorientation of the pre-polished chromosome in case it is selected as the best assembly.\n* Adds fixes to the chromosome comparisons - now it is much easier to interpret any changes between polishing rounds.\n\n\n### v0.2.0 Updates 26 October 2023 - Medaka, Polishing and `--no_medaka`\n\nRyan Wick's [blogpost](https://rrwick.github.io/2023/10/24/ont-only-accuracy-update.html) on 24 October 2023 suggests that if you have new 5Hz SUP or Res (bacterial model specific) ONT reads, Medaka polishing often makes things worse! It also implies that Nanopore reads are almost good enough to assemble perfect bacterial genomes (at least with Trycycler) which is pretty awesome.\n\nCombined with the difficulty and randomness in installing Medaka from Nanopore, I have therefore decided to add a `--no_medaka` flag into v0.2.0. \n\nI have also set Medaka to be v1.8.0 and I do not intend to upgrade this going forward, as this is the most recent stable bioconda version that doesn't seem to cause too much grief. \n\nIf you have trouble with Medaka installation, I'd therefore suggest using `--no_medaka`.\n\n`hybracter` should still handle cases where Medaka makes assemblies worse. If Medaka makes your assembly appreciably worse, `hybracter` should choose the unpolished assembly as the most accurate one in long mode. 
\n\n\n## Version Log\n\nA brief description of what is new in each update of `hybracter` can be found in the HISTORY.md file.\n\n## System\n\n`hybracter` is tested on Linux and on MacOS.\n\n## Bugs and Suggestions\n\nIf you come across bugs with `hybracter`, or would like to make any suggestions to improve the program, please open an issue or email george.bouras@adelaide.edu.au.\n\n# Citation\n\nIf you use Hybracter, please cite the manuscript along with core dependencies (they are also our tools!):\n\nHybracter Manuscript\n* George Bouras, Ghais Houtak, Ryan R Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M Judd, Anna E Sheppard, Robert A Edwards, Sarah Vreugde - Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies. (2024) _Microbial Genomics_ doi: https://doi.org/10.1099/mgen.0.001244.\n\nPlassembler:\n* Bouras G., Sheppard A.E., Mallawaarachchi V., Vreugde S., Plassembler: an automated bacterial plasmid assembly tool, Bioinformatics, Volume 39, Issue 7, July 2023, btad409, https://doi.org/10.1093/bioinformatics/btad409. \n\nDnaapler:\n* George Bouras, Susanna R. Grigson, Bhavya Papudeshi, Vijini Mallawaarachchi, Michael J. Roach (2024). Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93), 5968, https://doi.org/10.21105/joss.05968.\n\nRyan Wick et al's Assembling the perfect bacterial genome paper, which provided the intellectual framework for hybracter:\n* Wick RR, Judd LM, Holt KE (2023) Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing. PLoS Comput Biol 19(3): e1010905. https://doi.org/10.1371/journal.pcbi.1010905\n\nI would also recommend citing Hybracter's other dependencies if you can where they are used:\n\nFlye:\n* Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540\u2013546 (2019). 
https://doi.org/10.1038/s41587-019-0072-8\n\nSnaketool:\n* Roach MJ, Pierce-Ward NT, Suchecki R, Mallawaarachchi V, Papudeshi B, Handley SA, et al. (2022) Ten simple rules and a template for creating workflows-as-applications. PLoS Comput Biol 18(12): e1010705. https://doi.org/10.1371/journal.pcbi.1010705\n\nTrimnami:\n* Roach MJ. (2023) Trimnami. https://github.com/beardymcjohnface/Trimnami.\n\nFiltlong:\n* Wick RR (2018) Filtlong. https://github.com/rrwick/Filtlong.\n\nPorechop and Porechop_abi:\n* Quentin Bonenfant, Laurent No\u00e9, H\u00e9l\u00e8ne Touzet, Porechop_ABI: discovering unknown adapters in Oxford Nanopore Technology sequencing reads for downstream trimming, Bioinformatics Advances, Volume 3, Issue 1, 2023, vbac085, https://doi.org/10.1093/bioadv/vbac085\n* Wick RR (2017) https://github.com/rrwick/Porechop.\n\nfastp:\n* Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, September 2018, Pages i884\u2013i890, https://doi.org/10.1093/bioinformatics/bty560. \n\nALE:\n* Scott C. Clark, Rob Egan, Peter I. Frazier, Zhong Wang, ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies, Bioinformatics, Volume 29, Issue 4, February 2013, Pages 435\u2013443, https://doi.org/10.1093/bioinformatics/bts723\n\nMedaka:\n* Oxford Nanopore Technologies, Medaka. https://github.com/nanoporetech/medaka.\n\nPyrodigal:\n* Larralde, M., (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296, https://doi.org/10.21105/joss.04296.\n\nPolypolish:\n* Wick RR, Holt KE (2022) Polypolish: Short-read polishing of long-read bacterial genome assemblies. PLoS Comput Biol 18(1): e1009802. https://doi.org/10.1371/journal.pcbi.1009802.\n\nPypolca:\n* Bouras G, Judd LM, Edwards RA, Vreugde S, Stinear TP, Wick RR (2024) How low can you go? 
Short-read polishing of Oxford Nanopore bacterial genome assemblies. bioRxiv 2024.03.07.584013; doi: [https://doi.org/10.1101/2024.03.07.584013](https://doi.org/10.1101/2024.03.07.584013).\n* Zimin AV, Salzberg SL (2020) The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput Biol 16(6): e1007981. https://doi.org/10.1371/journal.pcbi.1007981. \n\nSnakemake:\n* M\u00f6lder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:33 (https://doi.org/10.12688/f1000research.29032.1).\n\nKMC:\n* Marek Kokot, Maciej D\u0142ugosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2759\u20132761, (https://doi.org/10.1093/bioinformatics/btx304).\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "An automated long-read first bacterial genome assembly pipeline.",
    "version": "0.11.1",
    "project_urls": {
        "Homepage": "https://github.com/gbouras13/hybracter"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4e8552757e2bcefb4a8c5539ed9bb6a4877ac4d0434414ed9820ed5b33ade44c",
                "md5": "e2b9ec8e8e895982d4c80bd5c6328a22",
                "sha256": "559fa532d2661558332052126e9ecfb984820dfdbed283baa174778ca92c2d98"
            },
            "downloads": -1,
            "filename": "hybracter-0.11.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e2b9ec8e8e895982d4c80bd5c6328a22",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 16268686,
            "upload_time": "2025-01-21T00:10:50",
            "upload_time_iso_8601": "2025-01-21T00:10:50.986905Z",
            "url": "https://files.pythonhosted.org/packages/4e/85/52757e2bcefb4a8c5539ed9bb6a4877ac4d0434414ed9820ed5b33ade44c/hybracter-0.11.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ba92c7cb23da78df7ab9026da77f6d41cb2673b224da73cc38b6633174530dc1",
                "md5": "fb8cd32dc260e02c58656aa26e8d85eb",
                "sha256": "242342f8d58b928a8462c652630289213fc422bdb3fe39fe0a716a7d763d456a"
            },
            "downloads": -1,
            "filename": "hybracter-0.11.1.tar.gz",
            "has_sig": false,
            "md5_digest": "fb8cd32dc260e02c58656aa26e8d85eb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 16168133,
            "upload_time": "2025-01-21T00:10:54",
            "upload_time_iso_8601": "2025-01-21T00:10:54.732825Z",
            "url": "https://files.pythonhosted.org/packages/ba/92/c7cb23da78df7ab9026da77f6d41cb2673b224da73cc38b6633174530dc1/hybracter-0.11.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-21 00:10:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "gbouras13",
    "github_project": "hybracter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "hybracter"
}
        