dnaapler

Name	dnaapler
Version	1.0.1
home_page	https://github.com/gbouras13/dnaapler
Summary	Reorients assembled microbial sequences
upload_time	2024-11-22 04:07:20
maintainer	None
docs_url	None
author	George Bouras
requires_python	<4.0,>=3.8
license	MIT
keywords	microbial bioinformatics

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gbouras13/dnaapler/blob/master/run_dnaapler.ipynb)

[![DOI](https://joss.theoj.org/papers/10.21105/joss.05968/status.svg)](https://doi.org/10.21105/joss.05968)

[![CI](https://github.com/gbouras13/dnaapler/actions/workflows/ci.yaml/badge.svg)](https://github.com/gbouras13/dnaapler/actions/workflows/ci.yaml)
[![codecov](https://codecov.io/gh/gbouras13/dnaapler/branch/main/graph/badge.svg?token=4B1T2PGM9V)](https://codecov.io/gh/gbouras13/dnaapler)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/550095292.svg)](https://zenodo.org/doi/10.5281/zenodo.10039420)

[![Anaconda-Server Badge](https://anaconda.org/bioconda/dnaapler/badges/version.svg)](https://anaconda.org/bioconda/dnaapler)
[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/dnaapler)](https://img.shields.io/conda/dn/bioconda/dnaapler)
[![PyPI version](https://badge.fury.io/py/dnaapler.svg)](https://badge.fury.io/py/dnaapler)
[![Downloads](https://static.pepy.tech/badge/dnaapler)](https://pepy.tech/project/dnaapler)


# dnaapler

Dnaapler is a simple tool that reorients complete circular microbial genomes.

## Quick Start

```
# creates empty conda environment
conda create -n dnaapler_env

# activates conda environment
conda activate dnaapler_env

# installs dnaapler
conda install -c bioconda dnaapler

# runs dnaapler all 
dnaapler all -i input_mixed_contigs.fasta -o output_directory_path -p my_bacteria_name -t 8

# runs dnaapler chromosome
dnaapler chromosome -i input_chromosome.fasta -o output_directory_path -p my_bacteria_name -t 8

```

## Paper

Dnaapler has been published in JOSS [here](https://joss.theoj.org/papers/10.21105/joss.05968). If you use Dnaapler in your work, please cite it as follows:

```

George Bouras, Susanna R. Grigson, Bhavya Papudeshi, Vijini Mallawaarachchi, Michael J. Roach (2024). Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software, 9(93), 5968, https://doi.org/10.21105/joss.05968

```

Additionally, please consider citing the dependencies where relevant:

```
Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi: 10.1016/S0022-2836(05)80360-2. PMID: 2231712.

Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017 Nov;35(11):1026-1028. doi: 10.1038/nbt.3988.

Larralde, M., (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296, https://doi.org/10.21105/joss.04296.

Hyatt, D., Chen, GL., LoCascio, P.F. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010). https://doi.org/10.1186/1471-2105-11-119.
```

## v1

* **BREAKING CHANGE** - `dnaapler` now uses `MMseqs2 v13.45111` rather than `BLAST`. You will need to install [MMseqs2](https://github.com/soedinglab/MMseqs2) if you upgrade (if you use conda, it should be handled for you). The CLI is identical.
* There are 2 reasons for this:
    1. Users reported problems installing BLAST on macOS with Apple Silicon (see e.g. [here](https://github.com/gbouras13/pharokka/issues/368)). MMseqs2 works on all platforms and is diligently maintained.
    2. MMseqs2 is much faster than BLAST (what took BLAST a few minutes takes MMseqs2 seconds). We probably should have written `dnaapler` with `MMseqs2` to begin with. `MMseqs2 v13.45111` was chosen to ensure interoperability with [pharokka](https://github.com/gbouras13/pharokka).
* The alignment results may not be identical to `dnaapler v0.8.1` (i.e. they might find different top hits), but the actual reorientation is likely to be identical (at least in my tests). Please reach out or make an issue if you notice any discrepancies.

For example - on my machine (Ubuntu 20.04, Intel i9 13th gen 13900 CPU with 32 threads), for a _Staphylococcus aureus_ genome with 1 small plasmid, `dnaapler -i staph.fasta -o staph_dnaapler -t 8`  took ~129 seconds wallclock with `v0.8.1` using `BLAST`, while it took ~3 seconds wallclock with `v1.0.0` using `MMseqs2`.


# Google Colab Notebooks

If you don't want to install `dnaapler` locally, you can run `dnaapler all` without any code using the [Google Colab notebook](https://colab.research.google.com/github/gbouras13/dnaapler/blob/master/run_dnaapler.ipynb).

## Table of Contents
- [dnaapler](#dnaapler)
  - [Quick Start](#quick-start)
  - [Paper](#paper)
  - [v1](#v1)
- [Google Colab Notebooks](#google-colab-notebooks)
  - [Table of Contents](#table-of-contents)
  - [Description](#description)
  - [Documentation](#documentation)
  - [Commands](#commands)
  - [Installation](#installation)
    - [Conda](#conda)
    - [Pip](#pip)
  - [Usage](#usage)
  - [Example Usage](#example-usage)
  - [Databases](#databases)
  - [Motivation](#motivation)
  - [Contributing](#contributing)
  - [Acknowledgements](#acknowledgements)

## Description

<p align="center">
  <img src="paper/Dnaapler_figure.png" alt="Dnaapler Figure">
</p>

`dnaapler` is a simple Python program that takes a single nucleotide input sequence (in FASTA format), finds the desired start gene using `MMseqs2` against an amino acid sequence database, checks that the start codon of this gene is found and, if so, reorients the chromosome to begin with this gene on the forward strand.
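
The reorientation step described above can be sketched in a few lines of Python. This is a minimal illustration, not `dnaapler`'s actual implementation: the function names and the coordinate convention (a 0-based index of the gene's first base, given in forward-strand coordinates) are assumptions made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def reorient(seq: str, gene_start: int, strand: int) -> str:
    """Rotate a circular sequence so it begins with a gene on the forward strand.

    gene_start: 0-based forward-strand index of the first base of the gene's
                start codon (hypothetical convention for this sketch).
    strand:     +1 if the gene is on the forward strand, -1 otherwise.
    """
    if strand == -1:
        # Flip the whole sequence so the gene reads forwards, then map the
        # gene's first base into the flipped coordinate system.
        seq = reverse_complement(seq)
        gene_start = len(seq) - 1 - gene_start
    # Cut the circle at the gene start and rejoin the two pieces.
    return seq[gene_start:] + seq[:gene_start]


# Forward-strand gene starting at index 3.
print(reorient("GGGATGAAA", 3, 1))
# The same gene encoded on the reverse strand of the flipped sequence.
print(reorient("TTTCATCCC", 5, -1))
```

Both calls yield a sequence that begins with the `ATG` of the toy gene, which is the property the reoriented FASTA is meant to have.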

It was originally designed to replicate the reorientation functionality of [Unicycler](https://github.com/rrwick/Unicycler/blob/main/unicycler/gene_data/repA.fasta) with dnaA, but for long-read-first assembled chromosomes. We have extended it to work with plasmids (`dnaapler plasmid`) and phages (`dnaapler phage`), or for any input FASTA desired with `dnaapler custom`, `dnaapler mystery` or `dnaapler nearest`.

For bacterial chromosomes, `dnaapler chromosome` should ensure the chromosome breakpoint never interrupts genes or mobile genetic elements like prophages. It is intended to be used with good-quality completed bacterial genomes, generated with methods such as [Trycycler](https://github.com/rrwick/Trycycler/wiki), [Dragonflye](https://github.com/rpetit3/dragonflye) or my own pipeline [hybracter](https://github.com/gbouras13/hybracter).

You can also reorient multiple bacterial chromosomes/plasmids/phages at once using the `dnaapler bulk` subcommand.

If your input FASTA is mixed (e.g. has chromosome and plasmids), you can also use `dnaapler all`, with the option to ignore some contigs with the `--ignore` parameter.

## Documentation

The full documentation for `dnaapler` can be found [here](https://dnaapler.readthedocs.io).

## Commands

* `dnaapler all`: Reorients 1 or more contigs to begin with any of dnaA, terL, repA or COG1474. 
  - Practically, this should be the most useful command for most users.

* `dnaapler chromosome`: Reorients your sequence to begin with the dnaA chromosomal replication initiator gene
* `dnaapler plasmid`: Reorients your sequence to begin with the repA plasmid replication initiation gene
* `dnaapler phage`: Reorients your sequence to begin with the terL large terminase subunit gene
* `dnaapler archaea`: Reorients your sequence to begin with the [COG1474 archaeal Orc1/cdc6 gene](https://www.ncbi.nlm.nih.gov/research/cog/cog/COG1474/).
* `dnaapler custom`: Reorients your sequence to begin with a custom amino acid FASTA format gene that you specify
* `dnaapler mystery`: Reorients your sequence to begin with a random CDS
* `dnaapler largest`: Reorients your sequence to begin with the largest CDS
* `dnaapler nearest`: Reorients your sequence to begin with the first CDS (nearest to the start). Designed for fixing sequences where a CDS spans the breakpoint.
* `dnaapler bulk`: Reorients multiple contigs to begin with the desired start gene - either dnaA, terL, repA or a custom gene.
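
To make the difference between `mystery`, `largest` and `nearest` concrete, here is a rough Python sketch of the three selection rules applied to a list of predicted CDS. The `CDS` tuple and the function names are hypothetical and only illustrate the idea; they are not taken from the `dnaapler` code base.

```python
import random
from typing import List, NamedTuple


class CDS(NamedTuple):
    start: int   # 0-based start on the forward strand
    end: int     # exclusive end
    strand: int  # +1 or -1


def pick_largest(cds_list: List[CDS]) -> CDS:
    """Choose the longest CDS (the idea behind `largest`)."""
    return max(cds_list, key=lambda cds: cds.end - cds.start)


def pick_nearest(cds_list: List[CDS]) -> CDS:
    """Choose the CDS closest to the start of the sequence (`nearest`)."""
    return min(cds_list, key=lambda cds: cds.start)


def pick_random(cds_list: List[CDS], seed: int = 13) -> CDS:
    """Choose a random CDS (`mystery`); a fixed seed makes it repeatable."""
    return random.Random(seed).choice(cds_list)


genes = [CDS(120, 420, 1), CDS(500, 2300, -1), CDS(2400, 2700, 1)]
print(pick_largest(genes))
print(pick_nearest(genes))
print(pick_random(genes))
```

Seeding a private `random.Random` instance rather than the global generator keeps the random choice reproducible without affecting other code.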


## Installation

`dnaapler` requires only `MMseqs2 v13.45111` as an external dependency. 

Installation from conda is highly recommended as this will install `MMseqs2` automatically.

### Conda

`dnaapler` is available on bioconda.

```
conda install -c bioconda dnaapler
```

### Pip

You can also install `dnaapler` with pip.

```
pip install dnaapler
```

* If you install `dnaapler` with pip, you will need to install `MMseqs2 v13.45111` separately. It must be available in the `$PATH`, or else `dnaapler` will not work.
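
After a pip install, a quick way to confirm that MMseqs2 is visible on the `$PATH` is a small Python check like the one below. It is only a convenience sketch: it assumes the MMseqs2 executable is named `mmseqs`, and it does not verify the installed version.

```python
import shutil
from typing import Optional


def find_executable(name: str = "mmseqs") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("mmseqs was not found on PATH; install MMseqs2 before running dnaapler")
else:
    print(f"mmseqs found at {path}")
```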


## Usage

```
Usage: dnaapler [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  all         Reorients contigs to begin with any of dnaA, repA...
  archaea     Reorients your genome to begin with the archaeal COG1474...
  bulk        Reorients multiple genomes to begin with the same gene
  chromosome  Reorients your genome to begin with the dnaA chromosomal...
  citation    Print the citation(s) for this tool
  custom      Reorients your genome with a custom database
  largest     Reorients your genome the begin with the largest CDS as...
  mystery     Reorients your genome with a random CDS
  nearest     Reorients your genome the begin with the first CDS as...
  phage       Reorients your genome to begin with the terL large...
  plasmid     Reorients your genome to begin with the repA replication...
  ```

  ```
Usage: dnaapler all [OPTIONS]

  Reorients contigs to begin with any of dnaA, repA, terL or archaeal COG1474 Orc1/cdc6

Options:
  -h, --help               Show this message and exit.
  -V, --version            Show the version and exit.
  -i, --input PATH         Path to input file in FASTA format  [required]
  -o, --output PATH        Output directory   [default: output.dnaapler]
  -t, --threads INTEGER    Number of threads to use with MMseqs2  [default: 1]
  -p, --prefix TEXT        Prefix for output files  [default: dnaapler]
  -f, --force              Force overwrites the output directory
  -e, --evalue TEXT        e value for MMseqs2  [default: 1e-10]
  --ignore PATH            Text file listing contigs (one per row) that are to
                           be ignored
  -a, --autocomplete TEXT  Choose an option to autocomplete reorientation if
                           MMseqs2 based approach fails. Must be one of: none,
                           mystery, largest, or nearest [default: none]
  --seed_value INTEGER     Random seed to ensure reproducibility.  [default:
                           13]
  ```

The reoriented output FASTA will be `{prefix}_reoriented.fasta` in the specified output directory.
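
If you call `dnaapler` from a script and want to pick up the result afterwards, the output location can be built from the same two values. The helper below is a hypothetical convenience function, with defaults copied from the help text above.

```python
from pathlib import Path


def reoriented_fasta_path(output_dir: str = "output.dnaapler",
                          prefix: str = "dnaapler") -> Path:
    """Build the expected path of the reoriented FASTA file."""
    return Path(output_dir) / f"{prefix}_reoriented.fasta"


print(reoriented_fasta_path("output_directory_path", "my_bacteria_name"))
```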

## Example Usage

* For more detailed example usage, please see the [examples](https://dnaapler.readthedocs.io/en/latest/example/) section of the documentation. 

```
dnaapler all -i input.fasta -o output_directory_path -p my_genome_name --ignore list_of_contigs_to_ignore.txt
```
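
The file passed to `--ignore` is a plain text file with one contig per row. A minimal way to generate it from Python is sketched below. The contig names are made up, and the assumption that each row should match a contig identifier in the input FASTA is mine rather than something stated here.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_ignore_file(contig_ids: Iterable[str], path: Path) -> Path:
    """Write one contig ID per line, the layout described for --ignore."""
    path.write_text("".join(f"{contig_id}\n" for contig_id in contig_ids))
    return path


# Hypothetical contig names, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    ignore_file = write_ignore_file(
        ["plasmid_1", "plasmid_2"], Path(tmp) / "list_of_contigs_to_ignore.txt"
    )
    print(ignore_file.read_text(), end="")
```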

```
dnaapler chromosome -i input.fasta -o output_directory_path -p my_bacteria_name -t 8
```

```
dnaapler phage -i input.fasta -o output_directory_path -p my_phage_name -t 8
```

```
dnaapler plasmid -i input.fasta -o output_directory_path -p my_plasmid_name -t 8
```

```
dnaapler archaea -i input.fasta -o output_directory_path -p my_archaea_name -t 8
```

```
dnaapler custom -i input.fasta -o output_directory_path -p my_genome_name -t 8 -c my_custom_database_file
```

```
dnaapler mystery -i input.fasta -o output_directory_path -p my_genome_name
```

```
dnaapler nearest -i input.fasta -o output_directory_path -p my_genome_name
```

```
dnaapler largest -i input.fasta -o output_directory_path -p my_genome_name
```

```
# to reorient multiple bacterial chromosomes
dnaapler bulk -i input_file_with_multiple_chromosomes.fasta -m chromosome -o output_directory_path -p my_genome_name 
```

## Databases

`dnaapler chromosome` uses 584 proteins downloaded from Swiss-Prot with the query "Chromosomal replication initiator protein DnaA" on 24 May 2023 as its database for dnaA. All hits from the query were also filtered to ensure "GN=dnaA" was included in the header of the FASTA entry.
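
The header filter described above amounts to a simple substring test on each FASTA header. The sketch below shows one way to do it in plain Python. It is not the script that was used to build the database, and the two example records are invented.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_dnaa(records: Iterator[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only records whose header contains the gene name tag GN=dnaA."""
    return [(header, seq) for header, seq in records if "GN=dnaA" in header]


# Invented records, for illustration only.
example = """>entry_1 Chromosomal replication initiator protein DnaA GN=dnaA
MSLSLWQQ
CLARLQDE
>entry_2 Some other protein GN=dnaN
MKFTVERE
"""
for header, seq in keep_dnaa(parse_fasta(example)):
    print(header, len(seq))
```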

`dnaapler plasmid` uses the repA database curated by Ryan Wick in [Unicycler](https://github.com/rrwick/Unicycler/blob/main/unicycler/gene_data/repA.fasta).

`dnaapler phage` uses a terL database curated using [PHROGs](https://phrogs.lmge.uca.fr). All the AA sequences of the 55 PHROGs annotated as 'large terminase subunit' were downloaded, combined and deduplicated using [seqkit](https://github.com/shenwei356/seqkit) with `seqkit rmdup -s -o terL.faa phrog_terL.faa`.
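
For readers unfamiliar with `seqkit rmdup -s`, the effect is to drop records whose sequence has already been seen. A rough pure-Python equivalent is sketched below, under the assumptions that the first occurrence is kept and that the comparison is exact (case-sensitive). The record names are invented.

```python
from typing import Iterable, List, Tuple


def dedup_by_sequence(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record for each distinct sequence, preserving order."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


# Invented records: the second one repeats the first sequence.
records = [
    ("terL_a", "MKTLLDEA"),
    ("terL_b", "MKTLLDEA"),
    ("terL_c", "MSNWQRRK"),
]
print(dedup_by_sequence(records))
```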

`dnaapler archaea` uses a database of 403 archaeal COG1474 Orc1/cdc6 genes curated from [here](https://ftp.ncbi.nlm.nih.gov/pub/wolf/COGs/arCOG/).

`dnaapler all` uses all four databases combined into one. 

`dnaapler custom` uses a custom amino acid FASTA format file that you specify using `-c`. 

The matching is strict - it requires a strong MMseqs2 match (default e-value 1E-10), and the first amino acid of the MMseqs2 hit gene must be Methionine, Valine or Leucine, the amino acids corresponding to the 3 most commonly used start codons in bacteria/phages (ATG, GTG and TTG).
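
Expressed as code, the two acceptance criteria look roughly like this. The function is a hypothetical sketch rather than `dnaapler`'s own logic. In particular, treating the threshold as inclusive (`<=`) is an assumption.

```python
VALID_FIRST_RESIDUES = {"M", "V", "L"}  # Methionine, Valine, Leucine


def accept_hit(evalue: float, protein: str, max_evalue: float = 1e-10) -> bool:
    """Return True if a hit passes both the e-value and start-residue checks."""
    strong_match = evalue <= max_evalue
    valid_start = bool(protein) and protein[0] in VALID_FIRST_RESIDUES
    return strong_match and valid_start


print(accept_hit(1e-50, "MSLSLWQQ"))  # strong match, starts with M
print(accept_hit(1e-5, "MSLSLWQQ"))   # match too weak
print(accept_hit(1e-50, "ASLSLWQQ"))  # does not start with M, V or L
```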

For the most commonly studied microbes (ESKAPE pathogens, etc), the dnaA database should suffice.

If you try `dnaapler` on a more novel or under-studied microbe with a dnaA gene that has little sequence similarity to the database, you may need to provide your own dnaA gene(s) in amino acid FASTA format using `dnaapler custom`.

After this [issue](https://github.com/gbouras13/dnaapler/issues/1), `dnaapler mystery` was added. It predicts all ORFs in the input using [pyrodigal](https://github.com/althonos/pyrodigal), then picks a random gene to re-orient your sequence with.

## Motivation

1. I couldn't get [Circlator](https://sanger-pathogens.github.io/circlator/) to work and it is no longer supported.
2. [berokka](https://github.com/tseemann/berokka) doesn't orient chromosomes to begin with dnaA.
3. After reading Ryan Wick's masterful bacterial genome assembly [tutorial](https://github.com/rrwick/Perfect-bacterial-genome-tutorial/wiki), I realised that it is probably optimal to run 2 polishing steps, one before and one after rotating the chromosome, to ensure the breakpoint is polished. Further, for some "complete" long-read bacterial assemblies that didn't circularise properly, I figured that as long as you have a complete assembly (even if it is not marked as "circular" by Flye), polishing after a reorientation would be likely to circularise the chromosome. This is a bit like Ryan's [rotate_circular_gfa.py](https://github.com/rrwick/Perfect-bacterial-genome-tutorial/blob/main/scripts/rotate_circular_gfa.py) script, without the requirement of strict circularity.
4. While researching MGEs in _S. aureus_ whole genome sequences, I repeatedly found instances where MGEs were interrupted by the chromosome breakpoint. So I thought I'd add a tool to automate it in my pipeline. 
5. It's probably good to have all your sequences start at the same location for synteny analyses.

## Contributing

If you would like to help improve `dnaapler`, you are very welcome!

For changes to be accepted, they must pass the CI checks. 

Please see [CONTRIBUTING.md](CONTRIBUTING.md) for more details.

## Acknowledgements

Thanks to Torsten Seemann, Ryan Wick and the Circlator team for their existing work in the space. Also to [Michael Hall](https://github.com/mbhall88), whose repository [tbpore](https://github.com/mbhall88/tbpore) we took and adapted a lot of scaffolding code from because he writes really nice code.
