pypolca


* Name: pypolca
* Version: 0.3.1
* Home page: https://github.com/gbouras13/pypolca
* Summary: Standalone Python implementation of the POLCA polisher from MaSuRCA
* Upload time: 2024-02-02 06:22:26
* Author: George Bouras
* Requires Python: >=3.7,<4.0
* License: MIT
* Keywords: microbial, bioinformatics

[![CI](https://github.com/gbouras13/pypolca/actions/workflows/ci.yaml/badge.svg)](https://github.com/gbouras13/pypolca/actions/workflows/ci.yaml)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/700722839.svg)](https://zenodo.org/badge/latestdoi/700722839)

[![Anaconda-Server Badge](https://anaconda.org/bioconda/pypolca/badges/version.svg)](https://anaconda.org/bioconda/pypolca)
[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/pypolca)](https://img.shields.io/conda/dn/bioconda/pypolca)
[![PyPI version](https://badge.fury.io/py/pypolca.svg)](https://badge.fury.io/py/pypolca)
[![Downloads](https://static.pepy.tech/badge/pypolca)](https://pepy.tech/project/pypolca)


# pypolca

`pypolca` is a standalone Python reimplementation of the POLCA polisher from the [MaSuRCA genome assembly and analysis toolkit](https://github.com/alekseyzimin/masurca).

## Quick Start

```
# creates conda environment with pypolca 
conda create -n pypolca_env pypolca

# activates conda environment
conda activate pypolca_env

# runs pypolca
pypolca run -a <genome> -1 <R1 short reads file> -2 <R2 short reads file> -t <threads> -o <output directory> 
```

## Table of Contents
- [pypolca](#pypolca)
  - [Quick Start](#quick-start)
  - [Table of Contents](#table-of-contents)
  - [Description](#description)
    - [Note of Caution for Large (e.g. Eukaryotic) Genomes](#note-of-caution-for-large-eg-eukaryotic-genomes)
  - [Installation](#installation)
    - [Conda](#conda)
    - [Pip](#pip)
    - [Source](#source)
  - [Usage](#usage)
- [Benchmarking](#benchmarking)
- [Citation](#citation)

## Description

`pypolca` is a Python reimplementation of the POLCA polisher from the [MaSuRCA genome assembly and analysis toolkit](https://github.com/alekseyzimin/masurca) that was made for inclusion in the hybrid bacterial genome assembly tool [hybracter](https://github.com/gbouras13/hybracter).

It was written for a number of reasons:

* MaSuRCA is only available on Linux, not on macOS.
* The original `polca.sh` script from MaSuRCA was difficult to use because you could not specify an output directory. Additionally, due to its shell implementation, both FASTQ read files needed to be input together as a single string.
* To use `polca.sh`, you need to install the entire MaSuRCA assembly toolkit.
* POLCA is recommended for long-read-first bacterial assembly polishing (see [this paper](https://doi.org/10.1371/journal.pcbi.1010905)) and I wanted to include it for macOS and Linux in my assembly tool [hybracter](https://github.com/gbouras13/hybracter).

Note: I neither guarantee nor desire that `pypolca` will give identical results to POLCA as implemented in MaSuRCA. This is because different versions of [freebayes](https://github.com/freebayes/freebayes) and Samtools might be used as dependencies.

In testing, `pypolca` v0.2.0 (running freebayes v1.3.6 and Samtools v1.18) gave results that were extremely similar, but not identical, to those of POLCA (running freebayes v1.3.1-dirty and Samtools v0.1.20). Please see [benchmarking](benchmarking.md) for more details.

I have decided to use the newest possible versions of freebayes and Samtools rather than the versions installed with MaSuRCA, for ease of maintenance and particularly because the Samtools version bundled with MaSuRCA is a major version behind and the CLI has since changed.

### Note of Caution for Large (e.g. Eukaryotic) Genomes

* I have implemented `pypolca` predominantly for the use case of polishing long-read bacterial genome assemblies with short reads. Therefore, I decided not to implement the batched multiprocessing of freebayes included in POLCA, because it was a lot of work for no benefit for most bacterial genomes.
* However, this is certainly not true for larger genomes such as those of eukaryotic organisms. `pypolca` is likely to be a lot slower than POLCA for such genomes if you run both with more than 1 thread.
* I do not intend to implement multiprocessing, but if someone wants to, feel free to make a PR.

## Installation

Installation with conda is recommended, as this will install all non-Python dependencies.

### Conda

`pypolca` is available on bioconda.

```
conda install -c bioconda pypolca
```

### Pip

You can also install the Python components of `pypolca` with pip.

```
pip install pypolca
```

### Source

Alternatively, the development version of `pypolca` can be installed manually from GitHub.

```
git clone https://github.com/gbouras13/pypolca.git
cd pypolca
pip install -e .
pypolca -h
```

If you have installed `pypolca` with pip or from source, you will then need to install the external dependencies separately. These are listed in `build/environment.yaml`:

* [bwa](https://github.com/lh3/bwa) >=0.7.17
* [Samtools](https://github.com/samtools/samtools) >=1.18
* [freebayes](https://github.com/freebayes/freebayes) >=1.3.1,<1.3.7
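
After a pip or source install, it can be useful to confirm that the three external tools are actually reachable before running `pypolca`. The following is a minimal sketch, not part of `pypolca` itself: the helper name `missing_tools` is hypothetical, and it assumes the executables are named `bwa`, `samtools` and `freebayes`. It only checks that they are on `PATH` and does not check the version constraints listed above.

```python
import shutil

# Executable names are an assumption: they are taken to match the tool names above.
REQUIRED_TOOLS = ("bwa", "samtools", "freebayes")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external dependencies:", ", ".join(missing))
    else:
        print("bwa, samtools and freebayes were all found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is what you want.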

## Usage

```
Usage: pypolca [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  citation  Print the citations for polca
  run       Python implementation of the POLCA polisher from MaSuRCA
```

```
Usage: pypolca run [OPTIONS]

  Python implementation of the POLCA polisher from MaSuRCA

Options:
  -h, --help               Show this message and exit.
  -V, --version            Show the version and exit.
  -a, --assembly PATH      Path to assembly contigs or scaffolds.  [required]
  -1, --reads1 PATH        Path to polishing reads R1 FASTQ. Can be FASTQ or
                           FASTQ gzipped. Required.  [required]
  -2, --reads2 PATH        Path to polishing reads R2 FASTQ. Can be FASTQ or
                           FASTQ gzipped. Optional. Only use -1 if you have
                           single end reads.
  -t, --threads INTEGER    Number of threads.  [default: 1]
  -o, --output PATH        Output directory path  [default: output_polca]
  -f, --force              Force overwrites the output directory
  --min_alt INTEGER        Minimum alt allele count to make a change
                           [default: 2]
  --min_ratio FLOAT        Minimum alt allele to ref allele ratio to make a
                           change  [default: 2.0]
  --careful                Equivalent to --min_alt 4 --min_ratio 3
  -n, --no_polish          do not polish, just create vcf file, evaluate the
                           assembly and exit
  -m, --memory_limit TEXT  Memory per thread to use in samtools sort, set to
                           2G or more for large genomes  [default: 2G]
  -p, --prefix TEXT        prefix for output files  [default: polca]
```
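
If you want to call `pypolca` from a Python script or pipeline, the options above can be assembled into an argument list and passed to `subprocess`. This is a rough sketch under a few assumptions: `build_pypolca_command` and `run_pypolca` are hypothetical helper names, not part of the `pypolca` package, and `pypolca` is assumed to be on `PATH` and to exit with a non-zero status on failure. Only flags shown in the help text above are used.

```python
import subprocess


def build_pypolca_command(assembly, reads1, reads2=None, threads=1,
                          output="output_polca", prefix="polca",
                          careful=False, force=False):
    """Build the argument list for `pypolca run` (hypothetical helper)."""
    cmd = ["pypolca", "run", "-a", str(assembly), "-1", str(reads1)]
    if reads2 is not None:
        # Paired-end reads: pass R2 as well. Omit -2 for single-end reads.
        cmd += ["-2", str(reads2)]
    cmd += ["-t", str(threads), "-o", str(output), "-p", prefix]
    if careful:
        # Per the help text, equivalent to --min_alt 4 --min_ratio 3.
        cmd.append("--careful")
    if force:
        # Overwrites an existing output directory.
        cmd.append("--force")
    return cmd


def run_pypolca(**kwargs):
    """Run pypolca; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_pypolca_command(**kwargs), check=True)
```

For example, `run_pypolca(assembly="assembly.fasta", reads1="R1.fastq.gz", reads2="R2.fastq.gz", threads=8)` would polish with paired-end reads, while leaving out `reads2` gives the single-end form. The file names here are placeholders.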

The polished output FASTA will be `{prefix}_corrected.fasta` in the specified output directory, and the POLCA report will be the text file `{prefix}.report`.
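
For downstream scripts, the output locations can be derived from the output directory and prefix. A small sketch, assuming the report is written to the same output directory as the FASTA; `expected_outputs` is a hypothetical helper, and its defaults mirror the `-o` and `-p` defaults from the help text.

```python
from pathlib import Path


def expected_outputs(output_dir="output_polca", prefix="polca"):
    """Return the expected paths of the polished FASTA and the POLCA report."""
    out = Path(output_dir)
    return {
        "fasta": out / f"{prefix}_corrected.fasta",
        "report": out / f"{prefix}.report",
    }
```

Checking `expected_outputs()["fasta"].exists()` after a run is one simple way to confirm that polishing finished. With `-n`/`--no_polish`, the help text says no polishing is done, so the corrected FASTA should not be expected.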

# Benchmarking

Please see [benchmarking](benchmarking.md) for more details. In that comparison, `pypolca` v0.2.0 gave results that were extremely similar, but not identical, to those of POLCA.


# Citation

Please cite `pypolca` in your paper using:

Bouras G, Wick RR (2023) pypolca: Standalone Python reimplementation of the genome polishing tool POLCA. https://github.com/gbouras13/pypolca. 

Zimin AV, Salzberg SL (2020) The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput Biol 16(6): e1007981. https://doi.org/10.1371/journal.pcbi.1007981.



            
