SuperPang


* Name: SuperPang
* Version: 1.3.0
* Summary: Non-redundant pangenome assemblies from multiple genomes or bins
* Upload time: 2024-03-18 11:34:41
* Requires Python: >=3.8
* License: BSD 3-Clause License. Copyright (c) 2021, Fernando Puente-Sánchez. All rights reserved.
* Keywords: bioinformatics, assembly, metagenomics, microbial-genomics, genomics
* Requirements: No requirements were recorded.

# SuperPang: non-redundant pangenome assemblies from multiple genomes or bins

**Check our paper:** Puente-Sánchez F, Hoetzinger M, Buck M and Bertilsson S. [Exploring intra-species diversity through non-redundant pangenome assemblies](https://onlinelibrary.wiley.com/doi/full/10.1111/1755-0998.13826) _Molecular Ecology Resources_ (2023) DOI: 10.1111/1755-0998.13826

_... but note that performance is now better (3x less memory usage, 20% faster execution time) than when we first benchmarked SuperPang._

## Installation
Requires [graph-tool](https://graph-tool.skewed.de/), [speedict](https://github.com/Congyuwang/RocksDict), [mOTUlizer v0.2.4](https://github.com/moritzbuck/mOTUlizer), [minimap2](https://github.com/lh3/minimap2) and [mappy](https://pypi.org/project/mappy/). The easiest way to get it running is using conda.
```
# Install into a new conda environment
conda create -n SuperPang -c conda-forge -c bioconda -c fpusan superpang
# Check that it works for you!
conda activate SuperPang
test-SuperPang.py
```

## Usage
`SuperPang.py --fasta <genome1.fasta> <genome2.fasta> <genomeN.fasta> --checkm <check_results> --output-dir <output_directory>`
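
When SuperPang is called from a pipeline, the usage line above can be assembled programmatically. The following is a minimal sketch, not part of SuperPang itself: the helper name `build_superpang_command` and the example file names are assumptions made for illustration, and only the flags shown in the usage line are used.

```python
import shlex


def build_superpang_command(fasta_paths, checkm_path, output_dir):
    """Return the argument list matching the usage line above.

    Hypothetical helper: it only arranges the flags, it does not run anything.
    """
    if not fasta_paths:
        raise ValueError("at least one input fasta file is required")
    return [
        "SuperPang.py",
        "--fasta", *fasta_paths,
        "--checkm", checkm_path,
        "--output-dir", output_dir,
    ]


# Example file names are placeholders.
cmd = build_superpang_command(
    ["genome1.fasta", "genome2.fasta"], "checkm_results.tsv", "pangenome_out"
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the conda environment where SuperPang is installed.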


## Input files and choice of parameters
- The input genomes can be genomes from isolates, MAGs (Metagenome-Assembled Genomes) or SAGs (Single-cell Assembled Genomes).
- The input genomes can differ in quality. For normal usage we recommend that you provide completeness estimates for each input genome through the `-q/--checkm` parameter.
- If you are certain that all your input genomes are complete, you can use the `--assume-complete` flag or manually tweak the `-a/--genome-assignment-threshold` and `-x/--default-completeness` parameters instead of providing a file with completeness estimates.
- The default parameter values in SuperPang assume that all of the input genomes come from the same species (ANI>=0.95). This can be controlled by setting `-i/--identity_threshold` and `-b/--bubble-identity-threshold` to the expected ANI. However, SuperPang has so far only been tested on species-level clusters.
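
If completeness estimates are already available from another source, they can be written as the tab-delimited `.tsv` file accepted by `-q/--checkm` (genome name, then percent completeness, with the file extension removed from the name). The sketch below is one possible way to do this; the function names and example values are hypothetical, and since this README does not state whether a header line is expected, none is written.

```python
from pathlib import Path


def completeness_rows(completeness_by_fasta):
    """Turn {fasta_path: percent_completeness} into (genome_name, percent) rows.

    The genome name is the file name without its last extension,
    e.g. "bins/genome1.fna" becomes "genome1".
    """
    rows = []
    for fasta_path, percent in completeness_by_fasta.items():
        if not 0 <= percent <= 100:
            raise ValueError(f"completeness out of range for {fasta_path}: {percent}")
        rows.append((Path(fasta_path).stem, percent))
    return rows


def write_completeness_tsv(rows, tsv_path):
    """Write one 'name<TAB>percent' line per genome (no header: an assumption)."""
    with open(tsv_path, "w") as handle:
        for name, percent in rows:
            handle.write(f"{name}\t{percent}\n")


# Made-up values for illustration only.
rows = completeness_rows({"bins/genome1.fna": 98.5, "bins/genome2.fna": 71.2})
print(rows)
```

Note that `Path.stem` strips only the last extension, so compressed inputs such as `genome1.fna.gz` would need extra handling.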


## Arguments
* *-f/--fasta*: Input fasta files with the sequences for each bin/genome, or a single file containing the path to one input fasta file per line.
* *-q/--checkm*: CheckM output for the bins. This can be the STDOUT of running checkm on all the fasta files passed in *--fasta*, or a tab-delimited file ending with a `.tsv` extension, in the form `genome1 percent_completeness`. Genome names should not contain the file extension (e.g. `.fna`). If empty, completeness will be estimated by [mOTUpan](https://www.biorxiv.org/content/10.1101/2021.06.25.449606v1), but this may lead to wrong estimates for very incomplete genomes.
* *-i/--identity_threshold*: Identity threshold (fraction) to initiate correction with minimap2. Values of 1 or higher will skip the correction step entirely. Default `0.95`.
* *-m/--mismatch-size-threshold*: Maximum contiguous mismatch size that will be corrected. Default `100`.
* *-g/--indel-size-threshold*: Maximum contiguous indel size that will be corrected. Default `100`.
* *-r/--correction-repeats*: Maximum iterations for sequence correction. Default `20`.
* *-n/--correction-repeats-min*: Minimum iterations for sequence correction. Default `5`.
* *-k/--ksize*: Kmer-size. Default `301`.
* *-l/--minlen*: Scaffold length cutoff. Default `0` (no cutoff).
* *-c/--mincov*: Scaffold coverage cutoff. Default `0` (no cutoff).
* *-b/--bubble-identity-threshold*: Minimum identity (matches / length) required to remove a bubble in the sequence graph. Default `0.95`.
* *-a/--genome-assignment-threshold*: Fraction of shared kmers required to assign a contig to an input genome (0 means a single shared kmer is enough) (DEPRECATED).
* *-x/--default-completeness*: Default genome completeness to assume if a CheckM output is not provided with *--checkm*. Default `70`.
* *-t/--threads*: Number of processors to use. Default `1`.
* *-o/--output*: Output directory. Default `output`.
* *-d/--temp-dir*: Directory for temp files. Default `tmp`.
* *-u/--header-prefix*: Prefix to be added to output sequence names. No prefix is added by default.
* *--assume-complete*: Assume that the input genomes are complete (*--default-completeness 99*).
* *--lowmem*: Use disk storage instead of memory when possible. This reduces memory usage at the cost of execution time.
* *--minimap2-path*: Path to the minimap2 executable. Default `minimap2`.
* *--keep-intermediate*: Keep intermediate files.
* *--keep-temporary*: Keep temporary files.
* *--verbose-mOTUpan*: Print out mOTUpan logs.
* *--nice-headers*: Removes semicolons from non-branching-path names.
* *--output-as-file-prefix*: Use the output dir name also as a prefix for output file names.
* *--force-overwrite*: Write results even if the output directory already exists.
* *--debug*: Run additional sanity checks (increases execution time).
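
With many input genomes, `-f/--fasta` can instead be given a single file listing one input fasta path per line. A minimal sketch of how such a list could be generated follows; the helper names and the set of recognized extensions are assumptions for illustration, not something SuperPang prescribes.

```python
from pathlib import Path


def list_fasta_files(directory, extensions=(".fasta", ".fna", ".fa")):
    """Return the sorted paths of files in `directory` with a fasta-like extension."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix in extensions
    )


def write_fasta_list(paths, list_path):
    """Write one fasta path per line, as described for -f/--fasta."""
    Path(list_path).write_text("".join(f"{path}\n" for path in paths))
```

For example, `write_fasta_list(list_fasta_files("bins"), "genomes.txt")` would produce a file that can be passed as `--fasta genomes.txt`; the directory and file names here are placeholders.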

## Output
* `assembly.fasta`: contigs.
* `assembly.info`: core/auxiliary and path information for each contig.
* `NBPs.fasta`: non-branching paths.
* `NBPs.core.fasta`: non-branching paths deemed to belong to the core genome of the species by [mOTUpan](https://www.biorxiv.org/content/10.1101/2021.06.25.449606v1).
* `NBPs.accessory.fasta`: non-branching paths deemed to belong to the accessory genome of the species.
* `NBP2origins.tsv`: tab-separated file with the non-branching path IDs, a comma-separated list of the input sequences in which that NBP was deemed present, a comma-separated list of the input genomes in which that NBP was deemed present, and the number of input genomes in which that NBP was deemed present.
* `graph.fastg`: assembly graph in a format compatible with [Bandage](https://rrwick.github.io/Bandage/).
* `graph.NBP2origins.csv`: file with a structure similar to `NBP2origins.tsv`, formatted for use with the "Load CSV file" option in Bandage. This allows using the information in the file as node labels in Bandage.
* `params.tsv`: parameters used in the run.
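
The `NBP2origins.tsv` layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: it expects exactly four tab-separated columns in the order given above and no header line, neither of which is guaranteed by this README, and the `NBPOrigin` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class NBPOrigin(NamedTuple):
    nbp_id: str           # non-branching path ID
    sequences: List[str]  # input sequences in which the NBP was deemed present
    genomes: List[str]    # input genomes in which the NBP was deemed present
    n_genomes: int        # number of input genomes containing the NBP


def parse_nbp2origins_line(line):
    """Parse one line of NBP2origins.tsv (assumes four tab-separated columns)."""
    nbp_id, sequences, genomes, n_genomes = line.rstrip("\n").split("\t")
    return NBPOrigin(nbp_id, sequences.split(","), genomes.split(","), int(n_genomes))


def read_nbp2origins(path):
    """Read a whole NBP2origins.tsv file, skipping blank lines."""
    with open(path) as handle:
        return [parse_nbp2origins_line(line) for line in handle if line.strip()]
```

With the records loaded, selecting NBPs found in at least a given number of input genomes becomes a simple filter on `n_genomes`.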

## About
*SuperPang* is developed by Fernando Puente-Sánchez (Sveriges lantbruksuniversitet). Feel free to open an issue or reach out for support at [fernando.puente.sanchez@slu.se](mailto:fernando.puente.sanchez@slu.se).

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "SuperPang",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "bioinformatics,assembly,metagenomics,microbial-genomics,genomics",
    "author": "",
    "author_email": "Fernando Puente-S\u00e1nchez <fernando.puente.sanchez@slu.se>",
    "download_url": "https://files.pythonhosted.org/packages/4d/c9/307ff96cc6740664c6efad693ff780f5d8f14f8a5cfcf19039f54b5e7c43/SuperPang-1.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD 3-Clause License  Copyright (c) 2021, Fernando Puente-S\u00e1nchez All rights reserved.  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ",
    "summary": "Non-redundant pangenome assemblies from multiple genomes or bins",
    "version": "1.3.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/fpusan/SuperPang/issues",
        "Homepage": "https://github.com/fpusan/SuperPang"
    },
    "split_keywords": [
        "bioinformatics",
        "assembly",
        "metagenomics",
        "microbial-genomics",
        "genomics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4dc9307ff96cc6740664c6efad693ff780f5d8f14f8a5cfcf19039f54b5e7c43",
                "md5": "3073c62dd36184d3ba7cf949fa1bd96e",
                "sha256": "9ae3b85b53af33329548530eb6847ad37af6b2446ab1886eade4d233ce48ff57"
            },
            "downloads": -1,
            "filename": "SuperPang-1.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3073c62dd36184d3ba7cf949fa1bd96e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 6159542,
            "upload_time": "2024-03-18T11:34:41",
            "upload_time_iso_8601": "2024-03-18T11:34:41.343945Z",
            "url": "https://files.pythonhosted.org/packages/4d/c9/307ff96cc6740664c6efad693ff780f5d8f14f8a5cfcf19039f54b5e7c43/SuperPang-1.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-18 11:34:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "fpusan",
    "github_project": "SuperPang",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "superpang"
}
        