gecco-tool


| Field | Value |
| --- | --- |
| Name | gecco-tool |
| Version | 0.9.10 |
| Home page | https://gecco.embl.de |
| Summary | Gene cluster prediction with Conditional random fields. |
| Upload time | 2024-02-27 16:10:51 |
| Author | Martin Larralde |
| Requires Python | >=3.7 |
| License | GPLv3 |

<img align="right" width="180" height="180" src="https://raw.githubusercontent.com/zellerlab/GECCO/v0.6.2/static/gecco-square.png">

# Hi, I'm GECCO!

## 🦎 Overview

GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and
scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs)
in genomic and metagenomic data using Conditional Random Fields (CRFs).

[![Actions](https://img.shields.io/github/actions/workflow/status/zellerlab/GECCO/test.yml?branch=master&style=flat-square&maxAge=300)](https://github.com/zellerlab/GECCO/actions/workflows/test.yml)
[![License](https://img.shields.io/badge/license-GPLv3-blue.svg?style=flat-square&maxAge=2678400)](https://choosealicense.com/licenses/gpl-3.0/)
[![Coverage](https://img.shields.io/codecov/c/gh/zellerlab/GECCO?style=flat-square&maxAge=600)]( https://codecov.io/gh/zellerlab/GECCO/)
[![Docs](https://img.shields.io/badge/docs-gecco.embl.de-green.svg?maxAge=2678400&style=flat-square)](https://gecco.embl.de)
[![Source](https://img.shields.io/badge/source-GitHub-303030.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/)
[![Mirror](https://img.shields.io/badge/mirror-EMBL-009f4d?style=flat-square&maxAge=2678400)](https://git.embl.de/grp-zeller/GECCO/)
[![Changelog](https://img.shields.io/badge/keep%20a-changelog-8A0707.svg?maxAge=2678400&style=flat-square)](https://github.com/zellerlab/GECCO/blob/master/CHANGELOG.md)
[![Issues](https://img.shields.io/github/issues/zellerlab/GECCO.svg?style=flat-square&maxAge=600)](https://github.com/zellerlab/GECCO/issues)
[![Preprint](https://img.shields.io/badge/preprint-bioRxiv-darkblue?style=flat-square&maxAge=2678400)](https://www.biorxiv.org/content/10.1101/2021.05.03.442509v1)
[![PyPI](https://img.shields.io/pypi/v/gecco-tool.svg?style=flat-square&maxAge=3600)](https://pypi.python.org/pypi/gecco-tool)
[![Bioconda](https://img.shields.io/conda/vn/bioconda/gecco?style=flat-square&maxAge=3600)](https://anaconda.org/bioconda/gecco)
[![Galaxy](https://img.shields.io/badge/Galaxy-GECCO-darkblue?style=flat-square&maxAge=3600)](https://toolshed.g2.bx.psu.edu/repository?repository_id=c29bc911b3fc5f8c)
[![Versions](https://img.shields.io/pypi/pyversions/gecco-tool.svg?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-tool/#files)
[![Wheel](https://img.shields.io/pypi/wheel/gecco-tool?style=flat-square&maxAge=3600)](https://pypi.org/project/gecco-tool/#files)


## 🔧 Installing GECCO

GECCO is implemented in [Python](https://www.python.org/), and supports [all
versions](https://endoflife.date/python) from Python 3.7. It requires
additional libraries that can be installed directly from
[PyPI](https://pypi.org), the Python Package Index.

Use [`pip`](https://pip.pypa.io/en/stable/) to install GECCO on your
machine:
```console
$ pip install gecco-tool
```

If you'd rather use [Conda](https://conda.io), a package is available
in the [`bioconda`](https://bioconda.github.io/) channel. You can install
it with:
```console
$ conda install -c bioconda gecco
```

This will install GECCO, its dependencies, and the data needed to run
predictions. Around 40 MB of data must be downloaded, so installation
could take some time depending on your Internet connection. Once done,
you will have a `gecco` command available in your `$PATH`.

*Note that GECCO uses [HMMER3](http://hmmer.org/), which can only run
on PowerPC and recent x86-64 machines running a POSIX operating system.
Therefore, GECCO will work on Linux and macOS, but not on Windows.*


## 🧬 Running GECCO

Once `gecco` is installed, you can run it from the terminal by giving it a
FASTA or GenBank file with the genomic sequence you want to analyze, as
well as an output directory:

```console
$ gecco run --genome some_genome.fna -o some_output_dir
```
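
If you want to launch the same command from a Python script, for instance to
process several genomes in a loop, a thin wrapper around `subprocess` is
enough. The sketch below is not part of GECCO: the helper names
`build_gecco_command` and `run_gecco` are made up for this example, and it
only relies on the `run` subcommand and the `--genome` and `-o` options shown
above.

```python
import shutil
import subprocess


def build_gecco_command(genome, output_dir, extra_args=()):
    """Return the argument list for `gecco run` (hypothetical helper)."""
    command = ["gecco", "run", "--genome", str(genome), "-o", str(output_dir)]
    # Any extra command-line options are appended unchanged, as strings.
    command.extend(str(arg) for arg in extra_args)
    return command


def run_gecco(genome, output_dir, extra_args=()):
    """Run GECCO on one genome, raising if the command is missing or fails."""
    if shutil.which("gecco") is None:
        raise FileNotFoundError("the `gecco` command was not found in $PATH")
    command = build_gecco_command(genome, output_dir, extra_args)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(command, check=True)


print(build_gecco_command("some_genome.fna", "some_output_dir"))
```

Passing the arguments as a list avoids shell quoting problems with file names
that contain spaces.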

Additional parameters of interest are:

- `--jobs`, which controls the number of threads that will be spawned by
  GECCO whenever a step can be parallelized. The default, *0*, will
  autodetect the number of CPUs on the machine using
  [`os.cpu_count`](https://docs.python.org/3/library/os.html#os.cpu_count).
- `--cds`, controlling the minimum number of consecutive genes a BGC region
  must have to be detected by GECCO. The default is *3*.
- `--threshold`, controlling the minimum probability for a gene to be
  considered part of a BGC region. Using a lower number will increase the
  number (and possibly length) of predictions, but reduce accuracy. The
  default of *0.8* was selected to optimize precision/recall on a test set
  of 364 BGCs from [MIBiG 2.0](https://mibig.secondarymetabolites.org/).
- `--cds-feature`, which can be supplied a feature name to extract genes
  if the input file already contains gene annotations instead of predicting
  genes with [Pyrodigal](https://pyrodigal.readthedocs.io). A common value
  for records downloaded from GenBank is `--cds-feature CDS`.
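
To make the interplay between `--threshold` and `--cds` concrete, the
following sketch applies both filters to a list of per-gene probabilities.
It is a simplified illustration of the two parameters as described above, not
GECCO's actual implementation. The function name `candidate_regions` and the
example probabilities are made up, and treating a probability equal to the
threshold as passing is an assumption.

```python
def candidate_regions(probabilities, threshold=0.8, cds=3):
    """Return (start, end) gene index pairs for runs passing both filters.

    A run is a stretch of consecutive genes whose probability is at least
    `threshold`; it is kept only if it spans at least `cds` genes.
    `end` is exclusive.
    """
    regions = []
    start = None
    for index, probability in enumerate(probabilities):
        if probability >= threshold:
            if start is None:
                start = index  # a new run begins here
        elif start is not None:
            if index - start >= cds:
                regions.append((start, index))
            start = None  # the run ended on this gene
    # Close a run that extends to the last gene.
    if start is not None and len(probabilities) - start >= cds:
        regions.append((start, len(probabilities)))
    return regions


# Hypothetical per-gene probabilities for a contig with 11 genes.
probabilities = [0.1, 0.9, 0.95, 0.85, 0.2, 0.9, 0.9, 0.3, 0.6, 0.7, 0.9]
print(candidate_regions(probabilities))                 # [(1, 4)]
print(candidate_regions(probabilities, threshold=0.5))  # [(1, 4), (8, 11)]
```

With the lower threshold a second region appears, which matches the trade-off
described for `--threshold`: more predictions, each supported by weaker
per-gene evidence.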

## 🔎 Results

GECCO will create the following files:

- `{genome}.genes.tsv`: The *genes* file, containing the genes extracted
  or predicted from the input file, and per-gene BGC probabilities
  predicted by the CRF.
- `{genome}.features.tsv`: The *features* file, containing the identified
  domains in the input sequences, in tabular format.
- `{genome}.clusters.tsv`: If any were found, a *clusters* file, containing
  the coordinates of the predicted clusters along with their putative
  biosynthetic type, in tabular format.
- `{genome}_cluster_{N}.gbk`: If any were found, a GenBank file per cluster,
  containing the cluster sequence annotated with its member proteins and domains.
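
The naming scheme above makes it easy to collect the results of a run from a
script. The following sketch uses only the Python standard library. The helper
names `expected_outputs` and `load_table` are made up for this example, and it
assumes that `{genome}` is the name you pass in (typically the input file name
without its extension) and that each table starts with a header row. No column
names are assumed.

```python
import csv
from pathlib import Path


def expected_outputs(output_dir, genome):
    """Map each kind of output to the path(s) given by the naming scheme."""
    output_dir = Path(output_dir)
    return {
        "genes": output_dir / f"{genome}.genes.tsv",
        "features": output_dir / f"{genome}.features.tsv",
        "clusters": output_dir / f"{genome}.clusters.tsv",
        # One GenBank file per cluster; the list is empty if none were found.
        "genbank": sorted(output_dir.glob(f"{genome}_cluster_*.gbk")),
    }


def load_table(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


outputs = expected_outputs("some_output_dir", "some_genome")
# The clusters table only exists when at least one cluster was predicted.
if outputs["clusters"].exists():
    for row in load_table(outputs["clusters"]):
        print(row)
```

Checking for the clusters file before reading it keeps the script usable on
genomes where no cluster was predicted.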

GECCO can also convert results to other formats that may be more convenient
depending on the downstream usage. GECCO can convert results into:

- GFF3 format so they can be loaded into a genomic viewer 
  (`gecco convert clusters --format gff`).
- GenBank files with antiSMASH-style features so they can be loaded into 
  [BiG-SLiCE](https://github.com/medema-group/bigslice) for further analysis
  (`gecco convert gbk --format bigslice`).
- FASTA files with the sequences of all the predicted BGCs (`gecco convert gbk --format fna`)
  or with the sequences of all their proteins (`gecco convert gbk --format faa`).

To explore the predictions more visually, you can open the GenBank files
in genome editing software such as [UGENE](http://ugene.net/).
Alternatively, you can load the results into an antiSMASH report: check the
[Integrations](https://gecco.embl.de/integrations.html#antismash) page of the
documentation for a step-by-step guide.


## 🔖 Reference

GECCO can be cited using the following preprint:

> **Accurate de novo identification of biosynthetic gene clusters with GECCO**.
> Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller.
> bioRxiv 2021.05.03.442509; [doi:10.1101/2021.05.03.442509](https://doi.org/10.1101/2021.05.03.442509)


## 💭 Feedback

### ⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the [GitHub issue
tracker](https://github.com/zellerlab/GECCO/issues) if you need to report
or ask something. If you are filing a bug report, please include as much
information as you can about the issue, and try to recreate the same bug
in a simple, easily reproducible situation.

### 🏗️ Contributing

Contributions are more than welcome! See [`CONTRIBUTING.md`](https://github.com/zellerlab/GECCO/blob/master/CONTRIBUTING.md)
for more details.

## ⚖️ License

This software is provided under the [GNU General Public License v3.0 *or later*](https://choosealicense.com/licenses/gpl-3.0/).
GECCO is developed by the [Zeller Team](https://www.embl.de/research/units/scb/zeller/index.html)
at the [European Molecular Biology Laboratory](https://www.embl.de/) in Heidelberg.



            

Raw data

            {
    "_id": null,
    "home_page": "https://gecco.embl.de",
    "name": "gecco-tool",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "",
    "author": "Martin Larralde",
    "author_email": "martin.larralde@embl.de",
    "download_url": "https://files.pythonhosted.org/packages/f2/cf/60e9119dcd1350f62d019d7e591802aae57a989fb94943a027cc05a3263b/gecco-tool-0.9.10.tar.gz",
    "platform": "x86",
    "bugtrack_url": null,
    "license": "GPLv3",
    "summary": "Gene cluster prediction with Conditional random fields.",
    "version": "0.9.10",
    "project_urls": {
        "Bug Tracker": "https://github.com/zellerlab/GECCO/issues",
        "Builds": "https://git.embl.de/grp-zeller/GECCO/-/pipelines",
        "Changelog": "https://github.com/zellerlab/GECCO/blob/master/CHANGELOG.md",
        "Coverage": "https://codecov.io/gh/zellerlab/GECCO/",
        "Homepage": "https://gecco.embl.de",
        "Preprint": "https://www.biorxiv.org/content/10.1101/2021.05.03.442509v1",
        "Repository": "https://github.com/zellerlab/GECCO"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f660b1c3b0254ebb6420ddf489860b8d2d642c2fa0aae6e24e47d524e4ef039e",
                "md5": "e20f3e25727a99373728c78045d5a4d2",
                "sha256": "98d9f493fbbdfaa3ad5e724c0387a468494062439744d24ac0a93200f489856f"
            },
            "downloads": -1,
            "filename": "gecco_tool-0.9.10-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e20f3e25727a99373728c78045d5a4d2",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.7",
            "size": 42855566,
            "upload_time": "2024-02-27T16:10:48",
            "upload_time_iso_8601": "2024-02-27T16:10:48.075583Z",
            "url": "https://files.pythonhosted.org/packages/f6/60/b1c3b0254ebb6420ddf489860b8d2d642c2fa0aae6e24e47d524e4ef039e/gecco_tool-0.9.10-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f2cf60e9119dcd1350f62d019d7e591802aae57a989fb94943a027cc05a3263b",
                "md5": "9dae7643aad67b0aa0d326b68ef452da",
                "sha256": "6ab405587824228a2a2baa08ccb9e6df1f6df214fe6c1a531b778a613fb1e90d"
            },
            "downloads": -1,
            "filename": "gecco-tool-0.9.10.tar.gz",
            "has_sig": false,
            "md5_digest": "9dae7643aad67b0aa0d326b68ef452da",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 1930933,
            "upload_time": "2024-02-27T16:10:51",
            "upload_time_iso_8601": "2024-02-27T16:10:51.776401Z",
            "url": "https://files.pythonhosted.org/packages/f2/cf/60e9119dcd1350f62d019d7e591802aae57a989fb94943a027cc05a3263b/gecco-tool-0.9.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-27 16:10:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "zellerlab",
    "github_project": "GECCO",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "gecco-tool"
}
        