# DeepBGC: Biosynthetic Gene Cluster detection and classification
DeepBGC detects BGCs in bacterial and fungal genomes using deep learning.
DeepBGC employs a Bidirectional Long Short-Term Memory Recurrent Neural Network
and a word2vec-like vector embedding of Pfam protein domains.
Product class and activity of detected BGCs are predicted using a Random Forest classifier.
[![BioConda Install](https://img.shields.io/conda/dn/bioconda/deepbgc.svg?style=flag&label=BioConda%20install&color=green)](https://anaconda.org/bioconda/deepbgc)
![PyPI - Downloads](https://img.shields.io/pypi/dm/deepbgc.svg?color=green&label=PyPI%20downloads)
[![PyPI license](https://img.shields.io/pypi/l/deepbgc.svg)](https://pypi.python.org/pypi/deepbgc/)
[![PyPI version](https://badge.fury.io/py/deepbgc.svg)](https://badge.fury.io/py/deepbgc)
[![CI](https://api.travis-ci.org/Merck/deepbgc.svg?branch=master)](https://travis-ci.org/Merck/deepbgc)
![DeepBGC architecture](images/deepbgc.architecture.png?raw=true "DeepBGC architecture")
## 📌 News 📌
- **DeepBGC 0.1.23**: Predicted BGCs can now be uploaded for visualization in **antiSMASH** using a JSON output file
  - Install and run DeepBGC as usual based on instructions below
  - Upload `antismash.json` from the DeepBGC output folder using "Upload extra annotations" on the [antiSMASH](https://antismash.secondarymetabolites.org/) page
  - Predicted BGC regions and their prediction scores will be displayed alongside antiSMASH BGCs
## Publications
A deep learning genome-mining strategy for biosynthetic gene cluster prediction <br>
Geoffrey D Hannigan, David Prihoda et al., Nucleic Acids Research, gkz654, https://doi.org/10.1093/nar/gkz654
## Install using conda (recommended)
You can install DeepBGC using [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/download.html)
or one of the alternatives ([Miniconda](https://docs.conda.io/en/latest/miniconda.html),
[Miniforge](https://github.com/conda-forge/miniforge)).
Set up Bioconda and Conda-Forge channels:
```bash
conda config --add channels bioconda
conda config --add channels conda-forge
```
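Optionally, you can also enable strict channel priority, as recommended by the general Bioconda setup documentation (a general Bioconda tip, not specific to DeepBGC):
```bash
conda config --set channel_priority strict
```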
Install DeepBGC using:
```bash
# Create a separate DeepBGC environment and install dependencies
conda create -n deepbgc python=3.7 hmmer prodigal
# Install DeepBGC into the environment using pip
conda activate deepbgc
pip install deepbgc
# Alternatively, install everything using conda (currently unstable due to conda conflicts)
conda install deepbgc
```
## Install dependencies manually (if conda is not available)
If you don't mind installing the HMMER and Prodigal dependencies manually, you can also install DeepBGC using pip:
- Install Python version 3.6 or 3.7 (Note: **Python 3.8 is not supported** due to Tensorflow < 2.0 dependency)
- Install Prodigal and put the `prodigal` binary on your PATH: https://github.com/hyattpd/Prodigal/releases
- Install HMMER and put the `hmmscan` and `hmmpress` binaries on your PATH: http://hmmer.org/download.html
- Run `pip install deepbgc` to install DeepBGC
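Once installed, you can do a quick sanity check that the dependency binaries are discoverable on your PATH:
```bash
# Both commands should print version/usage information if the tools are installed
prodigal -v
hmmscan -h
```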
## Use DeepBGC
### Download models and Pfam database
Before you can use DeepBGC, download the trained models and the Pfam database:
```bash
deepbgc download
```
You can display downloaded dependencies and models using:
```bash
deepbgc info
```
### Detection and classification
![DeepBGC pipeline](images/deepbgc.pipeline.png?raw=true "DeepBGC pipeline")
Detect and classify BGCs in a genomic sequence.
Proteins and Pfam domains are detected automatically if not already annotated (HMMER and Prodigal are needed).
```bash
# Show command help docs
deepbgc pipeline --help
# Detect and classify BGCs in mySequence.fa using DeepBGC detector.
deepbgc pipeline mySequence.fa
# Detect and classify BGCs in mySequence.fa using custom DeepBGC detector trained on your own data.
deepbgc pipeline --detector path/to/myDetector.pkl mySequence.fa
```
This will produce a `mySequence` directory with multiple output files, including a README.txt describing each file.
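For illustration, a typical output folder might look like the listing below; the exact file names are indicative and may vary between versions:
```bash
# Illustrative listing; see the generated README.txt for authoritative descriptions
ls mySequence/
# README.txt                  - descriptions of all output files
# mySequence.antismash.json   - BGC regions for antiSMASH "Upload extra annotations"
# mySequence.bgc.gbk          - GenBank records of detected BGC regions
# mySequence.bgc.tsv          - table of detected BGCs with scores and predicted classes
# mySequence.full.gbk         - input sequence with all annotations
# mySequence.pfam.tsv         - table of detected Pfam domains
```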
See [Train DeepBGC on your own data](#train-deepbgc-on-your-own-data) section below for more information about training a custom detector or classifier.
#### Example output
See the [DeepBGC Example Result Notebook](https://nbviewer.jupyter.org/urls/github.com/Merck/deepbgc/releases/download/v0.1.0/DeepBGC_Example_Result.ipynb).
Data can be downloaded from the [releases page](https://github.com/Merck/deepbgc/releases).
![Detected BGC Regions](images/deepbgc.bgc.png?raw=true "Detected BGC regions")
## Train DeepBGC on your own data
You can train your own BGC detection and classification models, see `deepbgc train --help` for documentation and examples.
Training and validation data can be found in [release 0.1.0](https://github.com/Merck/deepbgc/releases/tag/v0.1.0) and [release 0.1.5](https://github.com/Merck/deepbgc/releases/tag/v0.1.5). You will need:
- Positive (BGC) training data - In most cases, this is your own BGC training set, see "Preparing training data" section below
- Negative (Non-BGC) training data - Needed for BGC detection. You can use `GeneSwap_Negatives.pfam.tsv` from release https://github.com/Merck/deepbgc/releases/tag/v0.1.0
- Validation data - Needed for BGC detection. Contigs with annotated BGC and non-BGC regions. A working example can be downloaded from https://github.com/Merck/deepbgc/releases/tag/v0.1.5
- Trained Pfam2vec vectors - "Vocabulary" converting Pfam IDs to meaningful numeric vectors, you can reuse previously trained `pfam2vec.csv` results from https://github.com/Merck/deepbgc/releases/tag/v0.1.0
- JSON configuration files - See JSON section below
If you have any questions about using or training DeepBGC, feel free to submit an issue.
### Preparing training data
Training examples need to be provided in Pfam TSV format, which can be generated from your sequences
using `deepbgc prepare`.
First, you will need to manually add an `in_cluster` column that contains 0 for pfams outside a BGC
and 1 for pfams inside a BGC. We recommend preparing a separate negative TSV and positive TSV file,
where the column is all 0s or all 1s, respectively.
Finally, you will need to manually add a `sequence_id` column,
which identifies a continuous sequence of Pfams from a single sample (BGC or negative sequence).
The samples are shuffled during training to present the model with a random order of positive and negative samples.
Pfams with the same `sequence_id` value will be kept together. For example, if your training set contains multiple BGCs, the `sequence_id` column should contain the BGC ID.
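A minimal sketch of this step, assuming each TSV holds one sample and has a tab-separated header row (file names and sequence IDs below are illustrative):
```bash
# Append in_cluster=1 and a sequence_id to a positive (BGC) sample
awk 'BEGIN {FS=OFS="\t"}
     NR==1 {print $0, "in_cluster", "sequence_id"; next}
     {print $0, 1, "BGC0000123"}' myBGC.pfam.tsv > myBGC.prepared.tsv

# Append in_cluster=0 and a sequence_id to a negative sample
awk 'BEGIN {FS=OFS="\t"}
     NR==1 {print $0, "in_cluster", "sequence_id"; next}
     {print $0, 0, "negative_contig_1"}' myNegatives.pfam.tsv > myNegatives.prepared.tsv
```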
**! New in version 0.1.17 !** You can now prepare *protein* FASTA sequences into a Pfam TSV file using `deepbgc prepare --protein`.
### JSON model training template files
DeepBGC uses JSON template files to define model architecture and training parameters. All templates can be downloaded from [release 0.1.0](https://github.com/Merck/deepbgc/releases/tag/v0.1.0).
The JSON template for the DeepBGC LSTM **detector** with pfam2vec is structured as follows:
```
{
    "type": "KerasRNN", - Model architecture (KerasRNN/DiscreteHMM/GeneBorderHMM)
    "build_params": { - Parameters for the model architecture
        "batch_size": 16, - Number of training-data splits trained in parallel
        "hidden_size": 128, - Size of the vector storing the LSTM inner state
        "stateful": true - Remember the previous sequence when training the next batch
    },
    "fit_params": {
        "timesteps": 256, - Number of pfam2vec vectors trained in one batch
        "validation_size": 0, - Fraction of training data to use for validation (if validation data is not provided explicitly). Use 0.2 to hold out 20% of the data for validation.
        "verbose": 1, - Verbosity during training
        "num_epochs": 1000, - Number of passes over the training set. Use a lower number if not using early stopping on validation data.
        "early_stopping" : { - Stop training early once validation performance stops improving
            "monitor": "val_auc_roc", - Use validation AUC ROC to monitor performance
            "min_delta": 0.0001, - Minimum change in the monitored metric that counts as an improvement
            "patience": 20, - How many of the last epochs to check for improvement
            "mode": "max" - Stop training when the given metric stops increasing (use "min" for decreasing metrics like loss)
        },
        "shuffle": true, - Shuffle samples in each epoch. Uses the "sequence_id" field to group pfam vectors belonging to the same sample and shuffle them together
        "optimizer": "adam", - Optimizer algorithm
        "learning_rate": 0.0001, - Learning rate
        "weighted": true - Increase the weight of the less-represented class. Gives more weight to BGC training samples if the non-BGC set is larger.
    },
    "input_params": {
        "features": [ - Array of features to use in the model, see deepbgc/features.py
            {
                "type": "ProteinBorderTransformer" - Add two binary flags for pfam domains found at the beginning or end of a protein
            },
            {
                "type": "Pfam2VecTransformer", - Convert the pfam_id field to a pfam2vec vector using the provided pfam2vec table
                "vector_path": "#{PFAM2VEC}" - The PFAM2VEC variable is filled in using the command line argument --config
            }
        ]
    }
}
```
The JSON template for the Random Forest **classifier** is structured as follows:
```
{
    "type": "RandomForestClassifier", - Type of classifier (RandomForestClassifier)
    "build_params": {
        "n_estimators": 100, - Number of trees in random forest
        "random_state": 0 - Random seed used to get same result each time
    },
    "input_params": {
        "sequence_as_vector": true, - Convert each sample into a single vector
        "features": [
            {
                "type": "OneHotEncodingTransformer" - Convert each sequence of Pfams into a single binary vector (Pfam set)
            }
        ]
    }
}
```
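Putting it together, a detector training command might look like the sketch below. Only the `--config` substitution of the `PFAM2VEC` variable is documented in the template above; the remaining flag names and file names are illustrative assumptions, so consult `deepbgc train --help` for the exact interface of your version:
```bash
# Hedged sketch: train a custom detector from a JSON template, pfam2vec
# vectors and prepared positive/negative Pfam TSV files.
# Flag names other than --config are assumptions; see `deepbgc train --help`.
deepbgc train \
  --model deepbgc.json \
  --config PFAM2VEC pfam2vec.csv \
  --output myDetector.pkl \
  myBGCs.prepared.tsv GeneSwap_Negatives.pfam.tsv
```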
### Using your trained model
Since version `0.1.10`, you can provide a direct path to the detector or classifier model like so:
```bash
deepbgc pipeline \
mySequence.fa \
--detector path/to/myDetector.pkl \
--classifier path/to/myClassifier.pkl
```