sequence-unet


* Name: sequence-unet
* Version: 1.0.3
* Summary: Make protein predictions with Sequence UNET and train new models
* Upload time: 2023-07-28 14:43:15
* Author: Pedro Beltrao, Mohammed AlQuraishi
* Requires Python: >=3.6
* Keywords: deep learning, neural network, protein, bioinformatics, variant effect prediction
# Sequence UNET 1.0.3
<!-- badges: start -->
[![DOI](https://zenodo.org/badge/370484533.svg)](https://zenodo.org/badge/latestdoi/370484533)
[![Documentation Status](https://readthedocs.org/projects/sequence-unet/badge/?version=latest)](https://sequence-unet.readthedocs.io/en/latest/?badge=latest)
<!-- badges: end -->

Sequence UNET is a fully convolutional neural network variant effect predictor, able to predict the pathogenicity of protein coding variants and the frequency at which they occur across large multiple sequence alignments.
A description and discussion of the model is available on bioRxiv [(Dunham et al. 2022)](https://www.biorxiv.org/content/10.1101/2022.05.23.493038).
It uses a U-shaped architecture inspired by the U-NET medical image segmentation network [(Ronneberger et al. 2015)](http://arxiv.org/abs/1505.04597), with an optional Graph CNN section to incorporate information from protein structure:

![Sequence UNET model schematic](figures/model_schematic.png)

This repo contains a Python package for downloading and using the trained models, as well as the code used to train, explore and analyse the models.
The package can download, load and predict with 8 trained model variants, which are also available for manual download on [BioStudies](https://www.ebi.ac.uk/biostudies/studies/S-BSST732).
The Python package is contained in the `sequence_unet` directory, with documentation in the `docs` directory.
The modules and scripts used for training and exploring different models can also be used directly and are found in `analysis`, `bin` and `src`.
The Python scripts and modules require the `src` directory to be on your Python path.
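One way to put `src` on the Python path from inside a script is sketched below. The helper name `add_src_to_path` is hypothetical and not part of the repository; setting the `PYTHONPATH` environment variable would work equally well.

```python
import sys
from pathlib import Path


def add_src_to_path(repo_root):
    """Prepend <repo_root>/src to sys.path and return the path that was added.

    `repo_root` is wherever the repository was cloned (an assumption of this
    sketch). The function is idempotent: calling it twice adds the entry once.
    """
    src = str(Path(repo_root).resolve() / "src")
    if src not in sys.path:
        sys.path.insert(0, src)
    return src
```

After calling `add_src_to_path("path/to/sequence_unet")`, the modules in `src` can be imported by name.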

## Installation

In most cases pip should be able to handle all the dependencies, meaning installation is simple:

`pip install sequence_unet`

Or the latest development version from GitHub:

`pip install git+https://github.com/allydunham/sequence_unet`

### Manual installation

If pip can't resolve the correct dependencies, the requirements can be installed manually:

1. Install TensorFlow (or tensorflow-macos on M1 Macs): `pip install tensorflow` or `pip install tensorflow-macos`
2. Install other PyPI requirements: `pip install numpy pandas biopython tqdm proteinnetpy`
3. Install Sequence UNET: `pip install --no-deps sequence_unet`

This was needed on M1 Macs before I updated `pyproject.toml` to use `tensorflow-macos` for those systems, since no versions of TensorFlow matching the requirements were available.
It may also help with similar compatibility issues.

### Requirements

The python package requires:

* Python 3.6+
* TensorFlow 2.6+
* Numpy
* Pandas
* Biopython
* TQDM
* [ProteinNetPy](https://github.com/allydunham/proteinnetpy)

I have tried to keep the requirements as broad as possible, but future changes, particularly in TensorFlow, may result in more restrictive version support.
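To see which of these requirements are already installed, a small check like the following can help. It is only a sketch: the distribution names are assumed to match those used in the `pip install` commands above, `installed_versions` is a hypothetical helper, and `importlib.metadata` needs Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used in the pip commands above (assumed to match PyPI).
# On M1 Macs the TensorFlow distribution may be "tensorflow-macos" instead.
REQUIRED = ["tensorflow", "numpy", "pandas", "biopython", "tqdm", "proteinnetpy"]


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```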

Figure generation and performance analysis were performed in R 4.0, largely using [Tidyverse](https://www.tidyverse.org/) packages.
All R packages used are loaded in `src/config.R` or at the top of the respective script.

## Basic Usage

Detailed [documentation](https://sequence-unet.readthedocs.io/en/latest/) of the modules, functions and scripts is available on ReadTheDocs.

### Loading Models

Trained models are downloaded and loaded using the [sequence_unet.models](https://sequence-unet.readthedocs.io/en/latest/models.html) module, which also provides functions to initialise untrained models.
The loaded model is a TensorFlow Keras model, so it interfaces easily with other code.

```python
from sequence_unet import models

# Download a trained model
model_path = models.download_trained_model("freq_classifier", root="models", model_format="tf")

# Load the downloaded model via its path. This function can also download the model if not found
model = models.load_trained_model(model=model_path)

# or via name and root
model = models.load_trained_model(model="freq_classifier", root="models")
model.summary()
```
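Because the loaded object is a Keras model, its standard attributes can be used to check what it expects as input. The helper below is a minimal sketch (`describe_io` is a hypothetical name, not part of `sequence_unet`); it relies only on the `inputs` and `outputs` attributes of a Keras model and assumes the loaded model exposes them.

```python
def describe_io(model):
    """Return (name, shape) pairs for a Keras model's input and output tensors.

    Dimensions reported as None are variable, such as the batch size.
    """
    inputs = [(t.name, tuple(t.shape)) for t in model.inputs]
    outputs = [(t.name, tuple(t.shape)) for t in model.outputs]
    return {"inputs": inputs, "outputs": outputs}
```

Calling `describe_io(model)` on a loaded model gives a quick view of the tensor shapes before passing custom data to it.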

### Prediction

The [sequence_unet.predict](https://sequence-unet.readthedocs.io/en/latest/predict.html) module provides functions to predict from sequences or ProteinNet files.
Other input data can easily be parsed into one of these formats or transformed in Python to fit the model's input shapes (described in the [models.sequence_unet](https://sequence-unet.readthedocs.io/en/latest/models.html#sequence_unet.models.sequence_unet) function).

```python
from pandas import concat
from Bio import SeqIO
from proteinnetpy.data import ProteinNetDataset

from sequence_unet.models import load_trained_model
from sequence_unet.predict import predict_sequence, predict_proteinnet

# load a model
model = load_trained_model(model="freq_classifier", download=True)

# Predict from a fasta file
fasta = SeqIO.parse("path/to/fasta.fa", format="fasta")
preds = concat([p for p in predict_sequence(model, sequences=fasta, wide=True)])

# Predict from a ProteinNet file
data = ProteinNetDataset(path="path/to/proteinnet", preload=False)
preds = concat([p for p in predict_proteinnet(model, data, wide=True)])
```

The package also binds the `sequence_unet` command, which allows users to make predictions from the command line.
It supports predicting scores for proteins in a fasta or [ProteinNet](https://github.com/aqlaboratory/proteinnet/) file.
A full description is available in the [documentation](https://sequence-unet.readthedocs.io/en/latest/scripts.html) and can be accessed using `-h`.
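If the command needs to be called from a Python workflow, the help text can be retrieved with the standard library, as in this sketch. The helper `cli_help` is hypothetical; only the `sequence_unet` command name and the `-h` flag come from the text above.

```python
import shutil
import subprocess


def cli_help(command="sequence_unet"):
    """Return the output of `<command> -h`, or None if the command is not on PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run(
        [command, "-h"], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    print(cli_help() or "sequence_unet is not installed in this environment")
```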

### Training

The `sequence_unet` package does not include specific code for training new models, as the best way to do this will be specific to the user's needs.
However, the loaded models are [TensorFlow Keras models](https://www.tensorflow.org/api_docs/python/tf/keras/Model) and inherently support straightforward and extensible training.
I used the `training.py` script to train the models, which works from a model and data loading function saved by the `make_experiment_dir` function in `src/utils.py`.
Usage of the script is documented in its docstring and can be accessed using `-h`.
The training scripts in subdirectories of `models` give examples of the model training procedures I used to train the various forms of the model.
This setup can be adapted to other users' needs reasonably easily.
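As a rough illustration of training through the Keras interface, the sketch below compiles a loaded model and continues training it. It is not the procedure used in `training.py`: the helper name `fine_tune`, the default optimizer and the choice of loss are all assumptions, and the caller must supply a loss and dataset that match the model's outputs.

```python
def fine_tune(model, dataset, loss, epochs=1, optimizer="adam"):
    """Compile a Keras model and train it on `dataset`.

    `dataset` can be anything `Model.fit` accepts, for example a
    `tf.data.Dataset` yielding (inputs, targets) batches. `loss` is passed
    straight to `Model.compile` and is chosen by the caller; this sketch makes
    no claim about which loss suits a given Sequence UNET model.
    """
    model.compile(optimizer=optimizer, loss=loss)
    history = model.fit(dataset, epochs=epochs)
    return history
```

Keeping the loss and data pipeline as arguments leaves the model-specific decisions with the user, in line with the reasoning above.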

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "sequence-unet",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "Alistair Dunham <ad44@sanger.ac.uk>",
    "keywords": "deep learning,neural network,protein,bioinformatics,variant effect prediction",
    "author": "Pedro Beltrao, Mohammed AlQuraishi",
    "author_email": "Alistair Dunham <ad44@sanger.ac.uk>",
    "download_url": "https://files.pythonhosted.org/packages/1d/a0/c13aa301226742a3b69a83a1a89d42980977e344cfb5284d3437a2560ae5/sequence_unet-1.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Make protein predictions with Sequence UNET and train new models",
    "version": "1.0.3",
    "project_urls": {
        "Documentation": "https://sequence-unet.readthedocs.io/en/latest/",
        "Publication": "https://doi.org/10.1186/s13059-023-02948-3",
        "Repository": "https://github.com/allydunham/sequence_unet"
    },
    "split_keywords": [
        "deep learning",
        "neural network",
        "protein",
        "bioinformatics",
        "variant effect prediction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d82b841f3c9eef740fe33f0297acef56011215514cbb8cf30bf0c66a05008fe9",
                "md5": "eb26d23fa08afc6b7d2a6a838ca7bcfe",
                "sha256": "68d6cd1fe0aefffff533f44f9a39765b8dd7f434cfa2742c1aee4e861573d733"
            },
            "downloads": -1,
            "filename": "sequence_unet-1.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "eb26d23fa08afc6b7d2a6a838ca7bcfe",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 25578,
            "upload_time": "2023-07-28T14:43:14",
            "upload_time_iso_8601": "2023-07-28T14:43:14.504043Z",
            "url": "https://files.pythonhosted.org/packages/d8/2b/841f3c9eef740fe33f0297acef56011215514cbb8cf30bf0c66a05008fe9/sequence_unet-1.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1da0c13aa301226742a3b69a83a1a89d42980977e344cfb5284d3437a2560ae5",
                "md5": "e70f0193d4dc93f38bbe3a2db1bddd08",
                "sha256": "47fe1b875384e55cdce7bf74291e780e06141d2edf09d3e0b928b1220afd429d"
            },
            "downloads": -1,
            "filename": "sequence_unet-1.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "e70f0193d4dc93f38bbe3a2db1bddd08",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 24290,
            "upload_time": "2023-07-28T14:43:15",
            "upload_time_iso_8601": "2023-07-28T14:43:15.791242Z",
            "url": "https://files.pythonhosted.org/packages/1d/a0/c13aa301226742a3b69a83a1a89d42980977e344cfb5284d3437a2560ae5/sequence_unet-1.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-28 14:43:15",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "allydunham",
    "github_project": "sequence_unet",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "sequence-unet"
}
        