proteinnetpy

Name: proteinnetpy
Version: 1.0.1
Summary: Read, process and write ProteinNet data
Upload time: 2023-07-28 14:24:02
Requires Python: >=3
License: Apache License 2.0 (Copyright 2020 EMBL - European Bioinformatics Institute)
Keywords: protein, bioinformatics, proteinnet, machine learning
Repository: https://github.com/allydunham/proteinnetpy
Documentation: https://proteinnetpy.readthedocs.io/en/latest/
Publication: https://doi.org/10.1186/s13059-023-02948-3
# ProteinNetPy 1.0.1
<!-- badges: start -->
[![DOI](https://zenodo.org/badge/267846791.svg)](https://zenodo.org/badge/latestdoi/267846791)
[![Documentation Status](https://readthedocs.org/projects/proteinnetpy/badge/?version=latest)](https://proteinnetpy.readthedocs.io/en/latest/?badge=latest)
<!-- badges: end -->

A Python library for working with [ProteinNet](https://github.com/aqlaboratory/proteinnet) text data, allowing you to easily load, stream and filter data, map functions across records and produce TensorFlow datasets.
For details of the dataset see the ProteinNet [Bioinformatics paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2932-0).
Documentation for all functions of the module is available [here](https://proteinnetpy.readthedocs.io/en/latest/).

## Install

`pip install proteinnetpy`

Or install the development version from GitHub:

`pip install git+https://github.com/allydunham/proteinnetpy`

## Requirements

* Python 3
* Numpy
* Biopython
* TensorFlow (if using the `tfdataset` module)

## Basic Usage

The main object used in ProteinNetPy is the `ProteinNetRecord`, which provides access to the various record fields as well as methods for common manipulations, such as calculating a one-hot sequence representation or a residue distance matrix.
It also supports most applicable built-in operations, such as `len` and `str`.
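To make the two representations concrete, here is an illustrative NumPy sketch of what a one-hot sequence matrix and a residue distance matrix look like. This is not the library's code; the function names and the `(20, L)` orientation are assumptions for illustration.

```python
# Illustrative sketch (not the ProteinNetPy API): a one-hot sequence
# matrix and a pairwise residue distance matrix, built with NumPy.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues

def one_hot_sequence(seq):
    """Return a (20, L) one-hot matrix for an amino acid sequence."""
    mat = np.zeros((len(AMINO_ACIDS), len(seq)))
    for i, aa in enumerate(seq):
        mat[AMINO_ACIDS.index(aa), i] = 1.0
    return mat

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue coordinates (L, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

mat = one_hot_sequence("ACDGK")        # shape (20, 5), one 1.0 per column
coords = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]])
dist = distance_matrix(coords)         # dist[0, 1] == 5.0
```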
While the `parser` module contains a generator to parse files, it is generally easier to use the `ProteinNetDataset` class from the `data` module:

```python
from proteinnetpy.data import ProteinNetDataset
data = ProteinNetDataset(path="path/to/proteinnet")
```

This class includes a `preload` argument, which determines whether the dataset is loaded into memory or streamed.
It also supports filtering via the `filter_func` argument, which accepts a function that returns a truthy value for each record that should be kept in the dataset.
A range of common filters are included in the data module, as well as `combine_filters()`, which applies all passed filters to each record.
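The filter contract can be sketched as follows. `make_length_filter` and `combine_filters` exist in `proteinnetpy.data`, but the bodies below are illustrative reimplementations under assumed semantics, not the library's code:

```python
# Sketch of the filter_func contract: any callable returning a truthy
# value for records that should be kept. Bodies are illustrative only.

def make_length_filter(min_length=0, max_length=float("inf")):
    """Keep records whose length lies in [min_length, max_length]."""
    def length_filter(record):
        return min_length <= len(record) <= max_length
    return length_filter

def combine_filters(*filters):
    """Keep records that pass every supplied filter."""
    def combined(record):
        return all(f(record) for f in filters)
    return combined
```

A combined filter built this way can then be passed as `filter_func` when constructing the dataset.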

Once a dataset has been loaded it can be iterated over to process data.
The `ProteinNetMap` class creates map objects that map a function over the dataset, including options to stream the map on each iteration or pre-calculate results.
They have a `generate` method that creates a generator object yielding the output of the function.
The `LabeledFunction` class is provided to create functions annotated with output types and shapes, used for automatically creating TensorFlow datasets.
The `mutation` module provides some example functions that return mutated records.
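The map-and-generate pattern can be sketched in a few lines. This is a minimal analogue under assumed semantics, not the library's `ProteinNetMap` implementation:

```python
# Minimal sketch (illustrative, not the library's code) of the
# ProteinNetMap pattern: map a function over records and yield results,
# optionally skipping records the function cannot process.
class SimpleRecordMap:
    def __init__(self, records, func, filter_errors=True):
        self.records = records
        self.func = func
        self.filter_errors = filter_errors

    def generate(self):
        """Yield func(record) for each record, skipping failures if requested."""
        for record in self.records:
            try:
                yield self.func(record)
            except ValueError:
                if not self.filter_errors:
                    raise

def double_length(record):
    if not record:
        raise ValueError("empty record")
    return 2 * len(record)

mapped = SimpleRecordMap(["AA", "", "ABC"], double_length)
results = list(mapped.generate())  # the empty record is skipped
```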

The following example code shows a typical simple usage, creating a streamed TensorFlow dataset from ProteinNet data:

```python
from proteinnetpy import data
from proteinnetpy import tfdataset

class MapFunction(data.LabeledFunction):
    """
    Example ProteinNetMap function outputting one-hot sequence and contact
    graph input data with multiple alignment PSSM labels
    """
    def __init__(self):
        self.output_shapes = (([None, 20], [None, None]), [None, 20])
        self.output_types = (('float32', 'float32'), 'int32')

    def __call__(self, record):
        return (record.get_one_hot_sequence().T, record.distance_matrix()), record.evolutionary.T

filter_func = data.make_length_filter(min_length=32, max_length=2000)
dataset = data.ProteinNetDataset(path="path/to/proteinnet", preload=False,
                                 filter_func=filter_func)
pn_map = data.ProteinNetMap(dataset, map=MapFunction(), static=False, filter_errors=True)

tf_dataset = tfdataset.proteinnet_tf_dataset(pn_map, batch_size=100, prefetch=400, shuffle_buffer=200)
```
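The batching step of such a pipeline can be sketched in pure Python. This is a rough analogue of what `tf.data.Dataset.from_generator(...).batch(n)` does with the map's generator, stated as an assumption rather than the library's actual implementation:

```python
# Rough analogue (assumption, not the library's code) of batching a
# streamed record generator: group yielded items into fixed-size batches,
# emitting a final partial batch if the stream length is not a multiple
# of batch_size.
def batched(generator, batch_size):
    batch = []
    for item in generator:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

batches = list(batched(iter(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```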

Many more functions and arguments are available; detailed descriptions can currently be found in the docstrings, from which full documentation will be generated for a future release.

## Scripts

The package also provides convenience scripts for processing ProteinNet datasets:

* `add_angles_to_proteinnet` - Add extra fields with φ, ψ and χ backbone/side-chain torsion angles to a ProteinNet file
* `proteinnet_to_fasta` - Extract a FASTA file with the sequences from a ProteinNet file
* `filter_proteinnet` - Filter a ProteinNet file to include/exclude records from a list of IDs

Detailed usage instructions for each can be found using the `-h` argument.

            
