protenc


| Field | Value |
| --- | --- |
| Name | protenc |
| Version | 0.1.6 |
| Home page | https://github.com/kklemon/ProtEnc |
| Summary | Extract protein embeddings from protein language models. |
| Upload time | 2023-10-06 08:35:57 |
| Author | Kristian Klemon |
| Requires Python | `>=3.10,<3.13` |
| Docs URL | None |
| Requirements | No requirements were recorded. |
| Travis CI | No Travis. |
| Coveralls test coverage | No coveralls. |

ProtEnc: generate protein embeddings the easy way
=======

[ProtEnc](https://github.com/kklemon/ProtEnc) aims to simplify the extraction of protein embeddings from pre-trained protein language models (pLMs). It provides a simple Python API and bulk generation scripts that work across the ever-growing landscape of pLMs. The currently supported models are:

* [ProtTrans](https://github.com/agemagician/ProtTrans) family
* [ESM](https://github.com/facebookresearch/esm)
* AlphaFold (coming soon™)
* [OmegaPLM](https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1) (coming soon™)

Usage
-----

### Installation

```bash
pip install protenc
```

### Python API

```python
import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Each embedding has shape [L, D], where L is the sequence length and D is the embedding dimensionality.
  print(embed.shape)

  # Derive a single per-protein embedding vector of shape [D] by averaging along the sequence dimension.
  protein_embed = embed.mean(0)
```
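
To compare whole proteins, the per-residue embeddings can be pooled into one vector each and stacked into a single matrix. The following is a minimal sketch of that step, assuming NumPy is installed. `mean_pool` and `cosine_similarity` are hypothetical helpers that are not part of ProtEnc, and random arrays stand in for real encoder output so that the snippet runs without a model or GPU.

```python
import numpy as np


def mean_pool(embeddings):
    """Average each [L, D] array over its sequence axis and stack the results to [N, D]."""
    return np.stack([embed.mean(axis=0) for embed in embeddings])


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in data (an assumption for illustration): two "proteins" of
# different lengths (5 and 7 residues) with embedding dimensionality D = 4.
rng = np.random.default_rng(0)
fake_embeddings = [rng.normal(size=(5, 4)), rng.normal(size=(7, 4))]

pooled = mean_pool(fake_embeddings)
print(pooled.shape)  # (2, 4)
print(cosine_similarity(pooled[0], pooled[1]))
```

With real data, the list passed to `mean_pool` would be the arrays yielded by the encoder. Mean pooling is only one choice; it discards positional information, which may or may not matter for a given downstream task.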

### Command-line interface

After installation, use the `protenc` shell command for bulk generation and export of protein embeddings.

```bash
protenc sequences.fasta embeddings.lmdb --model_name=<name-of-model>
```

By default, input and output formats are inferred from the file extensions.
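
ProtEnc's actual inference logic is not reproduced here. As an illustration of the general idea, the following sketch maps file extensions to format names with a hypothetical lookup table. `INPUT_FORMATS`, `OUTPUT_FORMATS`, and `infer_format` are made-up names, and the real implementation may differ.

```python
from pathlib import Path

# Hypothetical lookup tables, for illustration only.
INPUT_FORMATS = {".csv": "csv", ".json": "json", ".fasta": "fasta"}
OUTPUT_FORMATS = {".lmdb": "lmdb"}


def infer_format(path, known_formats):
    """Return the format name registered for the extension of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in known_formats:
        raise ValueError(
            f"Cannot infer format from extension {suffix!r}; "
            f"known extensions: {sorted(known_formats)}"
        )
    return known_formats[suffix]


print(infer_format("sequences.fasta", INPUT_FORMATS))   # fasta
print(infer_format("embeddings.lmdb", OUTPUT_FORMATS))  # lmdb
```

Failing loudly on an unknown extension is a deliberate choice in this sketch: a silent fallback could write embeddings in a format the caller did not expect.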

Run

```bash
protenc --help
```

for a detailed usage description.
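
The commands above read sequences from a FASTA file. If the sequences are held in Python, such a file can be written with the standard library alone. This is a minimal sketch; `write_fasta` and the record identifiers are hypothetical and not part of ProtEnc.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path, line_width=60):
    """Write an {identifier: sequence} mapping to `path` in FASTA format."""
    with open(path, "w") as handle:
        for identifier, sequence in records.items():
            # Header line: '>' followed by the record identifier.
            handle.write(f">{identifier}\n")
            # Sequence lines, wrapped to a fixed width.
            for start in range(0, len(sequence), line_width):
                handle.write(sequence[start:start + line_width] + "\n")


# Hypothetical identifiers paired with example sequences.
records = {
    "protein_1": "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "protein_2": "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
}

# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = Path(tmp_dir) / "sequences.fasta"
    write_fasta(records, fasta_path)
    print(fasta_path.read_text())
```

In practice, `write_fasta(records, "sequences.fasta")` would write the file into the working directory, where it can then be passed to the `protenc` command.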

**Example**

Generate protein embeddings using the ESM2 650M model for sequences provided in a [FASTA](https://en.wikipedia.org/wiki/FASTA_format) file and write embeddings to an [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database):

```bash
protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D
```

The generated embeddings are stored in an LMDB key-value store and can be read back with the `read_from_lmdb` utility function:

```python
from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
```

**Features**

Input formats:
* CSV
* JSON
* [FASTA](https://en.wikipedia.org/wiki/FASTA_format)

Output formats:
* [LMDB](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database)
* [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) (coming soon)

General:
* Multi-GPU inference (`--data_parallel`)
* FP16 inference (`--amp`)

Development
-----------

Clone the repository:

```bash
git clone https://github.com/kklemon/protenc.git
```

Install dependencies via [Poetry](https://python-poetry.org/):

```bash
poetry install
```

Contribution
------------

Have a feature idea or found a bug? Would you like to see support for a new model? Feel free to [create an issue](https://github.com/kklemon/ProtEnc/issues/new).

Todo
----

- [ ] Support for more input formats
  - [X] CSV
  - [ ] Parquet
  - [X] FASTA
  - [X] JSON
- [ ] Support for more output formats
  - [X] LMDB
  - [ ] HDF5
  - [ ] DataFrame
  - [ ] Pickle
- [ ] Support for large models
  - [ ] Model offloading
  - [ ] Sharding
  - [ ] FlashAttention (via Kernl?)
- [ ] Support for more protein language models
  - [X] Whole ProtTrans family
  - [X] Whole ESM family
  - [ ] AlphaFold (?)
- [X] Implement all remaining TODOs in code
- [ ] Evaluation
- [ ] Demos
- [ ] Distributed inference
- [ ] Maybe support some sort of optimized inference such as quantization
  - This may be up to the model providers
- [ ] Improve documentation
- [ ] Support translation of gene sequences
- [ ] Add tests. We need tests!!!

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/kklemon/ProtEnc",
    "name": "protenc",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10,<3.13",
    "maintainer_email": "",
    "keywords": "",
    "author": "Kristian Klemon",
    "author_email": "kristian.klemon@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d3/34/d73edcf3a7965023b85919fda6230c74d5d608e0ea65bf4ff703d46295f1/protenc-0.1.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Extract protein embeddings from protein language models.",
    "version": "0.1.6",
    "project_urls": {
        "Homepage": "https://github.com/kklemon/ProtEnc",
        "Repository": "https://github.com/kklemon/ProtEnc"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d5f8b9601dd9d40fa7c0cfe1648abd4bf18f16d42a0ea9daf7e908c81b610dd5",
                "md5": "69498c2146f059d2d277948f9f9c7ab8",
                "sha256": "f3271b4c279bef15d315cf7ae00cf950ace17f662ecbb5dbffa6a5a8817d3097"
            },
            "downloads": -1,
            "filename": "protenc-0.1.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "69498c2146f059d2d277948f9f9c7ab8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10,<3.13",
            "size": 13510,
            "upload_time": "2023-10-06T08:35:56",
            "upload_time_iso_8601": "2023-10-06T08:35:56.163117Z",
            "url": "https://files.pythonhosted.org/packages/d5/f8/b9601dd9d40fa7c0cfe1648abd4bf18f16d42a0ea9daf7e908c81b610dd5/protenc-0.1.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d334d73edcf3a7965023b85919fda6230c74d5d608e0ea65bf4ff703d46295f1",
                "md5": "e3fefcd799ee203395551b7bf9282fe4",
                "sha256": "b5c9304c9a664abcff9c6540166e83a7ce32d6eb9b3539c8f6ade9fbda080a1f"
            },
            "downloads": -1,
            "filename": "protenc-0.1.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e3fefcd799ee203395551b7bf9282fe4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10,<3.13",
            "size": 12714,
            "upload_time": "2023-10-06T08:35:57",
            "upload_time_iso_8601": "2023-10-06T08:35:57.425820Z",
            "url": "https://files.pythonhosted.org/packages/d3/34/d73edcf3a7965023b85919fda6230c74d5d608e0ea65bf4ff703d46295f1/protenc-0.1.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-06 08:35:57",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "kklemon",
    "github_project": "ProtEnc",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "protenc"
}
        