SmilesPE

Name: SmilesPE
Version: 0.0.3
home_page: https://github.com/XinhaoLi74/SmilesPE
Summary: Tokenize SMILES with substructure units
upload_time: 2020-04-07 22:12:03
docs_url: None
author: Xinhao Li
requires_python: >=3.6
license: Apache Software License 2.0
keywords: cheminformatics, smiles
requirements: No requirements were recorded.
Travis-CI: No Travis.
coveralls test coverage: No coveralls.
# SMILES Pair Encoding (SmilesPE).
> SMILES Pair Encoding (SmilesPE) trains a substructure tokenizer from a large set of SMILES strings (e.g., ChEMBL) based on [byte-pair-encoding (BPE)](https://www.aclweb.org/anthology/P16-1162/).


## Overview
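
To make the byte-pair-encoding analogy concrete, here is a minimal sketch of the greedy pair-merging idea applied to atom-level tokens. It only illustrates the principle; it is not SmilesPE's actual training code, and the `toy_bpe` helper and the three-molecule corpus are invented for this example.

```python
# Toy illustration of pair encoding (not the SmilesPE implementation):
# repeatedly merge the most frequent adjacent token pair across the corpus.
from collections import Counter

def toy_bpe(token_lists, num_merges=3):
    """Greedily merge the most frequent adjacent token pair num_merges times."""
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all sequences.
        pairs = Counter()
        for toks in token_lists:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        merged_corpus = []
        for toks in token_lists:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(toks[i] + toks[i + 1])
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            merged_corpus.append(out)
        token_lists = merged_corpus
    return merges, token_lists

# Atom-level tokens of three toy SMILES strings.
corpus = [list('CCO'), list('CCN'), list('CCOC')]
print(toy_bpe(corpus))
```

SmilesPE applies the same idea to atom-level tokens of SMILES from a large corpus such as ChEMBL, so that frequently co-occurring substructures become single vocabulary entries.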

## Installation

```
pip install SmilesPE
```

## Usage Instructions

### Basic Tokenizers

1. Atom-level Tokenizer

```python
from SmilesPE.pretokenizer import atomwise_tokenizer

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = atomwise_tokenizer(smi)
print(toks)
```

    ['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']


2. K-mer Tokenizer

```python
from SmilesPE.pretokenizer import kmer_tokenizer

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = kmer_tokenizer(smi, ngram=4)
print(toks)
```

    ['CC[N+](', 'C[N+](C', '[N+](C)', '(C)(', 'C)(C', ')(C)', '(C)C', 'C)Cc', ')Cc1', 'Cc1c', 'c1cc', '1ccc', 'cccc', 'cccc', 'ccc1', 'cc1Br']


The basic tokenizers are also compatible with [SELFIES](https://github.com/aspuru-guzik-group/selfies) and [DeepSMILES](https://github.com/baoilleach/deepsmiles). The corresponding packages must be installed first.
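
For example, both are available from PyPI (assuming the standard package names):

```
pip install selfies deepsmiles
```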

Example of SELFIES

```python
import selfies
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
sel = selfies.encoder(smi)
print(f'SELFIES string: {sel}')
>>> SELFIES string: [C][C][N+][Branch1_2][epsilon][C][Branch1_3][epsilon][C][C][c][c][c][c][c][c][Ring1][Branch1_1][Br]
toks = atomwise_tokenizer(sel)
print(toks)
>>> ['[C]', '[C]', '[N+]', '[Branch1_2]', '[epsilon]', '[C]', '[Branch1_3]', '[epsilon]', '[C]', '[C]', '[c]', '[c]', '[c]', '[c]', '[c]', '[c]', '[Ring1]', '[Branch1_1]', '[Br]']

toks = kmer_tokenizer(sel, ngram=4)
print(toks)

>>> ['[C][C][N+][Branch1_2]', '[C][N+][Branch1_2][epsilon]', '[N+][Branch1_2][epsilon][C]', '[Branch1_2][epsilon][C][Branch1_3]', '[epsilon][C][Branch1_3][epsilon]', '[C][Branch1_3][epsilon][C]', '[Branch1_3][epsilon][C][C]', '[epsilon][C][C][c]', '[C][C][c][c]', '[C][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][Ring1]', '[c][c][Ring1][Branch1_1]', '[c][Ring1][Branch1_1][Br]']
```

Example of DeepSMILES

```python
import deepsmiles
converter = deepsmiles.Converter(rings=True, branches=True)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
deepsmi = converter.encode(smi)
print(f'DeepSMILES string: {deepsmi}')

>>> DeepSMILES string: CC[N+]C)C)Ccccccc6Br
toks = atomwise_tokenizer(deepsmi)
print(toks)

>>> ['C', 'C', '[N+]', 'C', ')', 'C', ')', 'C', 'c', 'c', 'c', 'c', 'c', 'c', '6', 'Br']

toks = kmer_tokenizer(deepsmi, ngram=4)
print(toks)

>>> ['CC[N+]C', 'C[N+]C)', '[N+]C)C', 'C)C)', ')C)C', 'C)Cc', ')Ccc', 'Cccc', 'cccc', 'cccc', 'cccc', 'ccc6', 'cc6Br']
```

### Use the Pre-trained SmilesPE Tokenizer

Download ['SPE_ChEMBL.txt'](https://github.com/XinhaoLi74/SmilesPE/blob/master/SPE_ChEMBL.txt).

```python

import codecs
from SmilesPE.tokenizer import *

spe_vob = codecs.open('../SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vob)

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
spe.tokenize(smi)

>>> 'CC [N+](C) (C)C c1ccccc1 Br'
```
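
As the output above shows, `tokenize` returns the substructure tokens joined by single spaces, so a plain string split recovers a token list for downstream use (a usage sketch, not an additional library API):

```python
# Split the space-joined output back into a list of substructure tokens.
toks = spe.tokenize(smi).split(' ')
print(toks)

>>> ['CC', '[N+](C)', '(C)C', 'c1ccccc1', 'Br']
```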

### Train a SmilesPE Tokenizer with a Custom Dataset

See [train_SPE.ipynb](https://github.com/XinhaoLi74/SmilesPE/blob/master/Examples/train_SPE.ipynb) for an example of training an SPE tokenizer on ChEMBL data.
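
If you prefer a script over the notebook, the sketch below shows the general shape of such a training call. The `learn_SPE` function in `SmilesPE.learner`, its argument order, and the file paths are assumptions made for illustration; check train_SPE.ipynb for the authoritative signature and recommended settings.

```python
# Hedged sketch: learn_SPE and its keyword arguments are assumed from the
# example notebook; verify against train_SPE.ipynb before relying on them.
import codecs
from SmilesPE.learner import learn_SPE

# A list of training SMILES, e.g. exported from ChEMBL (path is illustrative).
with open('chembl_smiles.txt') as f:
    train_smiles = [line.strip() for line in f if line.strip()]

# File that will receive the learned merge vocabulary.
output = codecs.open('SPE_custom.txt', 'w')

# Learn up to 30,000 merges, keeping only pairs seen at least 2,000 times.
learn_SPE(train_smiles, output, 30000, min_frequency=2000,
          augmentation=1, verbose=True, total_symbols=True)
output.close()
```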



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/XinhaoLi74/SmilesPE",
    "name": "SmilesPE",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "Cheminformatics SMILES",
    "author": "Xinhao Li",
    "author_email": "xli74@ncsu.edu",
    "download_url": "https://files.pythonhosted.org/packages/5e/5c/a638fd96cdf4499eaed76d5dbcec734d98c4ddaf2a8f9e13e44e5151fa29/SmilesPE-0.0.3.tar.gz",
    "platform": "",
    "description": "# SMILES Pair Encoding (SmilesPE).\n> SMILES Pair Encoding (SmilesPE) trains a substructure tokenizer from a large set of SMILES strings (e.g., ChEMBL) based on [byte-pair-encoding (BPE)](https://www.aclweb.org/anthology/P16-1162/).\n\n\n## Overview\n\n## Installation\n\n```\npip install SmilesPE\n```\n\n## Usage Instructions\n\n### Basic Tokenizers\n\n1. Atom-level Tokenizer\n\n```python\nfrom SmilesPE.pretokenizer import atomwise_tokenizer\n\nsmi = 'CC[N+](C)(C)Cc1ccccc1Br'\ntoks = atomwise_tokenizer(smi)\nprint(toks)\n```\n\n    ['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']\n\n\n2. K-mer Tokenzier\n\n```python\nfrom SmilesPE.pretokenizer import kmer_tokenizer\n\nsmi = 'CC[N+](C)(C)Cc1ccccc1Br'\ntoks = kmer_tokenizer(smi, ngram=4)\nprint(toks)\n```\n\n    ['CC[N+](', 'C[N+](C', '[N+](C)', '(C)(', 'C)(C', ')(C)', '(C)C', 'C)Cc', ')Cc1', 'Cc1c', 'c1cc', '1ccc', 'cccc', 'cccc', 'ccc1', 'cc1Br']\n\n\nThe basic tokenizers are also compatible with [SELFIES](https://github.com/aspuru-guzik-group/selfies) and [DeepSMILES](https://github.com/baoilleach/deepsmiles). Package installations are required.\n\nExample of SELFIES\n\n```python\nimport selfies\nsmi = 'CC[N+](C)(C)Cc1ccccc1Br'\nsel = selfies.encoder(smi)\nprint(f'SELFIES string: {sel}')\n> >> SELFIES string: [C][C][N+][Branch1_2][epsilon][C][Branch1_3][epsilon][C][C][c][c][c][c][c][c][Ring1][Branch1_1][Br]    \ntoks = atomwise_tokenizer(sel)\nprint(toks)\n> >> ['[C]', '[C]', '[N+]', '[Branch1_2]', '[epsilon]', '[C]', '[Branch1_3]', '[epsilon]', '[C]', '[C]', '[c]', '[c]', '[c]', '[c]', '[c]', '[c]', '[Ring1]', '[Branch1_1]', '[Br]']\n\ntoks = kmer_tokenizer(sel, ngram=4)\nprint(toks)\n\n>>> ['[C][C][N+][Branch1_2]', '[C][N+][Branch1_2][epsilon]', '[N+][Branch1_2][epsilon][C]', '[Branch1_2][epsilon][C][Branch1_3]', '[epsilon][C][Branch1_3][epsilon]', '[C][Branch1_3][epsilon][C]', '[Branch1_3][epsilon][C][C]', '[epsilon][C][C][c]', '[C][C][c][c]', '[C][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][Ring1]', '[c][c][Ring1][Branch1_1]', '[c][Ring1][Branch1_1][Br]']\n```\n\nExample of DeepSMILES\n\n```python\nimport deepsmiles\nconverter = deepsmiles.Converter(rings=True, branches=True)\nsmi = 'CC[N+](C)(C)Cc1ccccc1Br'\ndeepsmi = converter.encode(smi)\nprint(f'DeepSMILES string: {deepsmi}')> >> DeepSMILES string: CC[N+]C)C)Ccccccc6Br\ntoks = atomwise_tokenizer(deepsmi)\nprint(toks)\n\n>>> ['C', 'C', '[N+]', 'C', ')', 'C', ')', 'C', 'c', 'c', 'c', 'c', 'c', 'c', '6', 'Br']\n\ntoks = kmer_tokenizer(deepsmi, ngram=4)\nprint(toks)\n\n>>> ['CC[N+]C', 'C[N+]C)', '[N+]C)C', 'C)C)', ')C)C', 'C)Cc', ')Ccc', 'Cccc', 'cccc', 'cccc', 'cccc', 'ccc6', 'cc6Br']\n```\n\n### Use the Pre-trained SmilesPE Tokenizer\n\nDowbload ['SPE_ChEMBL.txt'](https://github.com/XinhaoLi74/SmilesPE/blob/master/SPE_ChEMBL.txt).\n\n```python\n\nimport codecs\nfrom SmilesPE.tokenizer import *\n\nspe_vob= codecs.open('../SPE_ChEMBL.txt')\nspe = SPE_Tokenizer(spe_vob)\n\nsmi = 'CC[N+](C)(C)Cc1ccccc1Br'\nspe.tokenize(smi)\n\n>>> 'CC [N+](C) (C)C c1ccccc1 Br'\n```\n\n### Train a SmilesPE Tokenizer with a Custom Dataset\n\nSee [train_SPE.ipynb](https://github.com/XinhaoLi74/SmilesPE/blob/master/Examples/train_SPE.ipynb) for an example of training A SPE tokenizer on ChEMBL data.\n\n\n",
    "bugtrack_url": null,
    "license": "Apache Software License 2.0",
    "summary": "Tokenize SMILES with substructure units",
    "version": "0.0.3",
    "split_keywords": [
        "cheminformatics",
        "smiles"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6df9273f54d9d4b42779926291c82a5b3ffea30cff2492ebbe4ce08dccdcc949",
                "md5": "d9faf4f4f324a7018a099d8f9a933d6c",
                "sha256": "9f74279daa14945859546fb2de11c208b5116927ce5fe03b3cf46bcba96f5e58"
            },
            "downloads": -1,
            "filename": "SmilesPE-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d9faf4f4f324a7018a099d8f9a933d6c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 15723,
            "upload_time": "2020-04-07T22:12:02",
            "upload_time_iso_8601": "2020-04-07T22:12:02.522544Z",
            "url": "https://files.pythonhosted.org/packages/6d/f9/273f54d9d4b42779926291c82a5b3ffea30cff2492ebbe4ce08dccdcc949/SmilesPE-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5e5ca638fd96cdf4499eaed76d5dbcec734d98c4ddaf2a8f9e13e44e5151fa29",
                "md5": "ac151f898f038aab0f6becc2f620e78d",
                "sha256": "7ceebc7d314e456a08f77d45f08fe4b638886901c0eac50f0cdb005b9f0912bc"
            },
            "downloads": -1,
            "filename": "SmilesPE-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "ac151f898f038aab0f6becc2f620e78d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 15625,
            "upload_time": "2020-04-07T22:12:03",
            "upload_time_iso_8601": "2020-04-07T22:12:03.569070Z",
            "url": "https://files.pythonhosted.org/packages/5e/5c/a638fd96cdf4499eaed76d5dbcec734d98c4ddaf2a8f9e13e44e5151fa29/SmilesPE-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2020-04-07 22:12:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "XinhaoLi74",
    "github_project": "SmilesPE",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "smilespe"
}
        