| Field | Value |
|---|---|
| Name | mhfp |
| Version | 1.9.2 |
| home_page | https://github.com/reymond-group/mhfp |
| Summary | Molecular MHFP fingerprints for cheminformatics applications |
| upload_time | 2019-08-28 13:12:33 |
| author | Daniel Probst |
| docs_url | None |
| requirements | No requirements were recorded. |

# MHFP
MHFP6 (MinHash fingerprint, up to six bonds) is a molecular fingerprint which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. Furthermore, MHFP6 outperforms ECFP4 in approximate nearest neighbor searches by two orders of magnitude in terms of speed, while decreasing the error rate.
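
The shingling-plus-MinHash idea can be sketched in plain Python. This is an illustration only: the substructure strings are hand-written stand-ins for what RDKit would extract, and the salted SHA-1 hashing and the helper name `minhash` are assumptions, not the scheme used inside the `mhfp` package.

```python
import hashlib


def minhash(shingles, n_permutations=16, seed=42):
    """Return a MinHash signature (list of ints) for a non-empty set of strings."""
    signature = []
    for i in range(n_permutations):
        # Salting the hash with (seed, i) stands in for the i-th random permutation.
        salt = f"{seed}:{i}:".encode()
        signature.append(
            min(
                int.from_bytes(hashlib.sha1(salt + s.encode()).digest()[:4], "big")
                for s in shingles
            )
        )
    return signature


# Hypothetical substructure SMILES sets, written by hand for illustration.
shingles_a = {"C", "CC", "CCO", "CO", "c1ccccc1"}
shingles_b = {"C", "CC", "CCO", "CN", "c1ccccc1"}

sig_a = minhash(shingles_a)
sig_b = minhash(shingles_b)

# Similar sets tend to share the same minimum in many positions.
matches = sum(x == y for x, y in zip(sig_a, sig_b))
print(f"{matches} of {len(sig_a)} positions agree")
```

Each position keeps only the smallest hash value seen under one "permutation", so two sets that share most of their elements are likely to agree in most positions.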

**Associated Publication**:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0321-8
## Materials
The performance of MHFP has been evaluated using the [Open-source platform to benchmark fingerprints for ligand-based virtual screening](https://github.com/rdkit/benchmarking_platform) and [ChEMBL24](https://www.ebi.ac.uk/chembl/downloads). Python (pyspark) scripts for the ChEMBL24 benchmark can be found in this repository.
## Dependencies
MHFP requires Python 3.x and NumPy.
MHFP requires the cheminformatics library RDKit.
## Installation
MHFP can be installed using the Python package manager pip:
```bash
pip install mhfp
```
## Usage
Usage is straightforward:
```Python
from mhfp.encoder import MHFPEncoder
mhfp_encoder = MHFPEncoder()
fp_a = mhfp_encoder.encode('CCOC1=C(C=C(C=C1)S(=O)(=O)N(C)C)C2=NC(=O)C3=C(N2)C(=NN3C)C(C)(C)C')
fp_b = mhfp_encoder.encode('CCCC1=NN(C2=C1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C')
fp_c = mhfp_encoder.encode('O=C(OC)C(C1CCCCN1)C2=CC=CC=C2')
print(MHFPEncoder.distance(fp_a, fp_a))
print(MHFPEncoder.distance(fp_a, fp_b))
print(MHFPEncoder.distance(fp_a, fp_c))
print(MHFPEncoder.distance(fp_b, fp_c))
#> 0.0
#> 0.45849609375
#> 0.97998046875
#> 0.97216796875
```
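
The printed distances are all multiples of 1/2048 (for example, 0.45849609375 = 939/2048), which is what one would expect if the distance is estimated as the fraction of signature positions that differ. A minimal sketch under that assumption follows; the helper name `minhash_distance` is hypothetical and the function works on plain lists rather than `numpy.ndarray` instances.

```python
def minhash_distance(a, b):
    """Estimate the Jaccard distance as the share of positions that differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - matches / len(a)


# Toy signatures with four positions each.
print(minhash_distance([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
print(minhash_distance([1, 2, 3, 4], [1, 2, 0, 0]))  # 0.5
```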
### LSH Forest
As the fingerprints are created using the locality sensitive hashing scheme MinHash, they are ready to be used with LSH-based algorithms:
```Python
from mhfp.encoder import MHFPEncoder
from mhfp.lsh_forest import LSHForestHelper
mhfp_encoder = MHFPEncoder()
fp_a = mhfp_encoder.encode('CCOC1=C(C=C(C=C1)S(=O)(=O)N(C)C)C2=NC(=O)C3=C(N2)C(=NN3C)C(C)(C)C')
fp_b = mhfp_encoder.encode('CCCC1=NN(C2=C1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C')
fp_c = mhfp_encoder.encode('O=C(OC)C(C1CCCCN1)C2=CC=CC=C2')
fp_q = mhfp_encoder.encode('COC1=C(C=C(C=C1)S(=O)(O)N(C)C)C2=NC(=O)C3=C(N2)C(=NN3C)C(C)(C)C')
lsh_forest_helper = LSHForestHelper()
fps = [fp_a, fp_b, fp_c]
for i, fp in enumerate(fps):
    lsh_forest_helper.add(i, fp)
lsh_forest_helper.index()
print(lsh_forest_helper.query(fp_q, 1, fps))
#> [0]
```
### SECFP
SECFP (SMILES Extended Connectivity Fingerprint) creation is achieved by calling the static method `MHFPEncoder.secfp_from_smiles()`:
```Python
from mhfp.encoder import MHFPEncoder
fp = MHFPEncoder.secfp_from_smiles('CCOC1=C(C=C(C=C1)S(=O)(=O)N(C)C)C2=NC(=O)C3=C(N2)C(=NN3C)C(C)(C)C')
```
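
Folding, as the `length` parameter in the documentation suggests, can be pictured as hashing every substructure string and setting the position given by the hash modulo the fingerprint length. The sketch below assumes exactly that; the SHA-1 hash, the helper name `fold`, and the hand-written substructure strings are illustrative and are not taken from the `mhfp` code.

```python
import hashlib


def fold(shingles, length=2048):
    """Fold a set of substructure strings into a bit list of fixed length."""
    bits = [0] * length
    for s in shingles:
        h = int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")
        # Different substructures may collide on the same position.
        bits[h % length] = 1
    return bits


# Hypothetical substructure SMILES, written by hand for illustration.
shingles = {"C", "CC", "CCO", "c1ccccc1"}
fp = fold(shingles)
print(len(fp), sum(fp))
```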
## Documentation
### MHFPEncoder
The class for encoding SMILES and RDKit molecule instances as MHFP fingerprints.
```Python
__init__(n_permutations = 2048, seed = 42)
```
| Parameter | Default | Description |
|---|---|---|
| n_permutations | ```2048``` | Analogous to the number of bits ECFP fingerprints are folded into. Higher values give a more exact distance estimate at the cost of larger fingerprints; lower values are less exact. |
| seed | ```42``` | The seed for the MinHash operation. Has to be the same for comparable fingerprints. |
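
Why the seed has to match can be seen in a small sketch. It assumes a common MinHash construction in which each "permutation" is a random linear hash `(a * h + b) mod p` whose coefficients are drawn from a seeded generator; the helper names, the CRC32 base hash, and the constants are illustrative and may differ from what `MHFPEncoder` does internally.

```python
import random
import zlib

PRIME = (1 << 61) - 1     # modulus for the linear hashes (assumed)
MAX_HASH = (1 << 32) - 1  # keep 32-bit hash values (assumed)


def make_permutations(n_permutations, seed):
    """Draw one (a, b) coefficient pair per permutation from a seeded RNG."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(n_permutations)
    ]


def signature(shingles, permutations):
    """MinHash signature: the minimum permuted hash for every (a, b) pair."""
    base = [zlib.crc32(s.encode()) for s in shingles]
    return [
        min(((a * h + b) % PRIME) & MAX_HASH for h in base)
        for a, b in permutations
    ]


# Hypothetical substructure strings, for illustration only.
shingles = {"C", "CC", "CCO", "c1ccccc1"}

sig_ref = signature(shingles, make_permutations(8, seed=42))
sig_same = signature(shingles, make_permutations(8, seed=42))
sig_other = signature(shingles, make_permutations(8, seed=7))

print(sig_ref == sig_same)   # same seed, same input: identical signatures
print(sum(x == y for x, y in zip(sig_ref, sig_other)), "positions agree across seeds")
```

With different seeds the same molecule yields unrelated signatures, so a position-wise comparison between them says nothing about similarity.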
```Python
MHFPEncoder.encode(in_smiles, radius = 3, rings = True, kekulize = True, sanitize = True)
```
| Parameter | Default | Description |
|---|---|---|
| in_smiles | | The input SMILES. |
| radius | ```3``` | Analogous to the radius for the Morgan fingerprint. The default radius 3 encodes SMILES to MHFP6. |
| rings | ```True``` | Whether rings in the molecule are included in the fingerprint. As a radius of 3 fails to encode rings and there is no way to determine ring membership in a substructure SMILES, including them considerably increases performance. |
| kekulize | ```True``` | Whether or not to kekulize the molecule before extracting substructure SMILES. |
| sanitize | ```True``` | Whether or not to sanitize the molecule (using RDKit) before extracting substructure SMILES. |
**Returns** a `numpy.ndarray` containing the fingerprint hashes.
```Python
MHFPEncoder.encode_mol(in_mol, radius = 3, rings = True, kekulize = True)
```
| Parameter | Default | Description |
|---|---|---|
| in_mol | | The input RDKit molecule instance. |
| radius | ```3``` | Analogous to the radius for the Morgan fingerprint. The default radius 3 encodes SMILES to MHFP6. |
| rings | ```True``` | Whether rings in the molecule are included in the fingerprint. As a radius of 3 fails to encode rings and there is no way to determine ring membership in a substructure SMILES, including them considerably increases performance. |
| kekulize | ```True``` | Whether or not to kekulize the molecule before extracting substructure SMILES. |
**Returns** a `numpy.ndarray` containing the fingerprint hashes.
```Python
MHFPEncoder.distance(a, b)
```
| Parameter | Default | Description |
|---|---|---|
| a | | A `numpy.ndarray` containing fingerprint hashes. |
| b | | A `numpy.ndarray` containing fingerprint hashes. |
**Returns** a `float` representing the distance between two MHFP encoded molecules.
```Python
MHFPEncoder.secfp_from_mol(in_mol, length=2048, radius=3, rings=True, kekulize=True)
```
| Parameter | Default | Description |
|---|---|---|
| in_mol | | The input RDKit molecule instance. |
| length | ```2048``` | The length of the folded fingerprint. |
| radius | ```3``` | Analogous to the radius for the Morgan fingerprint. The default radius 3 encodes SMILES to MHFP6. |
| rings | ```True``` | Whether rings in the molecule are included in the fingerprint. As a radius of 3 fails to encode rings and there is no way to determine ring membership in a substructure SMILES, including them considerably increases performance. |
| kekulize | ```True``` | Whether or not to kekulize the molecule before extracting substructure SMILES. |
**Returns** a `numpy.ndarray` containing the fingerprint values.
```Python
MHFPEncoder.secfp_from_smiles(in_smiles, length=2048, radius=3, rings=True, kekulize=True, sanitize=False)
```
| Parameter | Default | Description |
|---|---|---|
| in_smiles | | The input SMILES. |
| length | ```2048``` | The length of the folded fingerprint. |
| radius | ```3``` | Analogous to the radius for the Morgan fingerprint. The default radius 3 encodes SMILES to MHFP6. |
| rings | ```True``` | Whether rings in the molecule are included in the fingerprint. As a radius of 3 fails to encode rings and there is no way to determine ring membership in a substructure SMILES, including them considerably increases performance. |
| kekulize | ```True``` | Whether or not to kekulize the molecule before extracting substructure SMILES. |
| sanitize | ```False``` | Whether or not to sanitize the SMILES when parsing it using RDKit. |
**Returns** a `numpy.ndarray` containing the fingerprint values.
### LSHForestHelper
If you know your way around Python, I suggest you use the ```LSHForest``` class directly. See code for details.
```Python
__init__(dims = 2048, n_prefix_trees = 64)
```
| Parameter | Default | Description |
|---|---|---|
| dims | ```2048``` | **Has to be equal to the number of permutations** `n_permutations` **set when initializing** `MHFPEncoder` |
| n_prefix_trees | ```64``` | The number of prefix trees. A lower number reduces memory usage but also lowers the quality of the results. |
```Python
LSHForestHelper.add(key, mhfp)
```
| Parameter | Default | Description |
|---|---|---|
| key | | The key that will be retrieved by a query. It's recommended to use integers. |
| mhfp | | The `numpy.ndarray` containing the fingerprint hashes. |
```Python
LSHForestHelper.index()
```
**Has to be run after adding entities and before running queries!**
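
One way to picture why an indexing step is needed: in a typical LSH forest, each prefix tree holds a slice of every signature, and queries rely on those slices being sorted so that entries sharing a prefix sit next to each other. The toy class below is a sketch of that general idea only. `TinyLSHForest` and its internals are hypothetical and are not the `LSHForest` or `LSHForestHelper` implementation.

```python
from bisect import bisect_left


class TinyLSHForest:
    """Toy LSH forest: one sorted list of signature chunks per prefix tree."""

    def __init__(self, dims=8, n_prefix_trees=2):
        if dims % n_prefix_trees != 0:
            raise ValueError("dims must be divisible by n_prefix_trees")
        self.n_trees = n_prefix_trees
        self.chunk = dims // n_prefix_trees
        self.trees = [[] for _ in range(n_prefix_trees)]
        self.is_indexed = False

    def _chunks(self, sig):
        # Split the signature into one tuple per tree.
        return [
            tuple(sig[t * self.chunk:(t + 1) * self.chunk])
            for t in range(self.n_trees)
        ]

    def add(self, key, sig):
        for tree, chunk in zip(self.trees, self._chunks(sig)):
            tree.append((chunk, key))
        self.is_indexed = False

    def index(self):
        # Sorting makes all entries with a common prefix contiguous.
        for tree in self.trees:
            tree.sort()
        self.is_indexed = True

    def query(self, sig, k):
        if not self.is_indexed:
            raise RuntimeError("call index() before query()")
        found = []
        # Start with the longest prefix and shorten it until k keys are found.
        for depth in range(self.chunk, 0, -1):
            for tree, chunk in zip(self.trees, self._chunks(sig)):
                prefix = chunk[:depth]
                pos = bisect_left(tree, (prefix,))
                while pos < len(tree) and tree[pos][0][:depth] == prefix:
                    if tree[pos][1] not in found:
                        found.append(tree[pos][1])
                    pos += 1
            if len(found) >= k:
                break
        return found[:k]


forest = TinyLSHForest(dims=8, n_prefix_trees=2)
forest.add(0, [1, 2, 3, 4, 5, 6, 7, 8])
forest.add(1, [1, 2, 9, 9, 5, 0, 0, 0])
forest.add(2, [7, 7, 7, 7, 7, 7, 7, 7])
forest.index()
print(forest.query([1, 2, 3, 4, 5, 6, 0, 0], 1))  # [0]
print(forest.query([1, 2, 3, 4, 5, 6, 0, 0], 2))  # [0, 1]
```

In this sketch, adding an entry marks the forest as unindexed again, which mirrors the requirement to re-run the indexing step before querying.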
```Python
LSHForestHelper.query(query_mhfp, k, data, kc=10)
```
| Parameter | Default | Description |
|---|---|---|
| query_mhfp | | A `numpy.ndarray` containing fingerprint hashes. |
| k | | The number of nearest neighbors to be retrieved. |
| data | | Data mapping keys to fingerprints. See LSH Forest example for details. |
| kc | ```10``` | The forest first retrieves k · kc approximate nearest neighbor (ANN) results. The top k are then selected from these k · kc candidates using a linear scan. |
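
The second stage described for `kc` can be sketched as follows. The sketch assumes the candidate keys have already been retrieved by the forest and that the distance is the fraction of differing positions; the helper names `refine` and `distance` are hypothetical. It also shows why `data` is needed: the linear scan has to look up the fingerprint behind each candidate key.

```python
def distance(a, b):
    """Fraction of positions in which two signatures differ."""
    return 1.0 - sum(x == y for x, y in zip(a, b)) / len(a)


def refine(query_fp, candidate_keys, data, k):
    """Linear scan over the candidates: keep the k closest to the query."""
    ranked = sorted(candidate_keys, key=lambda key: distance(query_fp, data[key]))
    return ranked[:k]


# Toy fingerprints; list indices serve as keys, as in the LSH Forest example.
data = [[1, 2, 3, 4], [1, 2, 0, 0], [9, 9, 9, 9]]
candidates = [2, 1, 0]  # stands in for the k * kc keys returned by the forest

print(refine([1, 2, 3, 4], candidates, data, k=2))  # [0, 1]
```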