# MolPipeline
MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.
<br/><br/>
<p align="center"><img src=".github/molpipeline.png" height="200"/></p>

## Background
The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
prepend custom data processing steps to the machine learning model.
`MolPipeline` extends this concept to the field of cheminformatics by
wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
or calculating molecular descriptors from a molecule object.

MolPipeline aims to provide:
- Automated end-to-end processing from molecule data sets to deployable machine learning models.
- Scalable parallel processing and low memory usage through instance-based processing.
- Standard pipeline building blocks for flexibly building custom pipelines for various
cheminformatics tasks.
- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
SMILES string that could not be parsed correctly).
- Integrated and self-contained pipeline serialization for easy deployment and tracking
in version control.
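
The instance-based processing and error handling described in the list above can be illustrated with a small standard-library sketch. This is not MolPipeline's API: `InstanceResult`, `process_instances`, and the toy "parser" and "descriptor" functions are hypothetical names invented for this illustration. It only shows the general idea of processing each input on its own, recording failures, and replacing them with a fill value instead of aborting the whole run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class InstanceResult:
    """Outcome of processing one input instance (hypothetical helper)."""

    index: int
    value: float
    error: Optional[str] = None


def process_instances(
    inputs: Sequence[str],
    steps: Sequence[Callable],
    fill_value: float = float("nan"),
) -> List[InstanceResult]:
    """Run all steps on each instance; record failures instead of aborting."""
    results = []
    for index, item in enumerate(inputs):
        try:
            value = item
            for step in steps:
                value = step(value)
            results.append(InstanceResult(index, value))
        except ValueError as exc:
            # A failed instance is tracked and replaced by the fill value.
            results.append(InstanceResult(index, fill_value, error=str(exc)))
    return results


def parse_carbon_chain(smiles: str) -> int:
    """Toy parser: accepts only strings made of 'C' and returns the chain length."""
    if not smiles or set(smiles) != {"C"}:
        raise ValueError(f"cannot parse {smiles!r}")
    return len(smiles)


def chain_mass(n_carbons: int) -> float:
    """Toy descriptor: approximate mass of the linear alkane C_nH_(2n+2)."""
    return 12.011 * n_carbons + 1.008 * (2 * n_carbons + 2)


results = process_instances(
    ["CC", "not-a-smiles", "CCC"], [parse_carbon_chain, chain_mass]
)
values = [r.value for r in results]
failed = [r.index for r in results if r.error is not None]
```

Here the output keeps one entry per input, so downstream code can align predictions with the original data set even when some inputs fail.
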
## Publications
[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A Python package for processing
molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
\
Further links: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)

[Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
fingerprint-based models, 2024](https://doi.org/10.1039/D4FD00095A)
\
Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)
## Installation
```commandline
pip install molpipeline
```
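
To check from Python whether the installation succeeded, a minimal sketch using only the standard library could look like the following. The helper name `installed_version` is hypothetical and chosen for this example; `importlib.metadata` is part of the standard library in Python 3.8 and later.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if molpipeline is installed, otherwise None.
print(installed_version("molpipeline"))
```
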
## Documentation
The [notebooks](notebooks) folder contains basic and advanced examples of how to use MolPipeline.
For an introduction to basic usage, see the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).
## Quick Start
### Model building
Create a fingerprint-based prediction model:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP
from molpipeline.mol2mol import (
    ElementFilter,
    SaltRemover,
)
from sklearn.ensemble import RandomForestRegressor

# set up pipeline
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),  # reading molecules
        ("element_filter", ElementFilter()),  # standardization
        ("salt_remover", SaltRemover()),  # standardization
        ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2)),  # fingerprints and featurization
        ("RandomForestRegressor", RandomForestRegressor()),  # machine learning model
    ],
    n_jobs=4,
)

# fit the pipeline
pipeline.fit(X=["CCCCCC", "c1ccccc1"], y=[0.2, 0.4])
# make predictions from SMILES strings
pipeline.predict(["CCC"])
# example output (the exact value varies between runs, since no random_state is set): array([0.29])
```
### Feature calculation
Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
be calculated like this:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToRDKitPhysChem

pipeline_physchem = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "physchem",
            MolToRDKitPhysChem(
                standardizer=None,
                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
            ),
        ),
    ],
    n_jobs=-1,
)
physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
physchem_matrix
# output: array([[72.066, 0. , 0. ],
# [88.065, 20.23 , 1. ]])
```
MolPipeline provides further features and descriptors from RDKit,
for example Morgan (binary/count) fingerprints and MACCS keys.
See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.
### Clustering
MolPipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be
clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffold splits and further examples.
```python
from molpipeline.estimators import MurckoScaffoldClustering

scaffold_smiles = [
    "Nc1ccccc1",
    "Cc1cc(Oc2nccc(CCC)c2)ccc1",
    "c1ccccc1",
]
linear_smiles = ["CC", "CCC", "CCCN"]

# run the scaffold clustering
scaffold_clustering = MurckoScaffoldClustering(
    make_generic=False, linear_molecules_strategy="own_cluster", n_jobs=16
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles)
# output: array([1., 0., 1., 2., 2., 2.])
```
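
Cluster labels like the ones above are typically used to build a scaffold split, where all molecules sharing a cluster end up on the same side of the train/test boundary. The following is a minimal standard-library sketch of that idea, not the approach used in the notebook: the function `group_split` and its smallest-clusters-first heuristic are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def group_split(
    cluster_labels: Sequence[float], test_fraction: float = 0.3
) -> Tuple[List[int], List[int]]:
    """Assign whole clusters to train or test; returns (train_idx, test_idx)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        clusters[label].append(idx)

    # Heuristic (an assumption of this sketch): fill the test set with the
    # smallest clusters first, without exceeding the requested test size.
    n_test_target = test_fraction * len(cluster_labels)
    train_idx, test_idx = [], []
    for members in sorted(clusters.values(), key=len):
        if len(test_idx) + len(members) <= n_test_target:
            test_idx.extend(members)
        else:
            train_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)


# Labels taken from the clustering example output above.
labels = [1.0, 0.0, 1.0, 2.0, 2.0, 2.0]
train_idx, test_idx = group_split(labels, test_fraction=0.34)
```

Because clusters are never divided, no scaffold appears in both the training and the test set, which is the property a scaffold split is meant to guarantee.
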
## License
This software is licensed under the MIT license. See the [LICENSE](LICENSE) file for details.