# MolPipeline
MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.
<br/><br/>
<p align="center"><img src=".github/molpipeline.png" height="200"/></p>
## Background
The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
prepend custom data processing steps to the machine learning model.
`MolPipeline` extends this concept to the field of cheminformatics by
wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
or calculating molecular descriptors from a molecule-object.
MolPipeline aims to provide:
- Automated end-to-end processing from molecule data sets to deployable machine learning models.
- Scalable parallel processing and low memory usage through instance-based processing.
- Standard pipeline building blocks for flexibly building custom pipelines for various
cheminformatics tasks.
- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
SMILES string that could not be parsed correctly).
- Integrated and self-contained pipeline serialization for easy deployment and tracking
in version control.
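
The instance-based processing and error-handling aims above can be pictured with a plain-Python sketch. It does not use MolPipeline's API: `INVALID`, `parse`, `count_carbons`, and `run_instance_wise` are hypothetical names that exist only for this illustration. Each input passes through all steps on its own, and a failing input is replaced by a marker instead of aborting the whole batch.

```python
from typing import Any, Callable, Iterable, Iterator, List

# Hypothetical marker for a failed instance (not a MolPipeline object).
INVALID = None


def parse(text: Any) -> str:
    """Toy stand-in for SMILES parsing: reject empty or non-string input."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError(f"cannot parse {text!r}")
    return text.strip()


def count_carbons(text: str) -> int:
    """Toy featurizer: count the characters 'C' and 'c'."""
    return sum(ch in "Cc" for ch in text)


def run_instance_wise(
    items: Iterable[Any], steps: List[Callable[[Any], Any]]
) -> Iterator[Any]:
    """Send each item through all steps before touching the next item."""
    for item in items:
        value = item
        try:
            for step in steps:
                value = step(value)
        except ValueError:
            # Replace the failed instance so the output stays aligned with the input.
            value = INVALID
        yield value


results = list(run_instance_wise(["CCCCCC", "", "c1ccccc1"], [parse, count_carbons]))
print(results)  # [6, None, 6]
```

Because the generator yields one result per input, only a single intermediate value is held in memory at a time, and the position of a failed instance is preserved in the output.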
## Publications
[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A Python package for processing
molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
\
Further links: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)
[Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
fingerprint-based models, 2024](https://doi.org/10.1039/D4FD00095A)
\
Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)
# Table of Contents
- [MolPipeline](#molpipeline)
  - [Background](#background)
  - [Publications](#publications)
- [Table of Contents](#table-of-contents)
  - [Installation](#installation)
  - [Documentation](#documentation)
  - [Quick Start](#quick-start)
    - [Model building](#model-building)
    - [Feature calculation](#feature-calculation)
    - [Clustering](#clustering)
    - [Explainability](#explainability)
  - [License](#license)
## Installation
```commandline
pip install molpipeline
```
## Documentation
The [notebooks](notebooks) folder contains basic and advanced examples of how to use MolPipeline.
A good introduction to basic usage is the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).
## Quick Start
### Model building
Create a fingerprint-based prediction model:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP
from molpipeline.mol2mol import (
    ElementFilter,
    SaltRemover,
)
from sklearn.ensemble import RandomForestRegressor
# set up pipeline
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),  # reading molecules
        ("element_filter", ElementFilter()),  # standardization
        ("salt_remover", SaltRemover()),  # standardization
        ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2)),  # fingerprints and featurization
        ("RandomForestRegressor", RandomForestRegressor()),  # machine learning model
    ],
    n_jobs=4,
)
# fit the pipeline
pipeline.fit(X=["CCCCCC", "c1ccccc1"], y=[0.2, 0.4])
# make predictions from SMILES strings
pipeline.predict(["CCC"])
# output (varies between runs because no random_state is set), e.g.: array([0.29])
```
### Feature calculation
Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
be calculated like this:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToRDKitPhysChem
pipeline_physchem = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "physchem",
            MolToRDKitPhysChem(
                standardizer=None,
                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
            ),
        ),
    ],
    n_jobs=-1,
)
physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
physchem_matrix
# output: array([[72.066,  0.   ,  0.   ],
#                [88.065, 20.23 ,  1.   ]])
```
MolPipeline provides further features and descriptors from RDKit,
for example Morgan (binary/count) fingerprints and MACCS keys.
See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.
### Clustering
MolPipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be
clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffold splits and further examples.
```python
from molpipeline.estimators import MurckoScaffoldClustering
scaffold_smiles = [
    "Nc1ccccc1",
    "Cc1cc(Oc2nccc(CCC)c2)ccc1",
    "c1ccccc1",
]
linear_smiles = ["CC", "CCC", "CCCN"]
# run the scaffold clustering
scaffold_clustering = MurckoScaffoldClustering(
    make_generic=False, linear_molecules_strategy="own_cluster", n_jobs=16
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles)
# output: array([1., 0., 1., 2., 2., 2.])
```
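
The returned array holds one cluster label per input molecule, in input order. A minimal sketch of regrouping the SMILES by label follows. It hard-codes the labels from the example output above instead of calling MolPipeline, and `group_by_label` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

smiles = ["Nc1ccccc1", "Cc1cc(Oc2nccc(CCC)c2)ccc1", "c1ccccc1", "CC", "CCC", "CCCN"]
# Labels copied from the example output above.
labels = [1.0, 0.0, 1.0, 2.0, 2.0, 2.0]


def group_by_label(items, cluster_labels):
    """Map each cluster label to the list of items assigned to it."""
    if len(items) != len(cluster_labels):
        raise ValueError("items and cluster_labels must have the same length")
    groups = defaultdict(list)
    for item, label in zip(items, cluster_labels):
        groups[int(label)].append(item)
    return dict(groups)


clusters = group_by_label(smiles, labels)
for label in sorted(clusters):
    print(label, clusters[label])
```

Grouping of this kind is one way to prepare a scaffold split, where all members of a cluster are kept together in either the training or the test set.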
### Explainability
Machine learning model pipelines can be explained using the `explainability` module. MolPipeline uses the
[SHAP](https://github.com/shap/shap) library to compute Shapley values for explanations. The Shapley values can be
mapped to the molecular structure to visualize the importance of atoms for the prediction.
<p align="center"><img src=".github/xai_example.png" height="500"/></p>
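
For intuition about what a Shapley value is, the following sketch computes exact Shapley values for a toy three-feature model by averaging marginal contributions over all feature orderings. It is independent of SHAP and MolPipeline: `toy_model`, `shapley_values`, and the feature names are made up for illustration, and practical explainers use much more efficient algorithms.

```python
from itertools import permutations
from math import factorial


def toy_model(present: frozenset) -> float:
    """Toy prediction from the set of 'present' substructure features."""
    score = 0.0
    if "ring" in present:
        score += 0.3
    if "amine" in present:
        score += 0.1
    # Interaction term: an extra effect only when both features co-occur.
    if "ring" in present and "amine" in present:
        score += 0.2
    return score


def shapley_values(features, model):
    """Average each feature's marginal contribution over all orderings."""
    totals = {name: 0.0 for name in features}
    for order in permutations(features):
        present = frozenset()
        for name in order:
            with_feature = present | {name}
            totals[name] += model(with_feature) - model(present)
            present = with_feature
    n_orderings = factorial(len(features))
    return {name: total / n_orderings for name, total in totals.items()}


features = ["ring", "amine", "methyl"]
values = shapley_values(features, toy_model)
for name in features:
    print(f"{name}: {values[name]:.3f}")
# ring: 0.400
# amine: 0.200
# methyl: 0.000
```

The interaction effect of 0.2 is shared equally between `ring` and `amine`, and the values sum to the difference between the full prediction and the empty baseline. This additivity is what allows a prediction to be distributed over features and, from there, over atoms.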
[advanced_03_introduction_to_explainable_ai notebook](notebooks/advanced_03_introduction_to_explainable_ai.ipynb)
<a target="_blank" href="https://colab.research.google.com/github/basf/MolPipeline/blob/main/notebooks/advanced_03_introduction_to_explainable_ai.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
gives a detailed introduction to explainability. The notebook also compares explanations of tree-based models with those of neural networks
using the structure-activity relationship (SAR) data from [Harren et al. 2022](https://pubs.acs.org/doi/10.1021/acs.jcim.1c01263).
Use the following example code to explain a model's predictions and visualize the explanation as heatmaps.
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP
from molpipeline.experimental.explainability import (
    SHAPTreeExplainer,
    structure_heatmap_shap,
)
from sklearn.ensemble import RandomForestRegressor
X = ["CCCCCC", "c1ccccc1"]
y = [0.2, 0.4]
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2)),
        ("RandomForest", RandomForestRegressor()),
    ],
    n_jobs=4,
)
pipeline.fit(X, y)
# explain the model
explainer = SHAPTreeExplainer(pipeline)
explanations = explainer.explain(X)
# visualize the explanation
image = structure_heatmap_shap(explanation=explanations[0])
image.save("explanation.png")
```
Note that the explainability module is fully functional but lives in the `experimental` directory because its API might still change.
## License
This software is licensed under the MIT license. See the [LICENSE](LICENSE) file for details.