molpipeline


Name: molpipeline
Version: 0.9.1
Summary: Integration of rdkit functionality into sklearn pipelines.
Upload time: 2024-11-21 12:32:06
Authors: Christian W. Feldmann, Jennifer Hemmerich, Jochen Sieg

# MolPipeline
MolPipeline is a Python package for processing molecules with RDKit in scikit-learn.
<br/><br/>
<p align="center"><img src=".github/molpipeline.png" height="200"/></p>

## Background

The [scikit-learn](https://scikit-learn.org/) package provides a large variety of machine
learning algorithms and data processing tools, among which is the `Pipeline` class, allowing users to
prepend custom data processing steps to the machine learning model.
`MolPipeline` extends this concept to the field of cheminformatics by
wrapping standard [RDKit](https://www.rdkit.org/) functionality, such as reading and writing SMILES strings
or calculating molecular descriptors from a molecule object.

MolPipeline aims to provide:

- Automated end-to-end processing from molecule data sets to deployable machine learning models.
- Scalable parallel processing and low memory usage through instance-based processing.
- Standard pipeline building blocks for flexibly building custom pipelines for various
cheminformatics tasks.
- Consistent error handling for tracking, logging, and replacing failed instances (e.g., a
SMILES string that could not be parsed correctly).
- Integrated and self-contained pipeline serialization for easy deployment and tracking
in version control.
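
The instance-based processing and error-handling goals can be pictured with a small, library-independent sketch. It is not MolPipeline's implementation: `FailedInstance`, `apply_step`, and `parse_toy` are hypothetical names, and the toy parser only stands in for a real SMILES parser. Each input is handled one at a time, a failure is recorded in place rather than aborting the run, and failed positions are replaced by a fill value at the end.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class FailedInstance:
    """Placeholder for an input that a step could not process (hypothetical)."""

    step: str
    reason: str


def apply_step(name: str, func: Callable, items: Iterable) -> Iterator:
    """Lazily apply func to each item, one instance at a time."""
    for item in items:
        if isinstance(item, FailedInstance):
            yield item  # pass earlier failures through untouched
            continue
        try:
            yield func(item)
        except ValueError as err:
            yield FailedInstance(step=name, reason=str(err))


def parse_toy(text: str) -> str:
    """Toy stand-in for a parser: rejects empty or unbalanced input."""
    if not text or text.count("(") != text.count(")"):
        raise ValueError(f"cannot parse {text!r}")
    return text


inputs = ["CCO", "C(C", "c1ccccc1"]
parsed = apply_step("parse", parse_toy, inputs)
features = list(apply_step("length", len, parsed))  # toy feature: string length

# Track which inputs failed, then replace them with a fill value.
failed_positions = [i for i, v in enumerate(features) if isinstance(v, FailedInstance)]
filled = [float("nan") if isinstance(v, FailedInstance) else v for v in features]
```

Because the steps are generators, only one instance is in flight per step at any time, and the output keeps the same length and order as the input.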

## Publications

[Sieg J, Feldmann CW, Hemmerich J, Stork C, Sandfort F, Eiden P, and Mathea M, MolPipeline: A Python package for processing
molecules with RDKit in scikit-learn, J. Chem. Inf. Model., doi:10.1021/acs.jcim.4c00863, 2024](https://doi.org/10.1021/acs.jcim.4c00863)
\
Further links: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/661fec7f418a5379b00ae036)

[Feldmann CW, Sieg J, and Mathea M, Analysis of uncertainty of neural
fingerprint-based models, 2024](https://doi.org/10.1039/D4FD00095A)
\
Further links: [repository](https://github.com/basf/neural-fingerprint-uncertainty)

## Installation
```commandline
pip install molpipeline
```

## Documentation

The [notebooks](notebooks) folder contains basic and advanced examples of how to use MolPipeline.

For an introduction to basic usage, see the [01_getting_started_with_molpipeline notebook](notebooks/01_getting_started_with_molpipeline.ipynb).

## Quick Start

### Model building

Create a fingerprint-based prediction model:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToMorganFP
from molpipeline.mol2mol import (
    ElementFilter,
    SaltRemover,
)

from sklearn.ensemble import RandomForestRegressor

# set up the pipeline
pipeline = Pipeline(
    [
        ("auto2mol", AutoToMol()),  # reading molecules
        ("element_filter", ElementFilter()),  # standardization
        ("salt_remover", SaltRemover()),  # standardization
        ("morgan2_2048", MolToMorganFP(n_bits=2048, radius=2)),  # fingerprints and featurization
        ("RandomForestRegressor", RandomForestRegressor()),  # machine learning model
    ],
    n_jobs=4,
)

# fit the pipeline
pipeline.fit(X=["CCCCCC", "c1ccccc1"], y=[0.2, 0.4])
# make predictions from SMILES strings
pipeline.predict(["CCC"])
# example output: array([0.29])
# (the exact value varies between runs because the random forest is not seeded)
```

### Feature calculation

Calculating molecular descriptors from SMILES strings is straightforward. For example, physicochemical properties can
be calculated like this:
```python
from molpipeline import Pipeline
from molpipeline.any2mol import AutoToMol
from molpipeline.mol2any import MolToRDKitPhysChem

pipeline_physchem = Pipeline(
    [
        ("auto2mol", AutoToMol()),
        (
            "physchem",
            MolToRDKitPhysChem(
                standardizer=None,
                descriptor_list=["HeavyAtomMolWt", "TPSA", "NumHAcceptors"],
            ),
        ),
    ],
    n_jobs=-1,
)
physchem_matrix = pipeline_physchem.transform(["CCCCCC", "c1ccccc1(O)"])
physchem_matrix
# output: array([[72.066,  0.   ,  0.   ],
#                [88.065, 20.23 ,  1.   ]])
```
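
The `HeavyAtomMolWt` column can be checked by hand, since this descriptor is the average molecular weight with hydrogens ignored. The sketch below is plain Python, not a MolPipeline or RDKit call; `heavy_atom_weight` is a hypothetical helper, and the atomic weights (12.011 for carbon, 15.999 for oxygen) are assumed standard values.

```python
# Assumed standard atomic weights for the two elements in the example.
ATOMIC_WEIGHT = {"C": 12.011, "O": 15.999}


def heavy_atom_weight(heavy_atoms: dict) -> float:
    """Sum atomic weights over non-hydrogen atoms, rounded to 3 decimals."""
    total = sum(ATOMIC_WEIGHT[element] * count for element, count in heavy_atoms.items())
    return round(total, 3)


hexane = heavy_atom_weight({"C": 6})          # CCCCCC: six carbons
phenol = heavy_atom_weight({"C": 6, "O": 1})  # c1ccccc1(O): six carbons, one oxygen
print(hexane, phenol)  # 72.066 88.065
```

Both values agree with the first column of the matrix shown above.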

MolPipeline provides further features and descriptors from RDKit, 
for example Morgan (binary/count) fingerprints and MACCS keys.
See the [04_feature_calculation notebook](notebooks/04_feature_calculation.ipynb) for more examples.

### Clustering

MolPipeline provides several clustering algorithms as sklearn-like estimators. For example, molecules can be
clustered by their Murcko scaffold. See the [02_scaffold_split_with_custom_estimators notebook](notebooks/02_scaffold_split_with_custom_estimators.ipynb) for scaffold splits and further examples.

```python
from molpipeline.estimators import MurckoScaffoldClustering

scaffold_smiles = [
    "Nc1ccccc1",
    "Cc1cc(Oc2nccc(CCC)c2)ccc1",
    "c1ccccc1",
]
linear_smiles = ["CC", "CCC", "CCCN"]

# run the scaffold clustering
scaffold_clustering = MurckoScaffoldClustering(
    make_generic=False, linear_molecules_strategy="own_cluster", n_jobs=16
)
scaffold_clustering.fit_predict(scaffold_smiles + linear_smiles)
# output: array([1., 0., 1., 2., 2., 2.])
```
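
To read the label array, it can help to group the input SMILES by cluster label. The following is a plain-Python sketch that copies the labels from the example output above rather than calling the estimator; the variable names are illustrative only.

```python
from collections import defaultdict

smiles = [
    "Nc1ccccc1",
    "Cc1cc(Oc2nccc(CCC)c2)ccc1",
    "c1ccccc1",
    "CC",
    "CCC",
    "CCCN",
]
# Labels copied from the example output above.
labels = [1.0, 0.0, 1.0, 2.0, 2.0, 2.0]

# Collect the SMILES strings that share a cluster label.
clusters = defaultdict(list)
for smi, label in zip(smiles, labels):
    clusters[int(label)].append(smi)

for label in sorted(clusters):
    print(label, clusters[label])
```

In this example, aniline and benzene share the benzene scaffold and fall into cluster 1, while the three linear molecules share cluster 2.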


## License

This software is licensed under the MIT license. See the [LICENSE](LICENSE) file for details.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "molpipeline",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Christian W. Feldmann, Jennifer Hemmerich, Jochen Sieg",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/cc/cc/5449d61b90e97a15c1b49a7eff9b299311ec0f64423d366467cbe77f1fb0/molpipeline-0.9.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Integration of rdkit functionality into sklearn pipelines.",
    "version": "0.9.1",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3d7b01ffe5732602cbbac15642dbd33bb06b44b8bfe243626318d9a7997b44f4",
                "md5": "0ed2d0be443b7cf8d50b41d18fe231a0",
                "sha256": "3f53b946bddc38a54870639d30e9e5bfd2b0a6c2f126d7d6944007945db810fd"
            },
            "downloads": -1,
            "filename": "molpipeline-0.9.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0ed2d0be443b7cf8d50b41d18fe231a0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 174780,
            "upload_time": "2024-11-21T12:32:05",
            "upload_time_iso_8601": "2024-11-21T12:32:05.211575Z",
            "url": "https://files.pythonhosted.org/packages/3d/7b/01ffe5732602cbbac15642dbd33bb06b44b8bfe243626318d9a7997b44f4/molpipeline-0.9.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cccc5449d61b90e97a15c1b49a7eff9b299311ec0f64423d366467cbe77f1fb0",
                "md5": "63535f5915610e91e45d7619767cedd2",
                "sha256": "973887e74292007adc6f071e997329592cdd24c6f3c6a33be83d616527bee484"
            },
            "downloads": -1,
            "filename": "molpipeline-0.9.1.tar.gz",
            "has_sig": false,
            "md5_digest": "63535f5915610e91e45d7619767cedd2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 120988,
            "upload_time": "2024-11-21T12:32:06",
            "upload_time_iso_8601": "2024-11-21T12:32:06.414900Z",
            "url": "https://files.pythonhosted.org/packages/cc/cc/5449d61b90e97a15c1b49a7eff9b299311ec0f64423d366467cbe77f1fb0/molpipeline-0.9.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-21 12:32:06",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "molpipeline"
}