# scikit-mol
![Fancy logo](./ressources/logo/ScikitMol_Logo_DarkBG_300px.png#gh-dark-mode-only)
![Fancy logo](./ressources/logo/ScikitMol_Logo_LightBG_300px.png#gh-light-mode-only)
## Scikit-Learn classes for molecular vectorization using RDKit
The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.

For example, with the needed scikit-learn and scikit-mol imports, and with RDKit mol objects in the `mol_list_train` and `mol_list_test` lists:
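
    from rdkit import Chem
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import Pipeline
    from scikit_mol.fingerprints import MorganFingerprintTransformer

    pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
    pipe.fit(mol_list_train, y_train)
    pipe.score(mol_list_test, y_test)
    pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])

    >>> array([4.93858815])
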
The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.

The first draft of the project was created at the [RDKit UGM 2022 hackathon](https://github.com/rdkit/UGM_2022) on 14 October 2022.
## Implemented
- Descriptors
    - MolecularDescriptorTransformer

<br>

- Fingerprints
    - MorganFingerprintTransformer
    - MACCSKeysFingerprintTransformer
    - RDKitFingerprintTransformer
    - AtomPairFingerprintTransformer
    - TopologicalTorsionFingerprintTransformer
    - MHFingerprintTransformer
    - SECFingerprintTransformer
    - AvalonFingerprintTransformer

<br>

- Conversions
    - SmilesToMol

<br>

- Standardizer
    - Standardizer

<br>

- safeinference
    - SafeInferenceWrapper
    - set_safe_inference_mode

<br>

- Utilities
    - CheckSmilesSanitazion

## Installation
Users can install the latest tagged release from PyPI with pip:

    pip install scikit-mol

or from conda-forge:

    conda install -c conda-forge scikit-mol

The conda-forge package should be updated shortly after a new tagged release on PyPI.

For the bleeding edge, install directly from the GitHub repository:

    pip install git+https://github.com/EBjerrum/scikit-mol.git

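
To confirm which version ended up in the active environment, a small check along these lines may help. It is only a sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if scikit-mol is not installed here.
print(installed_version("scikit-mol"))
```
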
## Documentation
There is a collection of notebooks in the notebooks directory that demonstrate different aspects and use cases:

- [Basic Usage and fingerprint transformers](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/01_basic_usage.ipynb)
- [Descriptor transformer](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/02_descriptor_transformer.ipynb)
- [Pipelining with Scikit-Learn classes](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/03_example_pipeline.ipynb)
- [Molecular standardization](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/04_standardizer.ipynb)
- [Sanitizing SMILES input](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/05_smiles_sanitaztion.ipynb)
- [Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/06_hyperparameter_tuning.ipynb)
- [Using parallel execution to speed up descriptor and fingerprint calculations](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/07_parallel_transforms.ipynb)
- [Using skopt for hyperparameter tuning](https://github.com/EBjerrum/scikit-mol/tree/main/notebooks/08_external_library_skopt.ipynb)
- [Testing different fingerprints as part of the hyperparameter optimization](https://github.com/EBjerrum/scikit-mol/blob/main/notebooks/09_Combinatorial_Method_Usage_with_FingerPrint_Transformers.ipynb)
- [Using pandas output for easy feature importance analysis and for combining pre-existing values with new computations](https://github.com/EBjerrum/scikit-mol/blob/main/notebooks/10_pipeline_pandas_output.ipynb)
- [Working with pipelines and estimators in safe inference mode for handling prediction on batches with invalid SMILES or molecules](https://github.com/EBjerrum/scikit-mol/blob/main/notebooks/11_safe_inference.ipynb)
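
The idea behind predicting on batches that contain invalid entries can be illustrated without any cheminformatics dependencies. The sketch below is not scikit-mol's implementation or API: the function `predict_with_invalid_rows`, the validity check and the stand-in "model" are all hypothetical. It only shows the general pattern of predicting on the valid rows and returning NaN at the positions of the invalid ones, so that the output stays aligned with the input.

```python
import math
from typing import Callable, List, Optional, Sequence


def predict_with_invalid_rows(
    inputs: Sequence[Optional[str]],
    is_valid: Callable[[Optional[str]], bool],
    predict_valid: Callable[[List[str]], List[float]],
) -> List[float]:
    """Predict on the valid inputs only and return NaN for the invalid ones."""
    valid_positions = [i for i, item in enumerate(inputs) if is_valid(item)]
    predictions = [math.nan] * len(inputs)
    if valid_positions:
        valid_predictions = predict_valid([inputs[i] for i in valid_positions])
        # Write each prediction back to the position of the input it belongs to.
        for position, value in zip(valid_positions, valid_predictions):
            predictions[position] = value
    return predictions


# Hypothetical stand-ins: a trivial validity check and a "model" that
# simply returns the length of each string.
batch = ["CCO", None, "c1ccccc1"]
result = predict_with_invalid_rows(
    batch,
    is_valid=lambda s: isinstance(s, str) and len(s) > 0,
    predict_valid=lambda items: [float(len(s)) for s in items],
)
print(result)  # [3.0, nan, 8.0]
```
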

We have also published a software note on ChemRxiv: [https://doi.org/10.26434/chemrxiv-2023-fzqwd](https://doi.org/10.26434/chemrxiv-2023-fzqwd)
## Roadmap and Contributing
_Help wanted!_ Are you a PhD student who wants a "side quest" to procrastinate on your thesis writing, or are you simply interested in computational chemistry, cheminformatics, QSAR modelling, Python programming, or open-source software? Do you want to learn more about machine learning with scikit-learn? Or do you use scikit-mol in your current work and would like to give a little back to the project and see it improved as well?
With a little bit of help, this project can be improved much faster! Reach out to me (Esben) for a discussion about how we can proceed.

Currently we are working on fixing some deprecation warnings. It's not the most exciting work, but a little maintenance is important. Later on we need to go over the scikit-learn compatibility and update to some of the newer features of their estimator classes. We're also working on some feature enhancements and tests, such as new fingerprints and a more versatile standardizer.

There is more information about how to contribute to the project in [CONTRIBUTION.md](https://github.com/EBjerrum/scikit-mol/CONTRIBUTION.md).
## Bugs
There are probably still some. Please check the issues on GitHub and report any new ones there.
## Contributors
- Esben Jannik Bjerrum [@ebjerrum](https://github.com/ebjerrum), esbenbjerrum+scikit_mol@gmail.com
- Carmen Esposito [@cespos](https://github.com/cespos)
- Son Ha, sonha@uni-mainz.de
- Oh-hyeon Choung, ohhyeon.choung@gmail.com
- Andreas Poehlmann, [@ap--](https://github.com/ap--)
- Ya Chen, [@anya-chen](https://github.com/anya-chen)
- Rafał Bachorz [@rafalbachorz](https://github.com/rafalbachorz)
- Adrien Chaton [@adrienchaton](https://github.com/adrienchaton)
- [@VincentAlexanderScholz](https://github.com/VincentAlexanderScholz)
- [@RiesBen](https://github.com/RiesBen)
- [@enricogandini](https://github.com/enricogandini)
- [@mikemhenry](https://github.com/mikemhenry)
- [@c-feldmann](https://github.com/c-feldmann)