# scikit-fingerprints
[![PyPI version](https://badge.fury.io/py/scikit-fingerprints.svg)](https://badge.fury.io/py/scikit-fingerprints)
[![](https://img.shields.io/pypi/dm/scikit-fingerprints)](https://pypi.org/project/scikit-fingerprints/)
[![Downloads](https://static.pepy.tech/badge/scikit-fingerprints)](https://pepy.tech/project/scikit-fingerprints)
![Libraries.io dependency status for latest release](https://img.shields.io/librariesio/release/pypi/scikit-fingerprints)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE.md)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/scikit-fingerprints.svg)](https://pypi.org/project/scikit-fingerprints/)
[![Contributors](https://img.shields.io/github/contributors/scikit-fingerprints/scikit-fingerprints)](https://github.com/scikit-fingerprints/scikit-fingerprints/graphs/contributors)
[![check](https://github.com/scikit-fingerprints/scikit-fingerprints/actions/workflows/python-test.yml/badge.svg)](https://github.com/scikit-fingerprints/scikit-fingerprints/actions/workflows/python-test.yml)
[scikit-fingerprints](https://scikit-fingerprints.github.io/scikit-fingerprints/) is a Python library for efficient
computation of molecular fingerprints.
## Table of Contents
- [Description](#description)
- [Supported platforms](#supported-platforms)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Project overview](#project-overview)
- [Contributing](#contributing)
- [Citing](#citing)
- [License](#license)
---
## Description
Molecular fingerprints are crucial in various scientific fields, including drug discovery, materials science, and
chemical analysis. However, existing Python libraries for computing molecular fingerprints often lack performance,
user-friendliness, and support for modern programming standards. This project aims to address these shortcomings by
creating an efficient and accessible Python library for molecular fingerprint computation.
You can find the documentation [here](https://scikit-fingerprints.github.io/scikit-fingerprints/).
Main features:
- scikit-learn compatible
- feature-rich, with >30 fingerprints
- parallelization
- sparse matrix support
- commercial-friendly MIT license
## Supported platforms
| | `python3.9` | `python3.10` | `python3.11` | `python3.12` |
|----------------------|------------------------|--------------|--------------|--------------|
| **Ubuntu - latest** | ✅ | ✅ | ✅ | ✅ |
| **Windows - latest** | ✅ | ✅ | ✅ | ✅ |
| **macOS - latest** | only macOS 13 or newer | ✅ | ✅ | ✅ |
## Installation
You can install the library using pip:
```bash
pip install scikit-fingerprints
```
If you need bleeding-edge features and don't mind potentially unstable or undocumented functionalities,
you can also install directly from GitHub:
```bash
pip install git+https://github.com/scikit-fingerprints/scikit-fingerprints.git
```
## Quickstart
Most fingerprints are based on molecular graphs (2D-based), and you can use SMILES
input directly:
```python
from skfp.fingerprints import AtomPairFingerprint
smiles_list = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]
atom_pair_fingerprint = AtomPairFingerprint()
X = atom_pair_fingerprint.transform(smiles_list)
print(X)
```
For fingerprints using conformers (3D-based), you need to create molecules first
and compute conformers. Such fingerprints have the `requires_conformers` attribute set
to `True`.
```python
from skfp.preprocessing import ConformerGenerator, MolFromSmilesTransformer
from skfp.fingerprints import WHIMFingerprint
smiles_list = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]
mol_from_smiles = MolFromSmilesTransformer()
conf_gen = ConformerGenerator()
fp = WHIMFingerprint()
print(fp.requires_conformers) # True
mols_list = mol_from_smiles.transform(smiles_list)
mols_list = conf_gen.transform(mols_list)
X = fp.transform(mols_list)
print(X)
```
You can also use scikit-learn functionality, such as pipelines and feature unions,
to build complex workflows. Popular datasets, e.g. from the MoleculeNet benchmark,
can be loaded directly.
```python
from skfp.datasets.moleculenet import load_clintox
from skfp.metrics import multioutput_auroc_score, extract_pos_proba
from skfp.model_selection import scaffold_train_test_split
from skfp.fingerprints import ECFPFingerprint, MACCSFingerprint
from skfp.preprocessing import MolFromSmilesTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
smiles, y = load_clintox()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
    smiles, y, test_size=0.2
)
pipeline = make_pipeline(
    MolFromSmilesTransformer(),
    make_union(ECFPFingerprint(count=True), MACCSFingerprint()),
    RandomForestClassifier(random_state=0),
)
pipeline.fit(smiles_train, y_train)
y_pred_proba = pipeline.predict_proba(smiles_test)
y_pred_proba = extract_pos_proba(y_pred_proba)
auroc = multioutput_auroc_score(y_test, y_pred_proba)
print(f"AUROC: {auroc:.2%}")
```
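To make the idea behind a scaffold split more concrete, here is a minimal pure-Python sketch. It is an illustration only, not the implementation of `scaffold_train_test_split`: the function name `scaffold_style_split` is hypothetical, the scaffold keys are assumed to be precomputed (in practice they would be derived from the molecules, e.g. with RDKit), and sending the smallest groups to the test set is one common convention that is assumed here.

```python
from collections import defaultdict


def scaffold_style_split(items, scaffolds, test_size=0.2):
    """Split items so that no scaffold group is shared by train and test.

    `scaffolds[i]` is a precomputed scaffold key for `items[i]`
    (assumption: real code would derive it from the molecule itself).
    """
    if len(items) != len(scaffolds):
        raise ValueError("items and scaffolds must have the same length")

    # Group item indices by scaffold key.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Visit the smallest groups first, so rare scaffolds go to the test set.
    ordered = sorted(groups.values(), key=len)

    n_test_target = int(round(test_size * len(items)))
    train_idx, test_idx = [], []
    for group in ordered:
        # A whole group is assigned at once; it never straddles the split.
        if len(test_idx) + len(group) <= n_test_target:
            test_idx.extend(group)
        else:
            train_idx.extend(group)

    train = [items[i] for i in sorted(train_idx)]
    test = [items[i] for i in sorted(test_idx)]
    return train, test


# Toy data: ten placeholder molecules with made-up scaffold labels.
items = [f"m{i}" for i in range(10)]
scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, test = scaffold_style_split(items, scaffolds, test_size=0.2)
print(test)  # ['m8', 'm9']
```

Because whole groups are moved together, the test set contains only scaffolds that never appear in training, which tends to give a more demanding evaluation than a random split.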
## Project overview
`scikit-fingerprints` brings molecular fingerprints and related functionality into
the scikit-learn ecosystem. With a familiar class-based design and the `.transform()` method,
fingerprints can be computed from SMILES strings or RDKit `Mol` objects. The resulting NumPy
arrays or SciPy sparse arrays can be used directly in ML pipelines.
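The class-based `.transform()` design can be sketched with a toy example. The class below is hypothetical and not part of `scikit-fingerprints`: it only counts a few characters in each SMILES string, which is not a real fingerprint, and it assumes a stateless transformer where `fit()` has nothing to learn.

```python
class ToyCharCountFingerprint:
    """Hypothetical stateless transformer; NOT part of scikit-fingerprints."""

    def __init__(self, symbols=("C", "c", "N", "O", "S")):
        self.symbols = symbols

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fitting is a no-op.
        return self

    def transform(self, X):
        # One fixed-length row per input SMILES string.
        return [[smiles.count(sym) for sym in self.symbols] for smiles in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


smiles_list = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]
X = ToyCharCountFingerprint().transform(smiles_list)
print(X)  # [[2, 0, 0, 6, 2], [1, 6, 0, 3, 0]]
```

The point of the sketch is the shape of the interface: a list of molecules goes in, and one fixed-length feature row per molecule comes out, so the result can be passed to any downstream estimator.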
Main features:
1. **Scikit-learn compatible:** `scikit-fingerprints` uses the familiar scikit-learn
   interface and conforms to its API requirements. You can include molecular
   fingerprints in pipelines, concatenate them with feature unions, and process them with
   ML algorithms.
2. **Performance optimization:** both speed and memory usage are optimized by
   utilizing parallelism (with Joblib) and sparse CSR matrices (with SciPy). Heavy
   computation is typically delegated to the C++ code of RDKit.
3. **Feature-rich:** in addition to computing fingerprints, you can load popular
   benchmark datasets (e.g. from MoleculeNet), perform splitting (e.g. scaffold
   split), generate conformers, and tune hyperparameters with optimized cross-validation.
4. **Well-documented:** each public function and class has extensive documentation,
   including relevant implementation details, caveats, and literature references.
5. **Extensibility:** any functionality can be easily modified or extended by
   inheriting from existing classes.
6. **High code quality:** pre-commit hooks scan each commit for code quality (e.g. `black`,
   `flake8`), typing (`mypy`), and security (e.g. `bandit`, `safety`). The CI/CD process with
   GitHub Actions also includes over 250 unit and integration tests.
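To illustrate why the sparse CSR format mentioned in the list above saves memory for fingerprints, here is a small pure-Python sketch of the layout. It does not use SciPy and is not how the library builds its matrices; the helper names `to_csr` and `csr_row` and the toy bit vectors are made up for this example. The three arrays it produces (`data`, `indices`, `indptr`) are the same ones a SciPy CSR matrix stores.

```python
def to_csr(rows):
    """Convert a list of equal-length dense rows into CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column position
        # indptr[i + 1] marks where row i ends in data/indices.
        indptr.append(len(data))
    return data, indices, indptr


def csr_row(data, indices, indptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for pos in range(indptr[i], indptr[i + 1]):
        row[indices[pos]] = data[pos]
    return row


# Two toy 16-element fingerprints (made-up values), mostly zeros.
dense = [
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
]
data, indices, indptr = to_csr(dense)
print(data)     # [1, 1, 1, 2]
print(indices)  # [1, 7, 4, 15]
print(indptr)   # [0, 2, 4]
```

Only the four non-zero entries are stored instead of all 32 values. For typical fingerprints with thousands of positions and few set bits per molecule, the saving grows accordingly.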
## Contributing
Please read [CONTRIBUTING.md](CONTRIBUTING.md) and [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) for details on our code of
conduct, and the process for submitting pull requests to us.
## Citing
If you use scikit-fingerprints in your work, please cite our main publication,
[available on SoftwareX (open access)](https://doi.org/10.1016/j.softx.2024.101944):
```bibtex
@article{scikit_fingerprints,
title = {Scikit-fingerprints: Easy and efficient computation of molecular fingerprints in Python},
author = {Jakub Adamczyk and Piotr Ludynia},
journal = {SoftwareX},
volume = {28},
pages = {101944},
year = {2024},
issn = {2352-7110},
doi = {https://doi.org/10.1016/j.softx.2024.101944},
url = {https://www.sciencedirect.com/science/article/pii/S2352711024003145},
keywords = {Molecular fingerprints, Chemoinformatics, Molecular property prediction, Python, Machine learning, Scikit-learn},
}
```
The preprint is also [available on arXiv](https://arxiv.org/abs/2407.13291).
## License
This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.