# MLChemAD
Applicability domain definitions for cheminformatics modelling.
# Getting Started
## Install
```
pip install mlchemad
```
## Example Usage
- With molecular fingerprints, prefer `KNNApplicabilityDomain` with `k=1`, `scaling=None`, `hard_threshold=0.3`, and `dist='jaccard'`.
- Otherwise, `TopKatApplicabilityDomain` is recommended.
```python
from mlchemad import TopKatApplicabilityDomain, KNNApplicabilityDomain, data
# Create the applicability domain using TopKat's definition
app_domain = TopKatApplicabilityDomain()
# Fit it to the training set
app_domain.fit(data.mekenyan1993.training)
# Determine outliers from multiple samples (rows) ...
print(app_domain.contains(data.mekenyan1993.test))
# ... or a single sample
sample = data.mekenyan1993.test.iloc[5] # Obtain the row at position 5 (the sixth row) as a pandas.Series object
print(app_domain.contains(sample))
# Now with Morgan fingerprints
app_domain = KNNApplicabilityDomain(k=1, scaling=None, hard_threshold=0.3, dist='jaccard')
app_domain.fit(data.broccatelli2011.training.drop(columns='Activity'))
print(app_domain.contains(data.broccatelli2011.test.drop(columns='Activity')))
```
Depending on the definition of the applicability domain, some samples of the training set might be outliers themselves.
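
To illustrate how a training sample can fall outside the domain fitted on it, here is a minimal pure-Python sketch of a distance-to-centroid domain. It is not MLChemAD's implementation: the helper names `fit_centroid_domain` and `contains`, and the threshold rule (mean plus two standard deviations of the training distances), are assumptions chosen for illustration only.

```python
import math
import statistics


def fit_centroid_domain(training, n_std=2.0):
    """Fit a toy centroid-based domain (hypothetical helper, not the MLChemAD API)."""
    dims = len(training[0])
    # Centroid: per-dimension mean of the training samples
    centroid = [statistics.fmean(row[d] for row in training) for d in range(dims)]
    # Euclidean distance of every training sample to the centroid
    distances = [math.dist(row, centroid) for row in training]
    # Assumed threshold rule: mean + n_std * population standard deviation
    threshold = statistics.fmean(distances) + n_std * statistics.pstdev(distances)
    return centroid, threshold


def contains(sample, centroid, threshold):
    """Return True if the sample lies within the threshold distance of the centroid."""
    return math.dist(sample, centroid) <= threshold


# Nine clustered training samples plus one isolated training sample
training = [(x, y) for x in range(3) for y in range(3)] + [(20, 20)]
centroid, threshold = fit_centroid_domain(training)
flags = [contains(row, centroid, threshold) for row in training]
print(flags)  # the isolated sample (last entry) is reported as outside the domain
```

The isolated point shifts the centroid and widens the threshold, yet it still lies beyond it, so the domain excludes a sample it was fitted on.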
# Applicability domains
MLChemAD defines the following applicability domains:
- Bounding Box
- PCA Bounding Box
- Convex Hull<br/>
***(does not scale well)***
- TOPKAT's Optimum Prediction Space<br/>
***(recommended with molecular descriptors)***
- Leverage
- Hotelling T²
- Distance to Centroids
- k-Nearest Neighbors<br/>
***(recommended with molecular fingerprints; for ECFP fingerprints, use `dist='rogerstanimoto'`, `scaling=None` and `hard_threshold=0.75`)***
- Isolation Forests
- Non-parametric Kernel Densities
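
As a rough illustration of the k-Nearest Neighbors definition applied to fingerprints, the sketch below represents each fingerprint as a set of on-bit indices and uses the Jaccard distance. It is a simplified stand-in rather than the library's code: the functions `jaccard_distance` and `knn_contains` are hypothetical, and the assumption is that a sample is inside the domain when its mean distance to the `k` nearest training fingerprints does not exceed `hard_threshold`.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / len(union)


def knn_contains(training, sample, k=1, hard_threshold=0.3):
    """Toy membership test (assumed rule): mean distance to the k nearest
    training fingerprints must be at most hard_threshold."""
    distances = sorted(jaccard_distance(sample, fp) for fp in training)
    nearest = distances[:k]
    return sum(nearest) / len(nearest) <= hard_threshold


# Made-up fingerprints, each listed by the indices of its set bits
training = [frozenset({1, 4, 7, 9}), frozenset({2, 4, 8}), frozenset({0, 3, 5, 6})]
close = frozenset({1, 4, 7, 9, 11})  # shares 4 of 5 bits with the first fingerprint
far = frozenset({10, 12, 13})        # shares no bits with any training fingerprint

print(knn_contains(training, close))  # nearest distance is about 0.2, within 0.3
print(knn_contains(training, far))    # nearest distance is 1.0, beyond 0.3
```

With `k=1`, membership depends only on the single most similar training fingerprint, so a query is accepted as soon as one close analogue exists in the training set.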
Raw data
{
"_id": null,
"home_page": "https://github.com/OlivierBeq/mlchemad",
"name": "mlchemad",
"maintainer": "Olivier J.M. B\u00e9quignon",
"docs_url": null,
"requires_python": null,
"maintainer_email": "olivier.bequignon.maintainer@gmail.com",
"keywords": "applicability domain, cheminformatics, outlier molecule detection, out-of-distribution detection, machine learning",
"author": "Olivier J.M. B\u00e9quignon",
"author_email": "olivier.bequignon.maintainer@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/58/1d/2452236c0e6cfcaf451f56dbbfe5bcc91354411e983d99cdc7a62af3314a/mlchemad-1.5.2.tar.gz",
"platform": null,
"description": "# MLChemAD\r\nApplicability domain definitions for cheminformatics modelling.\r\n\r\n# Getting Started\r\n\r\n## Install\r\n```\r\npip install mlchemad\r\n```\r\n\r\n## Example Usage\r\n\r\n- With molecular fingerprints, prefer the use of the `KNNApplicabilityDomain` with `k=1`, `scaling=None`, `hard_threshold=0.3`, and `dist='jaccard'`.\r\n- Otherwise, the use of the `TopKatApplicabilityDomain` is recommended.\r\n\r\n```python\r\nfrom mlchemad import TopKatApplicabilityDomain, KNNApplicabilityDomain, data\r\n\r\n# Create the applicability domain using TopKat's definition\r\napp_domain = TopKatApplicabilityDomain()\r\n# Fit it to the training set\r\napp_domain.fit(data.mekenyan1993.training)\r\n\r\n# Determine outliers from multiple samples (rows) ...\r\nprint(app_domain.contains(data.mekenyan1993.test))\r\n\r\n# ... or a unique sample\r\nsample = data.mekenyan1993.test.iloc[5] # Obtain the 5th row as a pandas.Series object \r\nprint(app_domain.contains(sample))\r\n\r\n# Now with Morgan fingerprints\r\napp_domain = KNNApplicabilityDomain(k=1, scaling=None, hard_threshold=0.3, dist='jaccard')\r\napp_domain.fit(data.broccatelli2011.training.drop(columns='Activity'))\r\nprint(app_domain.contains(data.broccatelli2011.test.drop(columns='Activity')))\r\n```\r\n\r\nDepending on the definition of the applicability domain, some samples of the training set might be outliers themselves.\r\n\r\n# Applicability domains\r\nThe applicability domain defined by MLChemAD as the following:\r\n- Bounding Box\r\n- PCA Bounding Box\r\n- Convex Hull<br/>\r\n ***(does not scale well)***\r\n- TOPKAT's Optimum Prediction Space<br/>\r\n ***(recommended with molecular descriptors)***\r\n- Leverage\r\n- Hotelling T\u00b2\r\n- Distance to Centroids\r\n- k-Nearest Neighbors<br/>\r\n ***(recommended with molecular fingerprints with the use of `dist='rogerstanimoto'`, `scaling=None` and `hard_threshold=0.75` for ECFP fingerprints)***\r\n- Isolation Forests\r\n- Non-parametric 
Kernel Densities\r\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Applicability domains for cheminformactics.",
"version": "1.5.2",
"project_urls": {
"Homepage": "https://github.com/OlivierBeq/mlchemad"
},
"split_keywords": [
"applicability domain",
" cheminformatics",
" outlier molecule detection",
" out-of-distribution detection",
" machine learning"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "984937b077b10c2bd780bf16a79cdf15d6f1dd0cf49ea51a0162ee0477989a4f",
"md5": "832ea93fe074abdb10e12d89d0d52a22",
"sha256": "b578ceca58139578c84843aa851aa14430e3748bcf339a7b9a12ad94dbddfa3e"
},
"downloads": -1,
"filename": "mlchemad-1.5.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "832ea93fe074abdb10e12d89d0d52a22",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 314364,
"upload_time": "2024-04-17T11:46:57",
"upload_time_iso_8601": "2024-04-17T11:46:57.401573Z",
"url": "https://files.pythonhosted.org/packages/98/49/37b077b10c2bd780bf16a79cdf15d6f1dd0cf49ea51a0162ee0477989a4f/mlchemad-1.5.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "581d2452236c0e6cfcaf451f56dbbfe5bcc91354411e983d99cdc7a62af3314a",
"md5": "ab322f828ea6da5ec56dd08389f9b7cd",
"sha256": "b0f2c6d6b8c639e0c873f14af0364693ad7b3e9641705441464d5e1816168a41"
},
"downloads": -1,
"filename": "mlchemad-1.5.2.tar.gz",
"has_sig": false,
"md5_digest": "ab322f828ea6da5ec56dd08389f9b7cd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 286179,
"upload_time": "2024-04-17T11:46:59",
"upload_time_iso_8601": "2024-04-17T11:46:59.907326Z",
"url": "https://files.pythonhosted.org/packages/58/1d/2452236c0e6cfcaf451f56dbbfe5bcc91354411e983d99cdc7a62af3314a/mlchemad-1.5.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-17 11:46:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "OlivierBeq",
"github_project": "mlchemad",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "mlchemad"
}