Name | assay-inspector |
Version | 1.0.5 |
home_page | None |
Summary | AssayInspector: A Python package for diagnostic assessment of data consistency in molecular datasets. |
upload_time | 2025-08-05 12:25:59 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT License
Copyright (c) 2025 Chemotargets
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
keywords |
data reporting
molecular property
adme
physicochemical
machine learning
data aggregation
predictive accuracy
benchmark
|
<div align="center">
<h1>
Data consistency assessment facilitates transfer learning in ADME modeling
</h1>
<p><i>AssayInspector: A Python package for diagnostic assessment of data consistency in molecular datasets</i></p>



</div>
<div align="center">
<img src="https://raw.githubusercontent.com/chemotargets/assay_inspector/master/AssayInspector.svg" alt="AssayInspector" width="80%">
</div>
Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments between benchmark and gold-standard sources that degrade model performance. Our analyses further revealed that dataset discrepancies arise from differences in various factors, from experimental conditions in data collection to chemical space coverage. This highlights the importance of **rigorous data consistency assessment (DCA) prior to modeling**. To facilitate a systematic DCA across diverse datasets, we developed **AssayInspector**, a **model-agnostic package** that leverages *statistics*, *visualizations*, and *diagnostic summaries* to identify *outliers*, *batch effects*, and *discrepancies*. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.
**Keywords:** data reporting, molecular property, ADME, physicochemical, machine learning, data aggregation, predictive accuracy, benchmark
## Installation
To install and use the package, first create the `conda` environment from the `AssayInspector_env.yml` file (the command below expects this file in the current working directory):
```bash
conda env create -f AssayInspector_env.yml
```
Then, activate the environment:
```bash
conda activate assay_inspector
```
Finally, install the package from PyPI using pip:
```bash
pip install assay_inspector
```
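After installation, a quick sanity check can confirm that the active interpreter satisfies the `>=3.9` requirement and that the `assay_inspector` module is importable. The snippet below is a minimal sketch that uses only the standard library; the helper name `environment_ready` is hypothetical and not part of the package.

```python
import importlib.util
import sys


def environment_ready(module_name="assay_inspector", min_python=(3, 9)):
    """Return True if the interpreter is recent enough and the module can be located."""
    python_ok = sys.version_info >= min_python
    # find_spec returns None when a top-level module is not installed.
    module_found = importlib.util.find_spec(module_name) is not None
    return python_ok and module_found


if __name__ == "__main__":
    print("Ready:", environment_ready())
```

If this prints `Ready: False`, check that the `assay_inspector` environment is activated and that the `pip install` step completed.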
## Getting Started
To run `AssayInspector`, you first need to prepare your input data. The file should be in `.tsv` or `.csv` format and include the following required columns:
* `smiles`: The SMILES string representation of each molecule in the dataset.
* `value`: The annotated value for each molecule: a numerical value for regression tasks, or a binary label (0 or 1) for classification tasks.
* `ref`: The reference source name from which each value-molecule annotation was obtained.
* `endpoint`: The name of the endpoint to analyze.
You can find two example input files for the [half-life](https://raw.githubusercontent.com/chemotargets/assay_inspector/refs/heads/master/data/half_life/logHL_aggregated_dataset.tsv) and [clearance](https://raw.githubusercontent.com/chemotargets/assay_inspector/refs/heads/master/data/clearance/logCL_aggregated_dataset.tsv) datasets.
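Before running the tool, it can help to confirm that a file actually carries the four required columns. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and choosing the delimiter from the file extension is an assumption of this example, not documented package behavior.

```python
import csv

# Column names taken from the list of required columns above.
REQUIRED_COLUMNS = {"smiles", "value", "ref", "endpoint"}


def missing_columns(path):
    """Return the required columns that are absent from the file header."""
    # Assumption: .tsv files are tab-separated, anything else is comma-separated.
    delimiter = "\t" if str(path).endswith(".tsv") else ","
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return sorted(REQUIRED_COLUMNS - set(header))


if __name__ == "__main__":
    import sys

    absent = missing_columns(sys.argv[1])
    print("Missing columns:", absent if absent else "none")
```

An empty result means the header is complete. The check does not validate the values themselves, for example whether each SMILES string parses.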
## Usage
Once the input data file has been prepared, you can run `AssayInspector` as follows:
```python
from assay_inspector import AssayInspector

# Prepare AssayInspector report
report = AssayInspector(
    data_path='path/to/dataset/file.tsv',
    endpoint_name='endpoint',
    task='regression',
    feature_type='ecfp4',
    reference_set='path/to/reference_set.tsv'  # optional
)

# Run AssayInspector report
report.get_individual_reporting()
report.get_comparative_reporting()
```
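Because several constructor options are plain strings, a typo may only surface once a run has started. The sketch below shows one way to pre-check keyword arguments against the values listed in the arguments table of this section. `check_arguments` is a hypothetical helper written for illustration; it is not part of `AssayInspector` and does not reflect how the package validates its inputs.

```python
# Allowed values copied from the documented arguments table.
ALLOWED = {
    "task": {"regression", "classification"},
    "feature_type": {"ecfp4", "rdkit", "custom"},
    "outliers_method": {"zscore", "iqr"},
}


def check_arguments(**kwargs):
    """Return a list of human-readable problems found in the keyword arguments."""
    problems = []
    for name, allowed in ALLOWED.items():
        if name in kwargs and kwargs[name] not in allowed:
            problems.append(f"{name}={kwargs[name]!r} is not one of {sorted(allowed)}")
    # The table states that descriptors_df is required for custom features.
    if kwargs.get("feature_type") == "custom" and kwargs.get("descriptors_df") is None:
        problems.append("descriptors_df is required when feature_type='custom'")
    return problems


if __name__ == "__main__":
    for problem in check_arguments(task="regresion", feature_type="custom"):
        print(problem)
```

An empty list means the arguments are consistent with the table, and the same keyword arguments can then be passed to the constructor.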
#### AssayInspector arguments
| Argument | Type | Description |
| --- | --- | --- |
| `data_path` | `str` | Path to the input dataset file (`.csv` or `.tsv` format). |
| `endpoint_name` | `str` | Name of the endpoint to analyze. |
| `task` | `str` | Type of task: either `'regression'` or `'classification'`. |
| `feature_type` | `str` | Type of features to use: one of `'ecfp4'`, `'rdkit'`, or `'custom'`. |
| `outliers_method` | `str` | *(Optional)* Method to detect outliers: `'zscore'` *(default)* or `'iqr'`. |
| `distance_metric` | `str` | *(Optional)* Distance metric for custom descriptors: `'euclidean'` *(default)*. |
| `descriptors_df` | `pd.DataFrame` | *(Optional)* DataFrame containing molecular descriptors for dataset molecules (required when `feature_type='custom'`). |
| `reference_set` | `str` | *(Optional)* Path to an additional dataset used for comparative analysis. |
| `lower_bound` | `int` or `float` | *(Optional)* Lower bound to define the endpoint applicability domain. |
| `upper_bound` | `int` or `float` | *(Optional)* Upper bound to define the endpoint applicability domain. |
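The two `outliers_method` options name widely used rules. As an illustration of how they differ, the sketch below implements the textbook versions with the standard library. The thresholds (3 standard deviations, 1.5 times the interquartile range) are common conventions assumed here; the source does not state which thresholds or quartile definition the package uses.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical endpoint values with one extreme measurement.
    sample = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 9.0]
    print("z-score:", zscore_outliers(sample))
    print("IQR:", iqr_outliers(sample))
```

In small samples a single extreme value inflates the standard deviation, so the z-score rule can fail to flag it while the IQR rule still does. This is one reason to compare both settings on sparse data sources.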
The output is saved in a folder named `AssayInspector_YYYYMMDD`, which contains:
- A tabular file that summarizes key descriptive parameters for each data source.
- A comprehensive set of visualization plots that facilitate the detection of inconsistencies across data sources.
- An insight report containing multiple alerts and recommendations to guide data cleaning and preprocessing.
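When scripting around the tool, the output folder can be located from its naming pattern. The sketch below assumes that `YYYYMMDD` is the date of the run and that the folder is created under the current working directory; neither is confirmed by the source, and `expected_output_dir` is a hypothetical helper.

```python
from datetime import date
from pathlib import Path


def expected_output_dir(run_date=None, base_dir="."):
    """Build the path AssayInspector_YYYYMMDD for a given date (default: today)."""
    run_date = run_date or date.today()
    return Path(base_dir) / f"AssayInspector_{run_date.strftime('%Y%m%d')}"


if __name__ == "__main__":
    out_dir = expected_output_dir()
    if out_dir.is_dir():
        # List whatever the run produced, without assuming file names.
        for item in sorted(out_dir.iterdir()):
            print(item.name)
    else:
        print(f"{out_dir} not found")
```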
## Examples
Below are a few sample outputs generated by `AssayInspector`.
| Endpoint | Outlier Visualization | Endpoint Distribution Comparative Visualization |
|-------------------------------|-------------------------------|----------------------------|
| Half-life |  |  |
| Clearance |  |  |
## License
`AssayInspector` is licensed under the MIT License. See the [LICENSE](https://github.com/chemotargets/assay_inspector/blob/master/LICENSE) file.
<!--
## Cite us
Please cite [our paper](url) if you use *AssayInspector* in your own work:
```
@article {TAG,
title = {Data consistency assessment facilitates transfer learning in ADME modeling},
author = {Parrondo-Pizarro, Raquel and Menestrina, Luca and Garcia-Serna, Ricard and Fernández-Torras, Adrià and Mestres, Jordi},
journal = {Journal},
volume = {Vol},
year = {Year},
doi = {doi},
URL = {url},
publisher = {Publisher},
}
```
-->
Raw data
{
"_id": null,
"home_page": null,
"name": "assay-inspector",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Raquel Parrondo-Pizarro <raquel.parrondo@chemotargets.com>, Luca Menestrina <luca.menestrina@chemotargets.com>",
"keywords": "data reporting, molecular property, ADME, physicochemical, machine learning, data aggregation, predictive accuracy, benchmark",
"author": null,
"author_email": "Raquel Parrondo-Pizarro <raquel.parrondo@chemotargets.com>, Luca Menestrina <luca.menestrina@chemotargets.com>",
"download_url": "https://files.pythonhosted.org/packages/46/00/edc24a388d73a7c8f0d0ba6b80a709774baffbd93b182670354b1602218a/assay_inspector-1.0.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2025 Chemotargets\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.\n ",
"summary": "AssayInspector: A Python package for diagnostic assessment of data consistency in molecular datasets.",
"version": "1.0.5",
"project_urls": {
"Documentation": "https://github.com/chemotargets/assay_inspector",
"Homepage": "https://github.com/chemotargets/assay_inspector",
"Repository": "https://github.com/chemotargets/assay_inspector"
},
"split_keywords": [
"data reporting",
" molecular property",
" adme",
" physicochemical",
" machine learning",
" data aggregation",
" predictive accuracy",
" benchmark"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ba2ec4db8c7222af4489a9ab97980a070bc7acae92991d4ccf3a052c13188063",
"md5": "2b1ebde235f9dd22e73fffbd0100c9e0",
"sha256": "e744f0eebee4b23249238096b4494ec41bab354141ac702cc16fc9e11434516a"
},
"downloads": -1,
"filename": "assay_inspector-1.0.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2b1ebde235f9dd22e73fffbd0100c9e0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 48484,
"upload_time": "2025-08-05T12:25:57",
"upload_time_iso_8601": "2025-08-05T12:25:57.595113Z",
"url": "https://files.pythonhosted.org/packages/ba/2e/c4db8c7222af4489a9ab97980a070bc7acae92991d4ccf3a052c13188063/assay_inspector-1.0.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "4600edc24a388d73a7c8f0d0ba6b80a709774baffbd93b182670354b1602218a",
"md5": "52ac68e2bdfcdf4bab6677cacd894e82",
"sha256": "c241b536a92895c4a2967f96e2ff78f932967343889b19e29ba10d5671426b1a"
},
"downloads": -1,
"filename": "assay_inspector-1.0.5.tar.gz",
"has_sig": false,
"md5_digest": "52ac68e2bdfcdf4bab6677cacd894e82",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 48102,
"upload_time": "2025-08-05T12:25:59",
"upload_time_iso_8601": "2025-08-05T12:25:59.026957Z",
"url": "https://files.pythonhosted.org/packages/46/00/edc24a388d73a7c8f0d0ba6b80a709774baffbd93b182670354b1602218a/assay_inspector-1.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-05 12:25:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "chemotargets",
"github_project": "assay_inspector",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "assay-inspector"
}