# CDmetrics
Case Difficulty (Instance Hardness) metrics in Python, with three ways to measure the difficulty of individual cases: CDmc, CDdm, and CDpu.
## Case Difficulty Metrics
- Case Difficulty Model Complexity **(CDmc)**
  - CDmc is based on the complexity of the neural network required for accurate predictions.
- Case Difficulty Double Model **(CDdm)**
  - CDdm utilizes a pair of neural networks: one predicts a given case, and the other assesses the likelihood that the prediction made by the first model is correct.
- Case Difficulty Predictive Uncertainty **(CDpu)**
  - CDpu evaluates the variability of the neural network's predictions.
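
To give a rough intuition for what "variability of predictions" can mean, the sketch below compares two hypothetical cases, each with five repeated predicted probabilities for the true class. It is only an illustration: the numbers are invented, the helper `prediction_spread` is not part of CDmetrics, and the standard deviation used here is an assumption, not the formula from the paper or the package.

```python
import statistics


def prediction_spread(probabilities):
    """Return the mean and population standard deviation of repeated predictions."""
    return statistics.fmean(probabilities), statistics.pstdev(probabilities)


# Hypothetical predicted probabilities of the true class for two cases,
# each obtained from five repeated predictions (values are made up).
stable_case = [0.91, 0.93, 0.92, 0.90, 0.94]
unstable_case = [0.15, 0.80, 0.45, 0.95, 0.30]

stable_mean, stable_std = prediction_spread(stable_case)
unstable_mean, unstable_std = prediction_spread(unstable_case)

# A larger spread means the predictions agree less with one another.
print(f"stable:   mean={stable_mean:.2f}, std={stable_std:.3f}")
print(f"unstable: mean={unstable_mean:.2f}, std={unstable_std:.3f}")
```

Under this simplified view, the second case would be considered harder because its predictions vary far more from one run to the next.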
## Getting Started
CDmetrics employs neural networks to measure the difficulty of individual cases in a dataset. The metrics are tailored to different definitions of prediction difficulty and are designed to perform well across various datasets.
### Installation
CDmetrics is a Python package and requires Python 3.9 or later. The instructions below cover installation and how to use CDmetrics to calculate case difficulty on your own datasets.
_For users_
```
pip install CDmetrics
```
_For developers_
```
git clone https://github.com/data-intelligence-for-health-lab/CDmetrics.git
cd CDmetrics
```
#### Anaconda environment
We **strongly recommend** using a separate Python environment. We provide an env file [environment.yml](./environment.yml) to create a conda environment with all required dependencies:
```
conda env create --file environment.yml
```
### Usage
Each metric requires certain parameters to run.
- CDmc requires number_of_NNs (the number of neural network models to make predictions):
```
from CDmetrics import CDmc
CDmc.compute_metric(data, number_of_NNs, target_column)
```
- CDdm requires num_folds (the number of folds to divide the data):
```
from CDmetrics import CDdm
CDdm.compute_metric(data, num_folds, target_column, max_layers, max_units, resources)
```
- CDpu requires number_of_predictions (the number of prediction probabilities to generate):
```
from CDmetrics import CDpu
CDpu.compute_metric(data, target_column, number_of_predictions, max_layers, max_units, resources)
```
The hyperparameters are tuned using Grid Search with Ray.
To change the hyperparameter search space, update search_space in the tune_parameters function in CDmetrics/utils.py.
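
As a reminder of what a grid search implies for runtime, the sketch below enumerates every combination in a small search space using only the standard library. The dictionary keys and candidate values are hypothetical and are not copied from CDmetrics/utils.py; the sketch also does not use Ray, it only shows how the number of trials grows with each added value.

```python
from itertools import product

# Hypothetical candidate values; the real ones are defined in CDmetrics/utils.py.
search_space = {
    "num_layers": [1, 2, 3],
    "num_units": [8, 16],
    "learning_rate": [0.01, 0.001],
}


def grid_configurations(space):
    """Yield one dict per combination of candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configurations = list(grid_configurations(search_space))

# A grid search trains one model per configuration: 3 * 2 * 2 = 12 here.
print(len(configurations))
```

Because the number of configurations is the product of the list lengths, widening the search space by even one value per hyperparameter can noticeably increase the total training time.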
### Guidelines for input dataset
Please follow the recommendations below:
* The dataset must be preprocessed before running CDmetrics (scaling, imputation, and encoding).
* Pass the data as a dataframe.
* Do not include any index column.
* The target column name must be clearly specified.
* The metrics only support classification problems with tabular data.
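
The checklist above could be applied along the following lines. This is a minimal sketch that assumes a pandas DataFrame; the column names, the values, and the specific choices (median imputation, one-hot encoding, standardization) are illustrative assumptions, not requirements of CDmetrics.

```python
import pandas as pd

# Hypothetical raw data: one numeric feature with a missing value,
# one categorical feature, and a binary target.
raw = pd.DataFrame(
    {
        "age": [34.0, None, 51.0, 45.0],
        "smoker": ["yes", "no", "no", "yes"],
        "outcome": [1, 0, 0, 1],
    },
    index=[10, 11, 12, 13],
)

target_column = "outcome"
features = raw.drop(columns=[target_column])

# Imputation: fill missing numeric values with the column median.
features["age"] = features["age"].fillna(features["age"].median())

# Encoding: one-hot encode the categorical column.
features = pd.get_dummies(features, columns=["smoker"], dtype=float)

# Scaling: standardize the numeric column to zero mean and unit variance.
features["age"] = (features["age"] - features["age"].mean()) / features["age"].std(ddof=0)

# Reattach the target and discard the original index so it is not kept as a column.
data = pd.concat([features, raw[target_column]], axis=1).reset_index(drop=True)

print(data)
```

The resulting `data` and `target_column` are in the shape that the `compute_metric` calls in the Usage section expect as their first and target arguments.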
## Citation
If you're using CDmetrics in your research or application, please cite our [paper](https://www.nature.com/articles/s41598-024-61284-z):
> Kwon, H., Greenberg, M., Josephson, C.B. and Lee, J., 2024. Measuring the prediction difficulty of individual cases in a dataset using machine learning. Scientific Reports, 14(1), p.10474.
```
@article{kwon2024measuring,
title={Measuring the prediction difficulty of individual cases in a dataset using machine learning},
author={Kwon, Hyunjin and Greenberg, Matthew and Josephson, Colin Bruce and Lee, Joon},
journal={Scientific Reports},
volume={14},
number={1},
pages={10474},
year={2024},
publisher={Nature Publishing Group UK London}
}
```