CDmetrics

Name	CDmetrics
Version	0.1.4
home_page	None
Summary	Case difficulty metrics
upload_time	2024-10-30 20:28:07
maintainer	None
docs_url	None
author	None
requires_python	>=3.9
license	None
keywords	data_difficulty case_difficulty instance_hardness difficulty
requirements	No requirements were recorded.

# CDmetrics
Case Difficulty (Instance Hardness) metrics in Python, with three ways to measure the difficulty of individual cases: CDmc, CDdm, and CDpu.

## Case Difficulty Metrics
- Case Difficulty Model Complexity **(CDmc)**
  - CDmc is based on the complexity of the neural network required for accurate predictions.

- Case Difficulty Double Model **(CDdm)**
  - CDdm uses a pair of neural networks: one makes a prediction for a given case, and the other estimates the likelihood that this prediction is correct.

- Case Difficulty Predictive Uncertainty **(CDpu)**
  - CDpu evaluates the variability of the neural network's predictions.
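
To make the idea of prediction variability concrete, the sketch below summarizes repeated predictions for each case by their mean and standard deviation. It is an illustration only: the function name `summarize_predictions`, the input layout, and the sample probabilities are assumptions, and this is not presented as the formula CDpu uses internally.

```python
import numpy as np

def summarize_predictions(true_class_probs):
    """Summarize repeated predictions per case.

    true_class_probs: array of shape (n_predictions, n_cases) holding the
    probability that each repeated prediction assigned to the correct class.
    Returns (mean, std) arrays of shape (n_cases,).
    """
    probs = np.asarray(true_class_probs, dtype=float)
    return probs.mean(axis=0), probs.std(axis=0)

# Hypothetical probabilities: 4 repeated predictions for 2 cases.
samples = np.array([
    [0.95, 0.40],
    [0.97, 0.75],
    [0.96, 0.20],
    [0.94, 0.65],
])
mean_prob, spread = summarize_predictions(samples)
# Case 0 is predicted consistently; case 1 varies much more between runs.
```

A case whose predictions fluctuate widely (large `spread`) or sit far from the correct class on average (low `mean_prob`) is intuitively harder than one predicted confidently and consistently.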


## Getting Started
CDmetrics employs neural networks to measure the difficulty of individual cases in a dataset. The metrics are tailored to different definitions of prediction difficulty and are designed to perform well across various datasets.


### Installation
CDmetrics requires Python 3.9 or later. The instructions below cover standard installation and the use of CDmetrics to calculate case difficulty on your own datasets.

_For users_
```
pip install CDmetrics
```

_For developers_
```
git clone https://github.com/data-intelligence-for-health-lab/CDmetrics.git

cd CDmetrics
```

#### Anaconda environment

We **strongly recommend** using a separate Python environment. We provide an env file [environment.yml](./environment.yml) to create a conda environment with all required dependencies:

```
conda env create --file environment.yml
```

### Usage

Each metric requires certain parameters to run.

- CDmc requires number_of_NNs (the number of neural network models to make predictions):
```
from CDmetrics import CDmc
CDmc.compute_metric(data, number_of_NNs, target_column)
```

- CDdm requires num_folds (the number of folds to divide the data):
```
from CDmetrics import CDdm
CDdm.compute_metric(data, num_folds, target_column, max_layers, max_units, resources)
```

- CDpu requires number_of_predictions (the number of prediction probabilities to generate):
```
from CDmetrics import CDpu
CDpu.compute_metric(data, target_column, number_of_predictions, max_layers, max_units, resources)
```
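
As a minimal sketch of how the calls above might be wired together, the example below builds a tiny, already preprocessed dataframe and wraps the CDmc call in a helper. The toy data, the helper names `make_toy_data` and `run_cdmc`, and the value `20` for number_of_NNs are assumptions for illustration. The return value of `compute_metric` is not described here, so the sketch does not rely on it.

```python
import pandas as pd

def make_toy_data():
    """Build a tiny, already-numeric classification table (hypothetical data)."""
    return pd.DataFrame({
        "feature_a": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2],
        "feature_b": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
        "label": [0, 0, 0, 1, 1, 0],
    })

def run_cdmc(data, target_column, number_of_NNs):
    """Call CDmc using the positional argument order shown above."""
    # Imported lazily so this file loads even where CDmetrics is not installed.
    from CDmetrics import CDmc
    return CDmc.compute_metric(data, number_of_NNs, target_column)

data = make_toy_data()
# Uncomment to run once CDmetrics is installed (20 is an arbitrary choice):
# result = run_cdmc(data, target_column="label", number_of_NNs=20)
```

Note that the argument order differs between metrics: CDmc and CDdm take the count argument before target_column, while CDpu takes target_column first.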

The hyperparameters are tuned using grid search with Ray.
To change the hyperparameter search space, update search_space in the tune_parameters function in CDmetrics/utils.py.
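
Grid search evaluates every combination of the listed values, so the number of trials grows multiplicatively with each added option. The sketch below shows that expansion with the standard library only. The keys and values are hypothetical and are not copied from CDmetrics/utils.py. In Ray Tune itself, a list of candidates is typically wrapped as `tune.grid_search([...])`.

```python
from itertools import product

# Hypothetical search space: these keys and values are illustrative only and
# are not taken from CDmetrics/utils.py.
search_space = {
    "learning_rate": [0.001, 0.01],
    "batch_size": [32, 64],
    "activation": ["relu", "tanh"],
}

def expand_grid(space):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

configs = expand_grid(search_space)
print(len(configs))  # 2 * 2 * 2 = 8 candidate configurations
```

Keeping each list short is the simplest way to limit tuning time, since adding one more value to any key multiplies the total number of configurations.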

### Guidelines for input dataset

Please follow the recommendations below:

* Preprocess the dataset first: scaling, imputation, and encoding must be done before running CDmetrics.
* Pass the data as a dataframe.
* Do not include any index column.
* The target column name must be clearly specified.
* The metrics only support classification problems with tabular data.
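
The guidelines above can be sketched with pandas as follows. This is one possible preprocessing recipe under stated assumptions (median and most-frequent imputation, one-hot encoding, min-max scaling). The helper name `prepare_for_cdmetrics` and the toy columns are hypothetical, and CDmetrics does not prescribe these particular choices.

```python
import pandas as pd

def prepare_for_cdmetrics(raw, target_column):
    """Impute, encode, and scale the features; return a frame with a fresh index."""
    features = raw.drop(columns=[target_column])
    numeric_cols = features.select_dtypes(include="number").columns
    categorical_cols = features.columns.difference(numeric_cols)

    # Imputation: median for numeric columns, most frequent value otherwise.
    features[numeric_cols] = features[numeric_cols].fillna(features[numeric_cols].median())
    for col in categorical_cols:
        features[col] = features[col].fillna(features[col].mode().iloc[0])

    # Encoding: one-hot encode the categorical columns.
    features = pd.get_dummies(features, columns=list(categorical_cols), dtype=float)

    # Scaling: min-max scale the numeric columns to [0, 1].
    col_min = features[numeric_cols].min()
    col_range = (features[numeric_cols].max() - col_min).replace(0, 1)
    features[numeric_cols] = (features[numeric_cols] - col_min) / col_range

    # Re-attach the target and discard the old index so it is not kept as data.
    features[target_column] = raw[target_column].to_numpy()
    return features.reset_index(drop=True)

# Hypothetical raw table with missing values and a non-default index.
raw = pd.DataFrame(
    {"age": [30.0, None, 50.0, 40.0], "city": ["A", "B", None, "A"], "label": [0, 1, 0, 1]},
    index=[10, 11, 12, 13],
)
clean = prepare_for_cdmetrics(raw, target_column="label")
```

When loading data from a CSV file that was written with an index, dropping that column (or writing the file with `to_csv(index=False)`) keeps it from being treated as a feature.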

## Citation

If you're using CDmetrics in your research or application, please cite our [paper](https://www.nature.com/articles/s41598-024-61284-z):

> Kwon, H., Greenberg, M., Josephson, C.B. and Lee, J., 2024. Measuring the prediction difficulty of individual cases in a dataset using machine learning. Scientific Reports, 14(1), p.10474.

```
@article{kwon2024measuring,
  title={Measuring the prediction difficulty of individual cases in a dataset using machine learning},
  author={Kwon, Hyunjin and Greenberg, Matthew and Josephson, Colin Bruce and Lee, Joon},
  journal={Scientific Reports},
  volume={14},
  number={1},
  pages={10474},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
```

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "CDmetrics",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "data_difficulty, case_difficulty, instance_hardness, difficulty",
    "author": null,
    "author_email": "Eulee Kwon <euleekwon@gmail.com>, Yves Nsoga <nsogan@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f0/6f/30ec4bff26f407882b34de49d02d783cd8ddd4fba79895c201a8fc70a719/cdmetrics-0.1.4.tar.gz",
    "platform": null,
    "description": "# CDmetrics\nCase Difficulty (Instance Hardness) metrics in Python, with three ways to measure the difficulty of individual cases: CDmc, CDdm, and CDpu.\n\n## Case Difficulty Metrics\n- Case Difficulty Model Complexity **(CDmc)**\n  - CDmc is based on the complexity of the neural network required for accurate predictions.\n\n- Case Difficulty Double Model **(CDdm)**\n  - CDdm utilizes a pair of neural networks: one predicts a given case, and the other assesses the likelihood that the prediction made by the first model is correct.\n\n- Case Difficulty Predictive Uncertainty **(CDpu)**\n  - CDpu evaluates the variability of the neural network's predictions.\n\n\n## Getting Started\nCDmetrics employs neural networks to measure the difficulty of individual cases in a dataset. The metrics are tailored to different definitions of prediction difficulty and are designed to perform well across various datasets.\n\n\n### Installation\nThe package was developed using Python. Below, we provide standard installation instructions and guidelines for using CDmetrics to calculate case difficulty on your own datasets.\n\n_For users_\n```\npip install CDmetrics\n```\n\n_For developers_\n```\ngit clone https://github.com/data-intelligence-for-health-lab/CDmetrics.git\n\ncd CDmetrics\n```\n\n#### Anaconda environment\n\nWe **strongly recommend** using a separate Python environment. 
We provide an env file [environment.yml](./environment.yml) to create a conda environment with all required dependencies:\n\n```\nconda env create --file environment.yml\n```\n\n### Usage\n\nEach metric requires certain parameters to run.\n\n- CDmc requires number_of_NNs (the number of neural network models to make predictions):\n```\nfrom CDmetrics import CDmc\nCDmc.compute_metric(data, number_of_NNs, target_column)\n```\n\n- CDdm requires num_folds (the number of folds to divide the data):\n```\nfrom CDmetrics import CDdm\nCDdm.compute_metric(data, num_folds, target_column, max_layers, max_units, resources)\n```\n\n- CDpu requires number_of_predictions (the number of prediction probabilities to generate):\n```\nfrom CDmetrics import CDpu\nCDpu.compute_metric(data, target_column, number_of_predictions, max_layers, max_units, resources)\n```\n\nThe hyperparameters are tuned using Grid Search with Ray.\nTo change the hyperparameter search space, update the search_space in tune_parameters function in CDmetrics/utils.py.\n\n### Guidelines for input dataset\n\nPlease follow the recommendations below:\n\n* The dataset should be preprocessed (scaling, imputation, and encoding must be done before running CDmetrics).\n* Data needs to be passed in a dataframe.\n* Do not include any index column.\n* The target column name must be clearly specified.\n* The metrics only support classification problems with tabular data.\n\n## Citation\n\nIf you're using CDmetrics in your research or application, please cite our [paper](https://www.nature.com/articles/s41598-024-61284-z):\n\n> Kwon, H., Greenberg, M., Josephson, C.B. and Lee, J., 2024. Measuring the prediction difficulty of individual cases in a dataset using machine learning. 
Scientific Reports, 14(1), p.10474.\n\n```\n@article{kwon2024measuring,\n  title={Measuring the prediction difficulty of individual cases in a dataset using machine learning},\n  author={Kwon, Hyunjin and Greenberg, Matthew and Josephson, Colin Bruce and Lee, Joon},\n  journal={Scientific Reports},\n  volume={14},\n  number={1},\n  pages={10474},\n  year={2024},\n  publisher={Nature Publishing Group UK London}\n}\n```\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Case difficulty metrics",
    "version": "0.1.4",
    "project_urls": {
        "homepage": "https://cumming.ucalgary.ca/lab/dih/projects/current-projects/development-novel-case-specific-machine-learning-methodologies",
        "repository": "https://github.com/data-intelligence-for-health-lab/d_metrics"
    },
    "split_keywords": [
        "data_difficulty",
        " case_difficulty",
        " instance_hardness",
        " difficulty"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ba83a4f186020e8b5d3bf20bb2b1f207df4c2ee0b089f1e2076df87fab098870",
                "md5": "93dc95251fc62249f743b977f4ae2237",
                "sha256": "bd6f3a0a2f1fa4c3b0ccd697e29affdc6cbd05531e598dcdfc79bffa207a7e1f"
            },
            "downloads": -1,
            "filename": "CDmetrics-0.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "93dc95251fc62249f743b977f4ae2237",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 7965,
            "upload_time": "2024-10-30T20:28:05",
            "upload_time_iso_8601": "2024-10-30T20:28:05.082163Z",
            "url": "https://files.pythonhosted.org/packages/ba/83/a4f186020e8b5d3bf20bb2b1f207df4c2ee0b089f1e2076df87fab098870/CDmetrics-0.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f06f30ec4bff26f407882b34de49d02d783cd8ddd4fba79895c201a8fc70a719",
                "md5": "d96432c73d04d8d7e6d1a8f7a47468f8",
                "sha256": "20b4c2af744101f18882fc7862607acae225a3f88a320c37910148366f9944f3"
            },
            "downloads": -1,
            "filename": "cdmetrics-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "d96432c73d04d8d7e6d1a8f7a47468f8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 8221,
            "upload_time": "2024-10-30T20:28:07",
            "upload_time_iso_8601": "2024-10-30T20:28:07.110191Z",
            "url": "https://files.pythonhosted.org/packages/f0/6f/30ec4bff26f407882b34de49d02d783cd8ddd4fba79895c201a8fc70a719/cdmetrics-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-30 20:28:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "data-intelligence-for-health-lab",
    "github_project": "d_metrics",
    "github_not_found": true,
    "lcname": "cdmetrics"
}
