MLstatkit

Name: MLstatkit
Version: 0.1.9
Summary: MLstatkit integrates established statistical methods into ML workflows (DeLong test, bootstrapping CI, AUC2OR, permutation test, etc.).
Upload time: 2025-08-23 11:02:58
Requires Python: >=3.8
License: MIT
Keywords: python, statistics, DeLong test, bootstrapping, AUC2OR, machine learning, permutation test
# MLstatkit

![PyPI - Status](https://img.shields.io/pypi/status/MLstatkit)
![PyPI - Wheel](https://img.shields.io/pypi/wheel/MLstatkit)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/MLstatkit)
![PyPI - Download](https://img.shields.io/pypi/dm/MLstatkit)
[![Downloads](https://static.pepy.tech/badge/MLstatkit)](https://pepy.tech/project/MLstatkit)

**MLstatkit** is a Python library that integrates established statistical methods into modern machine learning workflows.  
It provides a set of core functions widely used for model evaluation and statistical inference:

- **DeLong's test** (`Delong_test`) for comparing the AUCs of two correlated ROC curves.  

- **Bootstrapping** (`Bootstrapping`) for estimating confidence intervals of metrics such as ROC-AUC, F1-score, accuracy, precision, recall, and PR-AUC.  

- **Permutation test** (`Permutation_test`) for evaluating whether performance differences between two models are statistically significant.  

- **AUC to Odds Ratio conversion** (`AUC2OR`) for interpreting ROC-AUC values in terms of odds ratios and related effect size statistics.  

Since v0.1.9, the library has been **modularized** into dedicated files (`ci.py`, `conversions.py`, `delong.py`, `metrics.py`, `permutation.py`), while keeping a unified import interface through `stats.py`. This improves readability, maintainability, and extensibility for future methods.
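
The usage examples below import each function from the top-level package. Based on the layout described above, the same names should also be reachable through `stats.py`; the `MLstatkit.stats` path shown in the commented line is an assumption, not documented API:

```python
# Both styles are expected to resolve to the same functions.
from MLstatkit import Delong_test, Bootstrapping, Permutation_test, AUC2OR
# from MLstatkit.stats import Delong_test, Bootstrapping, Permutation_test, AUC2OR  # assumed path
```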

## Installation

Install MLstatkit directly from PyPI using pip:

```bash
pip install MLstatkit
```

## Usage

### DeLong's Test for ROC Curves

The `Delong_test` function provides a statistical comparison of the **areas under two correlated Receiver Operating Characteristic (ROC) curves** produced by two different models on the same data, giving a clearer picture of their relative performance.  
Since version `0.1.8`, the function also supports returning **confidence intervals (CIs)** for the AUCs of both models, similar to the functionality of `roc.test` in R.

#### Parameters (DeLong’s Test)

- **true** : array-like of shape (n_samples,)  
    True binary labels in range {0, 1}.

- **prob_A** : array-like of shape (n_samples,)  
    Predicted probabilities by the first model.

- **prob_B** : array-like of shape (n_samples,)  
    Predicted probabilities by the second model.

- **return_ci** : bool, default=False  
    If True, also return the confidence intervals (CIs) of AUCs for both models.

- **alpha** : float, default=0.95  
    Confidence level for the AUC CIs (e.g., 0.95 for a 95% confidence interval).

#### Returns (DeLong’s Test)

- **z_score** : float  
    The z score from comparing the AUCs of two models.

- **p_value** : float  
    The p value from comparing the AUCs of two models.

- **ci_A** : tuple(float, float), optional  
    Lower and upper bounds of the confidence interval for model A's AUC (if `return_ci=True`).

- **ci_B** : tuple(float, float), optional  
    Lower and upper bounds of the confidence interval for model B's AUC (if `return_ci=True`).

#### Example (DeLong’s Test)

```python
from MLstatkit import Delong_test
import numpy as np

# Example data
true = np.array([0, 1, 0, 1])
prob_A = np.array([0.1, 0.4, 0.35, 0.8])
prob_B = np.array([0.2, 0.3, 0.4, 0.7])

# Perform DeLong's test (z-score and p-value only)
z_score, p_value = Delong_test(true, prob_A, prob_B)
print(f"Z-Score: {z_score}, P-Value: {p_value}")

# Perform DeLong's test with 95% confidence intervals
z_score, p_value, ci_A, ci_B = Delong_test(true, prob_A, prob_B, return_ci=True, alpha=0.95)
print(f"Z-Score: {z_score}, P-Value: {p_value}")
print(f"Model A AUC 95% CI: {ci_A}")
print(f"Model B AUC 95% CI: {ci_B}")
```

This example shows how `Delong_test` statistically compares the AUCs of two models from their predicted probabilities and the true labels. The returned z-score and p-value indicate whether the difference in performance is statistically significant.

The output includes both significance testing (z-score and p-value) and, if requested, the confidence intervals for each model’s ROC-AUC. This makes it straightforward to compare model performance in a statistically rigorous way.
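
For readers curious about what happens under the hood, below is a self-contained sketch of the fast DeLong procedure in its midrank formulation (see the Sun & Xu reference in the References section). It is illustrative only and written independently of MLstatkit's internals; `delong_sketch` and `_midrank` are hypothetical names, not part of the library's API.

```python
import numpy as np
from scipy import stats


def _midrank(x):
    """Midranks of x (ties receive the average of their ranks)."""
    order = np.argsort(x)
    z = x[order]
    n = len(x)
    ranks = np.zeros(n)
    i = 0
    while i < n:
        j = i
        while j < n and z[j] == z[i]:
            j += 1
        ranks[i:j] = 0.5 * (i + j - 1) + 1   # average of 1-based ranks i+1 .. j
        i = j
    out = np.empty(n)
    out[order] = ranks
    return out


def delong_sketch(y_true, prob_a, prob_b):
    """Illustrative fast-DeLong comparison of two AUCs (not the library's code)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-y_true)               # reorder so positives come first
    preds = np.vstack([prob_a, prob_b])[:, order]
    m = int(y_true.sum())                     # number of positives
    n = len(y_true) - m                       # number of negatives

    tx = np.array([_midrank(p[:m]) for p in preds])   # midranks within positives
    ty = np.array([_midrank(p[m:]) for p in preds])   # midranks within negatives
    tz = np.array([_midrank(p) for p in preds])       # midranks over all samples

    aucs = tz[:, :m].sum(axis=1) / (m * n) - (m + 1) / (2 * n)
    v01 = (tz[:, :m] - tx) / n                # structural components (positives)
    v10 = 1.0 - (tz[:, m:] - ty) / m          # structural components (negatives)
    cov = np.cov(v01) / m + np.cov(v10) / n   # 2x2 covariance of the two AUC estimates

    var_diff = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]
    z = (aucs[0] - aucs[1]) / np.sqrt(var_diff)
    p = 2 * stats.norm.sf(abs(z))             # two-sided p-value

    # Wald-style 95% CIs for each AUC from its DeLong variance
    half_width = stats.norm.ppf(0.975) * np.sqrt(np.diag(cov))
    cis = [(a - h, a + h) for a, h in zip(aucs, half_width)]
    return z, p, aucs, cis
```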

### Bootstrapping for Confidence Intervals

The `Bootstrapping` function calculates **confidence intervals (CIs)** for a chosen performance metric by resampling, providing a measure of the estimate's reliability. It supports AUROC (area under the ROC curve), AUPRC (area under the precision-recall curve), F1 score, accuracy, precision, and recall.
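
Conceptually, the procedure resamples the evaluation set with replacement, recomputes the metric on each resample, and reads the CI off the empirical percentiles of those scores. Below is a minimal sketch of that idea, not MLstatkit's exact implementation; `percentile_bootstrap` is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def percentile_bootstrap(y_true, y_prob, metric=roc_auc_score,
                         n_bootstraps=1000, confidence_level=0.95, random_state=0):
    """Illustrative percentile-bootstrap CI for a score function."""
    rng = np.random.RandomState(random_state)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    original = metric(y_true, y_prob)

    scores = []
    for _ in range(n_bootstraps):
        idx = rng.randint(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC-type metrics need both classes
            continue
        scores.append(metric(y_true[idx], y_prob[idx]))

    alpha = 1.0 - confidence_level
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return original, lower, upper
```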

#### Parameters (Bootstrapping)

- **true** : array-like of shape (n_samples,)  
    True binary labels, where the labels are either {0, 1}.

- **prob** : array-like of shape (n_samples,)  
    Predicted probabilities, as returned by a classifier's `predict_proba` method, or binary predictions based on the specified scoring function and threshold.

- **metric_str** : str, default='f1'  
    Identifier for the scoring function to use. Supported values include 'f1', 'accuracy', 'recall', 'precision', 'roc_auc', 'pr_auc', and 'average_precision'.

- **n_bootstraps** : int, default=1000  
    The number of bootstrap iterations to perform. Increasing this number improves the reliability of the confidence interval estimation but also increases computational time.

- **confidence_level** : float, default=0.95  
    The confidence level for the interval estimation. For instance, 0.95 represents a 95% confidence interval.

- **threshold** : float, default=0.5  
    A threshold value used for converting probabilities to binary labels for metrics like 'f1', where applicable.

- **average** : str, default='macro'  
    Specifies the method of averaging to apply to multi-class/multi-label targets. Other options include 'micro', 'samples', 'weighted', and 'binary'.

- **random_state** : int, default=0  
    Seed for the random number generator. This parameter ensures reproducibility of results.

#### Returns (Bootstrapping)

- **original_score** : float  
    Metric score on the original (non-resampled) dataset.

- **confidence_lower** : float  
    Lower bound of the bootstrap confidence interval.

- **confidence_upper** : float  
    Upper bound of the bootstrap confidence interval.

#### Examples (Bootstrapping)

```python
import numpy as np
from MLstatkit import Bootstrapping

# Example data
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05])

# Calculate confidence intervals for AUROC
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'roc_auc')
print(f"AUROC: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")

# Calculate confidence intervals for AUPRC
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'pr_auc')
print(f"AUPRC: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")

# Calculate confidence intervals for F1 score with a custom threshold
original_score, confidence_lower, confidence_upper = Bootstrapping(y_true, y_prob, 'f1', threshold=0.5)
print(f"F1 Score: {original_score:.3f}, Confidence interval: [{confidence_lower:.3f} - {confidence_upper:.3f}]")

# Loop through multiple metrics
for score in ['roc_auc', 'pr_auc', 'f1']:
    original_score, conf_lower, conf_upper = Bootstrapping(y_true, y_prob, score, threshold=0.5)
    print(f"{score.upper()} original score: {original_score:.3f}, confidence interval: [{conf_lower:.3f} - {conf_upper:.3f}]")
```

### Permutation Test for Statistical Significance

The `Permutation_test` function evaluates whether the observed difference in performance between two models is **statistically significant**.  
It works by randomly shuffling the predictions between the models and recalculating the chosen metric many times to generate a null distribution of differences.  
This approach makes no assumptions about the underlying distribution of the data, making it a robust method for model comparison.
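
The sketch below illustrates that idea for the F1 score: each permutation swaps the two models' predictions on a random subset of samples and records the resulting metric difference, building the null distribution. It is illustrative only; `permutation_sketch` is a hypothetical name, not the library's API.

```python
import numpy as np
from sklearn.metrics import f1_score


def permutation_sketch(y_true, prob_a, prob_b, n_permutations=1000,
                       threshold=0.5, random_state=0):
    """Illustrative paired permutation test on the F1 difference."""
    rng = np.random.RandomState(random_state)
    y_true = np.asarray(y_true)
    prob_a, prob_b = np.asarray(prob_a), np.asarray(prob_b)
    pred_a = (prob_a >= threshold).astype(int)
    pred_b = (prob_b >= threshold).astype(int)

    metric_a = f1_score(y_true, pred_a)
    metric_b = f1_score(y_true, pred_b)
    observed = abs(metric_a - metric_b)          # observed difference (the "benchmark")

    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        swap = rng.rand(len(y_true)) < 0.5       # per-sample coin flip
        a = np.where(swap, pred_b, pred_a)       # swap the two models' predictions
        b = np.where(swap, pred_a, pred_b)
        diffs[i] = abs(f1_score(y_true, a) - f1_score(y_true, b))

    p_value = np.mean(diffs >= observed)         # fraction of null diffs at least as extreme
    return metric_a, metric_b, p_value, observed, diffs.mean(), diffs.std()
```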

#### Parameters

- **y_true** : array-like of shape (n_samples,)  
  True binary labels in {0, 1}.  

- **prob_model_A** : array-like of shape (n_samples,)  
  Predicted probabilities from the first model.  

- **prob_model_B** : array-like of shape (n_samples,)  
  Predicted probabilities from the second model.  

- **metric_str** : str, default=`'f1'`  
  Metric to compare. Supported: `'f1'`, `'accuracy'`, `'recall'`, `'precision'`, `'roc_auc'`, `'pr_auc'`, `'average_precision'`.  

- **n_bootstraps** : int, default=`1000`  
  Number of permutation samples to generate.  

- **threshold** : float, default=`0.5`  
  Threshold for converting probabilities into binary predictions (used for metrics such as F1, precision, recall).  

- **average** : str, default=`'macro'`  
  Averaging strategy for multi-class/multi-label tasks. Options: `'binary'`, `'micro'`, `'macro'`, `'weighted'`, `'samples'`.  

- **random_state** : int, default=`0`  
  Random seed for reproducibility.  

#### Returns

- **metric_a** : float  
  Metric value for model A on the original data.  

- **metric_b** : float  
  Metric value for model B on the original data.  

- **p_value** : float  
  The p-value from the permutation test, i.e., the probability under the null hypothesis of observing a difference at least as extreme as the actual one.  

- **benchmark** : float  
  The observed absolute difference between the metrics of model A and model B.  

- **samples_mean** : float  
  Mean of the metric differences from permutation samples.  

- **samples_std** : float  
  Standard deviation of the metric differences from permutation samples.  

#### Example

```python
import numpy as np
from MLstatkit import Permutation_test

y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0])
prob_model_A = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.3, 0.4, 0.7, 0.05])
prob_model_B = np.array([0.2, 0.3, 0.25, 0.85, 0.15, 0.35, 0.45, 0.65, 0.01])

# Compare models using a permutation test on F1 score
metric_a, metric_b, p_value, benchmark, samples_mean, samples_std = Permutation_test(
    y_true, prob_model_A, prob_model_B, metric_str='f1'
)

print(f"F1 Score Model A: {metric_a:.5f}, Model B: {metric_b:.5f}")
print(f"Observed Difference: {benchmark:.5f}, p-value: {p_value:.5f}")
print(f"Permutation Samples Mean: {samples_mean:.5f}, Std: {samples_std:.5f}")
```

### Conversion of AUC to Odds Ratio (OR)

The `AUC2OR` function converts an **Area Under the ROC Curve (AUC)** value into an **Odds Ratio (OR)** under the binormal model.  
This transformation helps interpret classification performance in terms of effect sizes commonly used in statistics.  

Under the binormal model:

$$
\mathrm{AUC} = \Phi\left(\frac{d}{\sqrt{2}}\right), \quad \text{where } d \text{ is Cohen's } d,
$$

$$
\ln(\mathrm{OR}) = \frac{\pi}{\sqrt{3}} \, d .
$$

Since version `0.1.9`, `AUC2OR` uses the exact **inverse normal CDF** (`scipy.stats.norm.ppf`) to compute $z = \Phi^{-1}(\mathrm{AUC})$, improving accuracy over the earlier polynomial approximation.
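
For a concrete sense of the conversion, here is a direct transcription of the formulas above (a standalone sketch, independent of the library's internals):

```python
import numpy as np
from scipy.stats import norm

auc = 0.7
z = norm.ppf(auc)                 # probit of the AUC, z = Phi^{-1}(AUC)   (~0.5244)
d = np.sqrt(2) * z                # Cohen's d under the binormal model     (~0.7416)
ln_or = np.pi / np.sqrt(3) * d    # natural log of the odds ratio          (~1.345)
OR = np.exp(ln_or)                # odds ratio                             (~3.84)
print(f"z={z:.4f}, d={d:.4f}, ln_OR={ln_or:.4f}, OR={OR:.2f}")
```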

#### Parameters (AUC to OR)

- **AUC** : float  
  Area Under the ROC Curve, must be in (0, 1).  

- **return_all** : bool, default=`False`  
  If True, returns intermediate values `(z, d, ln_or, OR)` in addition to OR:  
  - **z** : probit (inverse normal CDF of AUC)  
  - **d** : effect size, `sqrt(2) * z`  
  - **ln_or** : natural logarithm of the Odds Ratio  
  - **OR** : Odds Ratio  

#### Returns (AUC to OR)

- **OR** : float  
  Odds Ratio corresponding to the given AUC.  

- **(z, d, ln_or, OR)** if `return_all=True`.  

#### Example (AUC to OR)

```python
from MLstatkit import AUC2OR

auc = 0.7  # Example AUC value

# Convert AUC to OR and retrieve intermediate values
z, d, ln_or, OR = AUC2OR(auc, return_all=True)
print(f"z: {z:.5f}, d: {d:.5f}, ln_OR: {ln_or:.5f}, OR: {OR:.5f}")

# Convert AUC to OR without intermediate values
OR = AUC2OR(auc)
print(f"OR: {OR:.5f}")
```

## References

### DeLong's Test

The implementation of `Delong_test` in MLstatkit is based on the following publication:

- Xu Sun and Weichao Xu, "Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves," in *IEEE Signal Processing Letters*, vol. 21, no. 11, pp. 1389-1393, 2014, IEEE.

### Bootstrapping

The `Bootstrapping` function relies on a widely accepted statistical technique, estimating the distribution of a metric by resampling with replacement, rather than on a single publication. For a comprehensive overview of bootstrap methods, see:

- B. Efron and R. Tibshirani, "An Introduction to the Bootstrap," Chapman & Hall/CRC Monographs on Statistics & Applied Probability, 1994.

### Permutation Test

The `Permutation_test` function assesses the significance of the difference in a performance metric between two models by randomly reallocating observations between groups and recomputing the metric. Because it makes no specific distributional assumptions, it is versatile across data types. For a foundational discussion of permutation tests, refer to:

- P. Good, "Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses," Springer Series in Statistics, 2000.

These references lay the groundwork for the statistical tests and methodologies implemented in MLstatkit, providing users with a deep understanding of their scientific basis and applicability.

### AUC2OR

The binormal-model conversion implemented by `AUC2OR` (probit `z`, Cohen's d as `sqrt(2) * z`, the natural log of the odds ratio, and the odds ratio itself) is useful for interpreting ROC-AUC values in terms of effect sizes commonly used in statistical research. The conversion follows:

- Salgado, J. F. (2018). *Transforming the area under the normal curve (AUC) into Cohen’s d, Pearson’s rpb, odds-ratio, and natural log odds-ratio: Two conversion tables.* European Journal of Psychology Applied to Legal Context, 10(1), 35–47.

## Contributing

We welcome contributions to MLstatkit! Please see our contribution guidelines for more details.

## License

MLstatkit is distributed under the MIT License. For more information, see the LICENSE file in the GitHub repository.

## Update log

- `0.1.9`  
  - **Refactor & modularization**: split `stats.py` into multiple modules (`ci.py`, `conversions.py`, `delong.py`, `metrics.py`, `permutation.py`) for better maintainability, while preserving a unified import interface.  
  - **Functions restored**: `Bootstrapping`, `Permutation_test`, and `AUC2OR` now available again after refactor.  
  - **AUC2OR** updated to use binormal model with exact `norm.ppf`, improving accuracy over the earlier polynomial approximation. Supports `return_all=True` to retrieve intermediate values `(z, d, ln_or, OR)`.  
  - **Improved testing**: added dedicated `tests/` for all core functions (Delong, Bootstrapping, Permutation test, AUC2OR, metrics, imports). Achieved full test coverage (`pytest` 16 passed).  
  - **README.md** updated with revised usage examples and clearer documentation.  
- `0.1.8`   Add `return_ci` option to `Delong_test` for AUC confidence intervals. Add `pyproject.toml`.
- `0.1.7`   Update `README.md`.
- `0.1.6`   Debug.
- `0.1.5`   Update `README.md`. Add `AUC2OR` function.
- `0.1.4`   Update `README.md`. Add `Permutation_tests` function. Re-do `Bootstrapping` parameters.
- `0.1.3`   Update `README.md`.
- `0.1.2`   Add `Bootstrapping` operation process progress display.
- `0.1.1`   Update `README.md`, `setup.py`. Add `CONTRIBUTING.md`.
- `0.1.0`   First edition.

            
