leakyblobs


Name: leakyblobs
Version: 0.2.4
Home page: https://github.com/hfawal/clustering_leakage_analysis
Summary: Clustering leakage analysis library.
Upload time: 2024-08-13 11:59:13
Author: Hady Fawal
License: Equancy All Rights Reserved
            
# LeakyBlobs

Evaluates the quality of a clustering by examining the leakage between clusters, using the predicted probabilities of a classification model.

---

**NOTE: This is not the full documentation.**
**[Read the docs here.](https://spiffy-bow-8b4.notion.site/LeakyBlobs-b17dd46549f64df4bf617e63d4f3bc01)**

---

## Overview

LeakyBlobs is a Python package that offers an alternative to traditional ways of evaluating the quality of a clustering, such as the Elbow Method, Silhouette Score, and Gap Statistic. These methods tend to reduce cluster evaluation to a single number that is hard for a person to interpret, which often leads to highly subjective choices of clustering hyperparameters, such as the number of clusters in algorithms like KMeans.

LeakyBlobs is instead based on the idea that a good clustering is a *predictable* clustering. The package provides tools to train simple classifiers that predict cluster membership, and tools to analyze the classifiers' probability outputs to see how much clusters 'leak' into each other.
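To make the notion of 'leakage' more concrete, here is a minimal sketch in plain Python. It is not the LeakyBlobs implementation and does not reproduce its metrics; the function name `mean_probability_matrix` and the probability values are invented for illustration. It assumes a classifier has already produced one probability vector per point, and it averages those vectors within each actual cluster.

```python
def mean_probability_matrix(true_clusters, probabilities, n_clusters):
    """Average the predicted probability vectors of the points in each cluster.

    Entry [a][b] is the mean probability assigned to cluster b for points
    whose actual cluster is a. (Hypothetical helper, not part of leakyblobs.)
    """
    totals = [[0.0] * n_clusters for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for cluster, probs in zip(true_clusters, probabilities):
        counts[cluster] += 1
        for other, p in enumerate(probs):
            totals[cluster][other] += p
    # Rows for empty clusters are left as zeros to avoid dividing by zero.
    return [
        [total / counts[a] if counts[a] else 0.0 for total in totals[a]]
        for a in range(n_clusters)
    ]


# Made-up classifier outputs for six points spread over three clusters.
true_clusters = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.90, 0.10, 0.00],
    [0.70, 0.30, 0.00],
    [0.20, 0.80, 0.00],
    [0.40, 0.60, 0.00],
    [0.00, 0.00, 1.00],
    [0.00, 0.10, 0.90],
]

matrix = mean_probability_matrix(true_clusters, probabilities, n_clusters=3)
for row in matrix:
    print([round(p, 2) for p in row])
```

In this toy data, clusters 0 and 1 place a noticeable share of probability on each other, while cluster 2 is almost fully separated. Large off-diagonal entries are the kind of overlap that the term 'leakage' refers to.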

## Installation
The package is available on PyPI and can be installed with pip:
```bash
# Install the package
pip install leakyblobs
```

## Dependencies

```
numpy>=1.26.1
pandas>=2.0.0
openpyxl>=3.1.5
pyvis>=0.3.2
plotly>=5.20.0
scipy>=1.14.0
setuptools>=72.1.0
scikit-learn>=1.5.1
```

## Usage

Below is a short example of how to use the LeakyBlobs package.

**[Read the full documentation here.](https://spiffy-bow-8b4.notion.site/LeakyBlobs-b17dd46549f64df4bf617e63d4f3bc01)**

```python
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from leakyblobs import ClusterPredictor, ClusterEvaluator

# Load the iris data set as a pandas DataFrame, appending the target as the
# last column. The target plays the role of the cluster label in this example.
iris = load_iris()
data = pd.DataFrame(
    np.concatenate((iris.data, np.array([iris.target]).T), axis=1),
    columns=iris.feature_names + ['target']
)

# Turn the row index into a string ID column and cast the labels to integers.
data = data.reset_index()
data["index"] = data["index"].astype("str")
data["target"] = data["target"].astype("int32")

# Use the leakyblobs package to train a cluster classification model.
predictor = ClusterPredictor(data, 
                             id_col="index", 
                             target_col="target",
                             nonlinear_boundary=True)

# Get the predictions and probability outputs on the test set.
test_predictions = predictor.get_test_predictions()

# Use the leakyblobs package to evaluate the leakage of a clustering
# given a cluster classification model's predictions and probability outputs.
evaluator = ClusterEvaluator(test_predictions)

# Save visualization in working directory.
evaluator.save_leakage_graph(detection_thresh=0.05,
                             leakage_thresh=0.02,
                             filename="blob_graph.html")

# Save report with leakage metrics in working directory.
evaluator.save_leakage_report(detection_thresh=0.05,
                              leakage_thresh=0.02,
                              significance_level=0.05,
                              filename="blob_report.xlsx")
```
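The example above uses the iris species as stand-in cluster labels. With labels from an actual clustering algorithm, the data first has to be put into the same layout: a string ID column, the feature columns, and an integer label column. The helper below is a sketch of that step under those assumptions. `build_cluster_frame` is a hypothetical name and is not part of leakyblobs; the column names `index` and `target` simply mirror the example above.

```python
import numpy as np
import pandas as pd


def build_cluster_frame(features: pd.DataFrame, labels) -> pd.DataFrame:
    """Combine features and cluster labels into one DataFrame.

    Hypothetical helper: the output has a string "index" column, the
    original feature columns, and an int32 "target" column.
    """
    labels = np.asarray(labels)
    if len(labels) != len(features):
        raise ValueError("features and labels must have the same number of rows")

    # Discard the original index and use row positions as string IDs.
    frame = features.reset_index(drop=True).copy()
    frame.insert(0, "index", [str(i) for i in range(len(frame))])
    frame["target"] = labels.astype("int32")
    return frame


# Toy features and made-up cluster labels, for illustration only.
features = pd.DataFrame(
    {"x": [0.1, 0.2, 0.3], "y": [1.0, 2.0, 3.0]},
    index=[10, 11, 12],
)
frame = build_cluster_frame(features, [0, 1, 0])
print(frame)
```

The resulting frame can then be passed to `ClusterPredictor` with `id_col="index"` and `target_col="target"`, in the same way as `data` in the example above.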

## License

Equancy All Rights Reserved

            

        