chi-index


| Field | Value |
|:------|:------|
| Name | chi-index |
| Version | 2.1.1 |
| Home page | https://github.com/josemarialuna/ChiIndex |
| Summary | External Clustering Validation Chi index |
| Upload time | 2024-03-12 07:49:07 |
| Author | José María Luna-Romera |
| License | MIT |
| Keywords | chi index, cvi, clustering, machine learning |

<a name="readme-top"></a>
# External Clustering Validation Chi Index
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Stargazers][stars-shield]][stars-url]
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
[![Personal][personal-shield]][personal-url]

[![PyPI](https://img.shields.io/pypi/v/chi-index.svg)](https://pypi.org/project/chi-index/)
[![Downloads](https://static.pepy.tech/badge/chi-index)](https://pepy.tech/project/chi-index)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/chi-index)

## About Chi Index
Chi Index is an external clustering validity index that measures the dependency between the clusters of a clustering result and the class labels of the instances. Although clustering is an unsupervised machine learning technique, Chi Index favours solutions in which each cluster contains as few different labels as possible.

For example, the following image shows three different clustering solutions. Each circle represents an instance of the dataset, and its color indicates the class it belongs to. In solution A, one cluster has 5 red and 2 green instances, while the other has 2 red, 8 green, and 6 blue instances. In solution B, with k=3, the cluster at the top of the figure contains mostly red instances, the one on the left mostly blue, and the one at the bottom mostly green.

<p align="center">
  <img alt="Clustering Solutions" src="images/chi-solutions.jpg" width="60%">
</p>

Chi Index takes the distribution of instances across the clusters, together with the number of instances of each label in each cluster, and computes a metric based on the chi-square statistic. The following table shows the Chi Index for each of the clustering solutions.

| k | Chi Index(k) |
|:-:|:------------:|
| 2 | 0.890 |
| 3 | **0.925** |
| 4 | 0.760 |

As we can see, the highest Chi Index value is obtained for k=3, which indicates that 3 is the optimal number of clusters for separating instances of the same label into clusters.

The higher the Chi Index value, the greater the dependency between clusters and labels. In other words, the clustering solution with the highest Chi Index is the one in which instances of the same class are grouped together as well as possible.
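
To make the idea of "a metric based on the chi-square statistic" more concrete, here is a minimal sketch that computes the plain Pearson chi-square statistic for the cluster/label counts of solution A described above. The function name `chi_square_statistic` is a hypothetical helper, not part of the library, and the result is the raw statistic, not the Chi Index itself: the way the library turns it into values such as those in the table is described in the cited paper and is not reproduced here.

```python
def chi_square_statistic(table):
    """Pearson's chi-square statistic for a contingency table.

    table[i][j] is the number of instances in cluster i that carry label j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if clusters and labels were independent
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty clusters or unused labels
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Solution A from the image; columns are (red, green, blue)
solution_a = [
    [5, 2, 0],  # first cluster
    [2, 8, 6],  # second cluster
]
print(chi_square_statistic(solution_a))
```

A statistic of 0 means that clusters and labels look independent, and larger values mean a stronger dependency, which matches the intuition behind the index.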

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Getting Started
Using Chi Index takes only a few steps: install the library with pip, then import it into your Python application.

### Installing Chi Index

The Chi Index version in this repository is implemented in Python. Any Python version from 3.7 onwards works, although 3.10 is recommended. To install the library, run the following command:
```bash
pip install chi-index
```
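
If you want to confirm that the installation worked, one possible check is to ask Python for the installed version of the distribution. This is only a sketch: `installed_version` is a hypothetical helper, and it relies on the standard library module `importlib.metadata`, which is available from Python 3.8 onwards.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if chi-index is installed, otherwise None
print(installed_version("chi-index"))
```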

### Examples

There are two examples of how to use the library. The first is similar to other metrics such as *silhouette_score* from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), and the second uses a class that also runs the k-means clustering for you.

**Note**: To run these examples, you must first install the Chi Index library with the command from the previous section.
After that, download the file iris.data from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/iris) and place it in a folder called "data". The example code below reads it from `./test/data/iris.data`, so adjust the path if you store the file elsewhere. For convenience, here is a direct link: [iris.data](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
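
The download step can also be scripted. The following is a minimal sketch, not part of the library: `ensure_iris` is a hypothetical helper, the URL is the direct link given above, and the default destination is assumed from the path that the example code reads (`./test/data/iris.data`).

```python
from pathlib import Path
from urllib.request import urlretrieve

# Direct link to the dataset, as given in the note above
IRIS_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"


def ensure_iris(dest="test/data/iris.data", url=IRIS_URL, fetch=urlretrieve):
    """Download iris.data to `dest` unless the file is already there."""
    dest = Path(dest)
    if dest.exists():
        return dest  # nothing to do, keep the existing copy
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(dest))  # fetch(url, filename), same shape as urlretrieve
    return dest
```

Calling `ensure_iris()` once before running the examples creates the folder if needed and skips the download when the file already exists.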

#### Example 1

This is the simplest example, and it works much like other common metrics such as *silhouette_score*:
```python
import pandas as pd
from chi_index import metrics
from sklearn import cluster
import numpy as np

def main():
    # Load the Iris dataset; the file has no header row
    df = pd.read_csv('./test/data/iris.data', delimiter=",", header=None)
    print(df.columns)
    print(df.head())
    df.rename(columns={4: 'Class'}, inplace=True)  # column 4 holds the class label

    # Feature matrix without the class column
    X = np.array(df.drop(['Class'], axis=1))

    for clusters_num in range(2, 11):
        # Clustering stage
        kmeans_model = cluster.KMeans(n_clusters=clusters_num, n_init=100, max_iter=500, init='random').fit(X)
        labels = kmeans_model.predict(X)
        df['cluster'] = labels   # save the cluster assignments in a new 'cluster' column

        # chi_index_score receives the clustering result array and the class array
        score = metrics.chi_index_score(df['cluster'], df['Class'], k=clusters_num)
        print(clusters_num, '\t', score)


if __name__ == "__main__":
    main()
```
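
Example 1 prints one value per k. To pick the best k programmatically, you could store each value returned by `metrics.chi_index_score` in a dictionary keyed by k and then take the maximum. The sketch below shows only that selection step; `best_k` is a hypothetical helper, and the scores are the ones from the table in the "About Chi Index" section, not the output of a real run on Iris.

```python
def best_k(scores):
    """Return the (k, score) pair with the highest score."""
    if not scores:
        raise ValueError("scores is empty")
    k = max(scores, key=scores.get)
    return k, scores[k]


# Values copied from the table in "About Chi Index"
example_scores = {2: 0.890, 3: 0.925, 4: 0.760}
print(best_k(example_scores))  # (3, 0.925)
```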

#### Example 2

In this case, the `ChiIndex` class includes all the code needed to run k-means. You can copy and paste the following code, which uses the Iris dataset:

```python 
import pandas as pd
from chi_index.model import ChiIndex


def main():
    df = pd.read_csv('./test/data/iris.data', delimiter=",", header=None)
    print(df.columns)
    print(df.head())
    df.rename(columns={4: 'Class'}, inplace=True)

    chi = ChiIndex(df, results_path='result')
    print(chi.list_chi)
    print(chi.optimum_chi)
    print(chi.optimum_k)
    chi.save_centroids()


if __name__ == "__main__":
    main()
```

If you run into any problems or cannot get the code to run, please contact me through [Discussions](https://github.com/josemarialuna/Chi-Index/discussions) so I can help you.

<p align="right">(<a href="#readme-top">back to top</a>)</p>



## Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**. Read [CONTRIBUTING.md](CONTRIBUTING.md). We appreciate all kinds of help.
<p align="right">(<a href="#readme-top">back to top</a>)</p>

## License

This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Contact 

* **José María Luna-Romera** - [Personal site](https://josemarialuna.com/)
* **José C. Riquelme** - [Research Group](https://grupo.us.es/minerva/)

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Cite this
Please cite as: Luna-Romera JM, Martínez-Ballesteros M, García-Gutiérrez J, Riquelme JC. External clustering validity index based on chi-squared statistical test. Information Sciences (2019) 487: 1-17. https://doi.org/10.1016/j.ins.2019.02.046. (http://www.sciencedirect.com/science/article/pii/S0020025519301550)
```
@article{LUNAROMERA20191,
title = {External clustering validity index based on chi-squared statistical test},
journal = {Information Sciences},
volume = {487},
pages = {1-17},
year = {2019},
issn = {0020-0255},
doi = {https://doi.org/10.1016/j.ins.2019.02.046},
url = {https://www.sciencedirect.com/science/article/pii/S0020025519301550},
author = {José María Luna-Romera and María Martínez-Ballesteros and Jorge García-Gutiérrez and José C. Riquelme},
keywords = {Clustering analysis, External validity indices, Comparing clusters, Big data}
}
```

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- MARKDOWN LINKS & IMAGES -->
<!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->
[contributors-shield]: https://img.shields.io/github/contributors/josemarialuna/Chi-Index.svg?style=for-the-badge
[contributors-url]: https://github.com/josemarialuna/Chi-Index/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/josemarialuna/Chi-Index.svg?style=for-the-badge
[forks-url]: https://github.com/josemarialuna/Chi-Index/network/members
[stars-shield]: https://img.shields.io/github/stars/josemarialuna/Chi-Index.svg?style=for-the-badge
[stars-url]: https://github.com/josemarialuna/Chi-Index/stargazers
[issues-shield]: https://img.shields.io/github/issues/josemarialuna/Chi-Index.svg?style=for-the-badge
[issues-url]: https://github.com/josemarialuna/Chi-Index/issues
[license-shield]: https://img.shields.io/github/license/josemarialuna/Chi-Index.svg?style=for-the-badge
[license-url]: https://github.com/josemarialuna/Chi-Index/blob/master/LICENSE.txt
[personal-shield]: https://img.shields.io/badge/Personal%20Site-555?style=for-the-badge
[personal-url]: https://josemarialuna.com



            
