CPFcluster

Name	CPFcluster JSON
Version	3.0 JSON
	download
home_page	None
Summary	An Implementation of Component-wise Peak Finding Clustering Method
upload_time	2024-12-06 23:16:49
maintainer	None
docs_url	None
author	Joshua Tobin, Mimi Zhang
requires_python	None
license	MIT
keywords	density-peak-clustering clustering machine-learning
VCS
bugtrack_url
requirements	numpy scipy scikit-learn matplotlib umap-learn
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # CPFcluster: the Component-wise Peak-Finding algorithm

To install the CPFcluster package: 

```sh
$ pip install CPFcluster
```


If you used this package in your research, please cite it:
```latex
@ARTICLE{10296014,
  author={Tobin, Joshua and Zhang, Mimi},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm}, 
  year={2024},
  volume={46},
  number={2},
  pages={1109-1120},
  doi={10.1109/TPAMI.2023.3327471}}
```


---


## Class: CPFcluster

**CPFcluster** is a scalable and flexible density-based clustering method that integrates the strengths of density-level set and mode-seeking approaches. This combination offers several advantages, including: (1) the ability to detect outliers, (2) effective identification of clusters with varying densities and overlapping regions, and (3) robustness against spurious density maxima. The `CPFcluster` class is designed to manage outliers, merge clusters seamlessly, and visualize clustering results using techniques like PCA, UMAP and t-SNE.
```python
CPFcluster(
    min_samples=5,               # minimum number of neighbors to consider for connectivity #
    rho=0.4,                     # parameter that controls the number of clusters for each component set #
    alpha=1,                     # parameter for edge-cutoff in cluster detection #
    n_jobs=1,                    # number of parallel jobs for computation #
    cutoff=1,                    # threshold for filtering out small connected components as outliers #
    merge=False,                 # whether to merge similar clusters based on thresholds #
    merge_threshold=0.5,         # distance threshold for merging clusters #
    density_ratio_threshold=0.1, # density ratio threshold for merging clusters #
    distance_metric='euclidean', # metric for distance computation (e.g., 'euclidean', 'manhattan', 'cosine') #
    remove_duplicates=False,     # whether to remove duplicate data points before clustering #
    plot_umap=False,             # whether to plot UMAP visualization after clustering #
    plot_pca=False,              # whether to plot PCA visualization after clustering #
    plot_tsne=False              # whether to plot t-SNE visualization after clustering #
)
```

### Parameters

- **`min_samples`** *(int)*:  
Number of nearest-neighbors used to create connected components from the dataset and compute the density. This parameter is used in the `build_CCgraph` function to construct the k-NN graph and extract the component sets.  
 *Default*: `5`

- **`rho`** *(float)*:  
  The `rho` parameter in Definition 10 of the paper "A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm". Varying the parameter `rho` determines the number of clusters for each component set.  
  *Default*: `0.4`

- **`alpha`** *(float)*:  
  An optional parameter used to set the threshold for edge weights during center selection, not discussed in the paper.  
  *Default*: `1`

- **`n_jobs`** *(int)*:  
  Number of parallel jobs for computation. Specify `n_jobs=-1` (and include the `__name__ == "__main__":` line in your script) to use all cores.  
  *Default*: `1`
  
- **`cutoff`** *(int)*:  
  In the mutual k-NN graph, vertices with a number of edges less than or equal to the specified `cutoff` value are identified as outliers.  
  *Default*: `1` 
  
- **`merge`** *(bool)*:  
  Specifies whether to merge clusters that are similar based on distance and density-ratio thresholds. Two clusters will be merged only if the distance between their centroids is less than the `merge_threshold` AND the density ratio exceeds the `density_ratio_threshold`.    
  *Default*: `False` 
  
- **`merge_threshold`** *(float)*:  
  The distance threshold that determines whether two clusters should be merged. Clusters will be merged if the distance between their centroids is less than the `merge_threshold`. This parameter helps to combine clusters that are close in the feature space, potentially reducing over-segmentation. A range of 0.1–1.0 works well across diverse datasets (after standardization).  
  *Default*: `0.5`  

- **`density_ratio_threshold`** *(float)*:  
  The density ratio threshold that determines whether two clusters should be merged. Clusters are merged if the ratio of densities between two clusters (lower density/higher density) exceeds the `density_ratio_threshold`, ensuring that only clusters with comparable densities are merged. A range of 0.1–0.5 is observed to work well across various datasets (after standardization).   
  *Default*: `0.1`  
  
- **`distance_metric`** *(str)*:  
  Metric to use for distance computation. Options include:  
  - `'euclidean'`: Euclidean distance (default).  
  - `'manhattan'`: sum of absolute differences.  
  - `'cosine'`: cosine similarity-based distance.  
  - `'chebyshev'`: maximum difference along any dimension.  
  - `'minkowski'`: generalized distance metric requiring a parameter \(p\) (e.g., \(p=1\) for Manhattan, \(p=2\) for Euclidean).  
  - `'hamming'`: fraction of differing attributes between samples (useful for binary data).  
  - `'jaccard'`: used for binary attributes to measure similarity based on set intersection and union.  

- **`remove_duplicates`** *(bool)*:  
  Whether to remove duplicate data points before clustering.  
  *Default*: `False`

- **`plot_clusters_umap`** *(bool)*:  
  Whether to visualize the clusters via UMAP.  
  *Default*: `False`

- **`plot_clusters_pca`** *(bool)*:  
  Whether to visualize the clusters via PCA.  
  *Default*: `False`
  
- **`plot_clusters_tsne`** *(bool)*:  
  Whether to visualize the clusters via t-SNE.  
  *Default*: `False`


### Methods

- `fit(X)`: Apply the CPF method to the input data X. <br>  
  - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
  - **Returns**:
    - None. Update the instance attributes with identified cluster labels. Outliers are labeled as `-1`.

- `calculate_centroids_and_densities(X, labels)`: Calculates the centroid and average density of each cluster.<br>  
  - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
  - `labels` *(np.ndarray)*: cluster labels of the input data.
  - **Returns**:
    - `centroids` *(np.ndarray)*: centroids of the clusters.
    - `densities` *(np.ndarray)*:  average density of each cluster.

- `merge_clusters(X, centroids, densities, labels)`: Merges similar clusters based on the distance and density-ratio thresholds.<br>  
  - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
  - `centroids` *(np.ndarray)*: centroids of the clusters.
  - `densities` *(np.ndarray)*:  average density of each cluster.
  - `labels` *(np.ndarray)*: cluster label for each sample. 
  - **Returns**: 
    - `labels` *(np.ndarray)*: updated cluster labels after merging.
    

---

## Helper Functions

### `build_CCgraph(X, min_samples, cutoff, n_jobs, distance_metric='euclidean')`
Construct the k-NN graph and extract the connected components.

- **Parameters**:
  - **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
  - ...

- **Returns**:
  - **`components`** *(np.ndarray)*: connected component for each sample. If a sample is an outlier, the value will be NaN.
  - **`CCmat`** *(scipy.sparse.csr_matrix)*: an n-by-n sparse matrix representation of the k-NN graph.
  - **`knn_radius`** *(np.ndarray)*: distance to the min_samples-th (i.e., k-th) neighbor for each sample.


### `get_density_dists_bb(X, k, components, knn_radius, n_jobs)`
Identify the "big brother" (nearest neighbor of higher density) for each sample.

- **Parameters**:
  - **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
  - **`k`** *(int)*: identical to `min_samples`, the neighborhood parameter \(k\) in the mutual k-NN graph.
  - ...

- **Returns**:
  - **`best_distance`** *(np.ndarray)*: best distance for each sample.
  - **`big_brother`** *(np.ndarray)*: index of the "big brother" for each sample.


### `get_y(CCmat, components, knn_radius, best_distance, big_brother, rho, alpha, d)`
Assigns cluster labels to data points based on density and connectivity properties.

- **Parameters**:
  - ...
  - **`d`** *(int)*: dimension of the input data `X`.

- **Returns**:
  - **`y_pred`** *(np.ndarray)*: identified cluster labels for the input data.

---

## Code Example 1: Clustering with the Dermatology Dataset
The script below demonstrates how to use CPFcluster with the Dermatology dataset, available in the Data folder.  

```python

import os
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
from core import CPFcluster


# Define the main function to utilize Python's multiprocessing unit (in Windows OS).
def main():
    # Step 1: Load the Data
    
    # Load the Dermatology dataset
    Data = np.load("Data/dermatology.npy")
    X = Data[:,:-1]  # Feature data
    y = Data[:,-1]  # True labels (used here for evaluation, not clustering)

    # Normalize dataset for easier hyperparameter tuning
    X = StandardScaler().fit_transform(X)
    
    
    # Step 2: Initialize CPFcluster
    cpf = CPFcluster(
        min_samples=10,
        rho=0.5,
        alpha=1.0,
        merge=True,
        merge_threshold=0.6,
        n_jobs=-1,
        plot_tsne=True,
        plot_pca=True,
        plot_umap=True
    )
    
    
    # Step 3: Fit the Model
    cpf.fit(X)

    # access the cluster labels
    print("Cluster labels:", cpf.labels)
    
    
    # Step 4: Calculate Cluster Validity Indices
    ari = adjusted_rand_score(y, cpf.labels)
    print(f"Adjusted Rand Index (ARI): {ari:.2f}")


if __name__ == "__main__":
    main()


```


### Visualize Results

If `plot_tsne`, `plot_pca`, or `plot_umap` was set to `True`, clustering results will be visualized automatically after fitting the model. Here are the PCA, UMAP, and t-SNE visualizations for the Dermatology dataset.


## Code Example 2: Grid Search for Optimal Parameter Configuration.
The script below demonstrates how to tune the parameters to obtain the highest Calinski-Harabasz score.  

```python

import os
import numpy as np
import csv
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import calinski_harabasz_score
from itertools import product
from time import time
from core import CPFcluster  
from tqdm import tqdm

def main():
    # Define datasets and parameter ranges
    datasets = ["seeds", "glass", "vertebral", "ecoli", "dermatology"]

    # Parameter ranges for grid search
    min_samples_range = [3, 5, 10]
    rho_range = [0.1, 0.4, 0.7]
    alpha_range = [0.5, 1, 1.5]
    cutoff_range = [1, 2, 3]
    merge_threshold_range = [0.3, 0.5, 0.7]
    density_ratio_threshold_range = [0.05, 0.1, 0.2]

    # Directory for saving results
    results_dir = "Results"
    if not os.path.exists(results_dir):
        os.makedirs(results_dir)

    results_path = os.path.join(results_dir, "CPFcluster_grid_search_results.csv")
    
    # Write header to results CSV
    with open(results_path, 'w', newline='') as fd:
        writer = csv.writer(fd)
        writer.writerow(["Dataset", "min_samples", "rho", "alpha", "cutoff", 
                         "merge_threshold", "density_ratio_threshold", 
                         "CH_Index", "Time"])

    # Iterate through datasets
    for dataset in datasets:
        # Load dataset
        Data = np.load(f"Data/{dataset}.npy")  
        X = Data[:, :-1]
        y = Data[:, -1]
        
        # Normalize dataset for easier parameter selection
        X = StandardScaler().fit_transform(X)

        # Parameter grid
        param_grid = product(min_samples_range, rho_range, alpha_range, 
                             cutoff_range, merge_threshold_range, 
                             density_ratio_threshold_range)

        for params in tqdm(param_grid, desc=f"Processing {dataset}"):
            min_samples, rho, alpha, cutoff, merge_threshold, density_ratio_threshold = params
            start_time = time()

            # Initialize CPFcluster with current parameters
            model = CPFcluster(
                min_samples=min_samples,
                rho=rho,
                alpha=alpha,
                cutoff=cutoff,
                merge=True,
                merge_threshold=merge_threshold,
                density_ratio_threshold=density_ratio_threshold,
                n_jobs=-1
            )
            
            # Fit model
            model.fit(X)
            labels = model.labels

            # Skip iteration if all points are assigned to one cluster
            if len(np.unique(labels)) < 2:
                continue

            # Compute the adjusted Calinski-Harabasz index
            ch_index = calinski_harabasz_score(X, labels)
            elapsed_time = time() - start_time

            # Write results to CSV
            with open(results_path, 'a', newline='') as fd:
                writer = csv.writer(fd)
                writer.writerow([dataset, min_samples, rho, alpha, cutoff, 
                                 merge_threshold, density_ratio_threshold, 
                                 ch_index, elapsed_time])

if __name__ == "__main__":
    main()

```


---

## Data

The data folder contains the following datasets for testing and benchmarking different clustering methods. The following datasets are from the UCI Machine Learning Repository:

- **Dermatology**: The Dermatology dataset for classifying erythemato-squamous diseases.

- **Ecoli**: This dataset predicts the localization sites of proteins within E. coli cells.

- **Glass Identification**: This dataset is used for classifying glass types for forensic analysis.

- **HTRU2**: HTRU2 distinguishes pulsar candidates from non-pulsar stars based on astronomical data.

- **Letter Recognition**: The dataset is designed to classify uppercase English alphabet letters.

- **MAGIC Gamma Telescope**: This dataset predicts whether detected particles are high-energy gamma rays.

- **Optical Recognition of Handwritten Digits**: This dataset is used for classifying handwritten digit images.

- **Page Blocks Classification**: Classify blocks of text in scanned documents into various categories.

- **Pen-Based Recognition of Handwritten Digits**: The dataset is for digit classification based on pen movement data.

- **Seeds**: This dataset classifies wheat seed varieties based on geometric properties.

- **Vertebral Column**: The dataset is used to classify conditions of vertebral disks as normal or abnormal.

The following two datasets are from Kaggle: **Fraud Detection Bank** and **Paris Housing Classification**. The **Phoneme** dataset is from R (https://r-packages.io/datasets/phoneme).  


---

## Key Features


1. **Outlier Detection**:  
   The algorithm identifies and excludes small connected components as outliers based on the `cutoff` parameter.

2. **Cluster Merging**:  
   Merges similar clusters by evaluating proximity (`merge_threshold`) and density similarity (`density_ratio_threshold`).

4. **Scalability**:  
   Supports parallel processing with the `n_jobs` parameter, enabling efficient computation for large datasets.

5. **Customizable Metrics**:  
   Offers flexibility through the `distance_metric` parameter, which supports multiple distance measures (e.g., `'euclidean'`, `'manhattan'`, `'cosine'`).

6. **Visualization Support**:  
   Includes built-in options for t-SNE, PCA, and UMAP visualizations to enhance interpretability of clustering results.


## Limitations

1. **Parameter Sensitivity**:  
   The results can be sensitive to the choice of `rho`, `alpha`, and `cutoff`, which require careful tuning for optimal results.

2. **Computational Overhead**:  
   Computing the nearest-neighbor graph for very large datasets can demand significant memory and processing resources.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "CPFcluster",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "density-peak-clustering, clustering, machine-learning",
    "author": "Joshua Tobin, Mimi Zhang",
    "author_email": "tobinjo@tcd.ie",
    "download_url": "https://files.pythonhosted.org/packages/1b/2c/ca9423d939f6223a2415015b8ee844e7eb36fab78d6d6b2f270e39bb833e/CPFcluster-3.0.tar.gz",
    "platform": null,
    "description": "# CPFcluster: the Component-wise Peak-Finding algorithm\n\nTo install the CPFcluster package: \n\n```sh\n$ pip install CPFcluster\n```\n\n\nIf you used this package in your research, please cite it:\n```latex\n@ARTICLE{10296014,\n  author={Tobin, Joshua and Zhang, Mimi},\n  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, \n  title={A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm}, \n  year={2024},\n  volume={46},\n  number={2},\n  pages={1109-1120},\n  doi={10.1109/TPAMI.2023.3327471}}\n```\n\n\n---\n\n\n## Class: CPFcluster\n\n**CPFcluster** is a scalable and flexible density-based clustering method that integrates the strengths of density-level set and mode-seeking approaches. This combination offers several advantages, including: (1) the ability to detect outliers, (2) effective identification of clusters with varying densities and overlapping regions, and (3) robustness against spurious density maxima. The `CPFcluster` class is designed to manage outliers, merge clusters seamlessly, and visualize clustering results using techniques like PCA, UMAP and t-SNE.\n```python\nCPFcluster(\n    min_samples=5,               # minimum number of neighbors to consider for connectivity #\n    rho=0.4,                     # parameter that controls the number of clusters for each component set #\n    alpha=1,                     # parameter for edge-cutoff in cluster detection #\n    n_jobs=1,                    # number of parallel jobs for computation #\n    cutoff=1,                    # threshold for filtering out small connected components as outliers #\n    merge=False,                 # whether to merge similar clusters based on thresholds #\n    merge_threshold=0.5,         # distance threshold for merging clusters #\n    density_ratio_threshold=0.1, # density ratio threshold for merging clusters #\n    distance_metric='euclidean', # metric for distance computation (e.g., 'euclidean', 'manhattan', 'cosine') #\n    remove_duplicates=False,     # whether to remove duplicate data points before clustering #\n    plot_umap=False,             # whether to plot UMAP visualization after clustering #\n    plot_pca=False,              # whether to plot PCA visualization after clustering #\n    plot_tsne=False              # whether to plot t-SNE visualization after clustering #\n)\n```\n\n### Parameters\n\n- **`min_samples`** *(int)*:  \nNumber of nearest-neighbors used to create connected components from the dataset and compute the density. This parameter is used in the `build_CCgraph` function to construct the k-NN graph and extract the component sets.  \n *Default*: `5`\n\n- **`rho`** *(float)*:  \n  The `rho` parameter in Definition 10 of the paper \"A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm\". Varying the parameter `rho` determines the number of clusters for each component set.  \n  *Default*: `0.4`\n\n- **`alpha`** *(float)*:  \n  An optional parameter used to set the threshold for edge weights during center selection, not discussed in the paper.  \n  *Default*: `1`\n\n- **`n_jobs`** *(int)*:  \n  Number of parallel jobs for computation. Specify `n_jobs=-1` (and include the `__name__ == \"__main__\":` line in your script) to use all cores.  \n  *Default*: `1`\n  \n- **`cutoff`** *(int)*:  \n  In the mutual k-NN graph, vertices with a number of edges less than or equal to the specified `cutoff` value are identified as outliers.  \n  *Default*: `1` \n  \n- **`merge`** *(bool)*:  \n  Specifies whether to merge clusters that are similar based on distance and density-ratio thresholds. Two clusters will be merged only if the distance between their centroids is less than the `merge_threshold` AND the density ratio exceeds the `density_ratio_threshold`.    \n  *Default*: `False` \n  \n- **`merge_threshold`** *(float)*:  \n  The distance threshold that determines whether two clusters should be merged. Clusters will be merged if the distance between their centroids is less than the `merge_threshold`. This parameter helps to combine clusters that are close in the feature space, potentially reducing over-segmentation. A range of 0.1\u20131.0 works well across diverse datasets (after standardization).  \n  *Default*: `0.5`  \n\n- **`density_ratio_threshold`** *(float)*:  \n  The density ratio threshold that determines whether two clusters should be merged. Clusters are merged if the ratio of densities between two clusters (lower density/higher density) exceeds the `density_ratio_threshold`, ensuring that only clusters with comparable densities are merged. A range of 0.1\u20130.5 is observed to work well across various datasets (after standardization).   \n  *Default*: `0.1`  \n  \n- **`distance_metric`** *(str)*:  \n  Metric to use for distance computation. Options include:  \n  - `'euclidean'`: Euclidean distance (default).  \n  - `'manhattan'`: sum of absolute differences.  \n  - `'cosine'`: cosine similarity-based distance.  \n  - `'chebyshev'`: maximum difference along any dimension.  \n  - `'minkowski'`: generalized distance metric requiring a parameter \\(p\\) (e.g., \\(p=1\\) for Manhattan, \\(p=2\\) for Euclidean).  \n  - `'hamming'`: fraction of differing attributes between samples (useful for binary data).  \n  - `'jaccard'`: used for binary attributes to measure similarity based on set intersection and union.  \n\n- **`remove_duplicates`** *(bool)*:  \n  Whether to remove duplicate data points before clustering.  \n  *Default*: `False`\n\n- **`plot_clusters_umap`** *(bool)*:  \n  Whether to visualize the clusters via UMAP.  \n  *Default*: `False`\n\n- **`plot_clusters_pca`** *(bool)*:  \n  Whether to visualize the clusters via PCA.  \n  *Default*: `False`\n  \n- **`plot_clusters_tsne`** *(bool)*:  \n  Whether to visualize the clusters via t-SNE.  \n  *Default*: `False`\n\n\n### Methods\n\n- `fit(X)`: Apply the CPF method to the input data X. <br>  \n  - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n  - **Returns**:\n    - None. Update the instance attributes with identified cluster labels. Outliers are labeled as `-1`.\n\n- `calculate_centroids_and_densities(X, labels)`: Calculates the centroid and average density of each cluster.<br>  \n  - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n  - `labels` *(np.ndarray)*: cluster labels of the input data.\n  - **Returns**:\n    - `centroids` *(np.ndarray)*: centroids of the clusters.\n    - `densities` *(np.ndarray)*:  average density of each cluster.\n\n- `merge_clusters(X, centroids, densities, labels)`: Merges similar clusters based on the distance and density-ratio thresholds.<br>  \n  - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n  - `centroids` *(np.ndarray)*: centroids of the clusters.\n  - `densities` *(np.ndarray)*:  average density of each cluster.\n  - `labels` *(np.ndarray)*: cluster label for each sample. \n  - **Returns**: \n    - `labels` *(np.ndarray)*: updated cluster labels after merging.\n    \n\n---\n\n## Helper Functions\n\n### `build_CCgraph(X, min_samples, cutoff, n_jobs, distance_metric='euclidean')`\nConstruct the k-NN graph and extract the connected components.\n\n- **Parameters**:\n  - **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n  - ...\n\n- **Returns**:\n  - **`components`** *(np.ndarray)*: connected component for each sample. If a sample is an outlier, the value will be NaN.\n  - **`CCmat`** *(scipy.sparse.csr_matrix)*: an n-by-n sparse matrix representation of the k-NN graph.\n  - **`knn_radius`** *(np.ndarray)*: distance to the min_samples-th (i.e., k-th) neighbor for each sample.\n\n\n### `get_density_dists_bb(X, k, components, knn_radius, n_jobs)`\nIdentify the \"big brother\" (nearest neighbor of higher density) for each sample.\n\n- **Parameters**:\n  - **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n  - **`k`** *(int)*: identical to `min_samples`, the neighborhood parameter \\(k\\) in the mutual k-NN graph.\n  - ...\n\n- **Returns**:\n  - **`best_distance`** *(np.ndarray)*: best distance for each sample.\n  - **`big_brother`** *(np.ndarray)*: index of the \"big brother\" for each sample.\n\n\n### `get_y(CCmat, components, knn_radius, best_distance, big_brother, rho, alpha, d)`\nAssigns cluster labels to data points based on density and connectivity properties.\n\n- **Parameters**:\n  - ...\n  - **`d`** *(int)*: dimension of the input data `X`.\n\n- **Returns**:\n  - **`y_pred`** *(np.ndarray)*: identified cluster labels for the input data.\n\n---\n\n## Code Example 1: Clustering with the Dermatology Dataset\nThe script below demonstrates how to use CPFcluster with the Dermatology dataset, available in the Data folder.  \n\n```python\n\nimport os\nimport numpy as np\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.metrics import adjusted_rand_score\nfrom core import CPFcluster\n\n\n# Define the main function to utilize Python's multiprocessing unit (in Windows OS).\ndef main():\n    # Step 1: Load the Data\n    \n    # Load the Dermatology dataset\n    Data = np.load(\"Data/dermatology.npy\")\n    X = Data[:,:-1]  # Feature data\n    y = Data[:,-1]  # True labels (used here for evaluation, not clustering)\n\n    # Normalize dataset for easier hyperparameter tuning\n    X = StandardScaler().fit_transform(X)\n    \n    \n    # Step 2: Initialize CPFcluster\n    cpf = CPFcluster(\n        min_samples=10,\n        rho=0.5,\n        alpha=1.0,\n        merge=True,\n        merge_threshold=0.6,\n        n_jobs=-1,\n        plot_tsne=True,\n        plot_pca=True,\n        plot_umap=True\n    )\n    \n    \n    # Step 3: Fit the Model\n    cpf.fit(X)\n\n    # access the cluster labels\n    print(\"Cluster labels:\", cpf.labels)\n    \n    \n    # Step 4: Calculate Cluster Validity Indices\n    ari = adjusted_rand_score(y, cpf.labels)\n    print(f\"Adjusted Rand Index (ARI): {ari:.2f}\")\n\n\nif __name__ == \"__main__\":\n    main()\n\n\n```\n\n\n### Visualize Results\n\nIf `plot_tsne`, `plot_pca`, or `plot_umap` was set to `True`, clustering results will be visualized automatically after fitting the model. Here are the PCA, UMAP, and t-SNE visualizations for the Dermatology dataset.\n\n\n## Code Example 2: Grid Search for Optimal Parameter Configuration.\nThe script below demonstrates how to tune the parameters to obtain the highest Calinski-Harabasz score.  \n\n```python\n\nimport os\nimport numpy as np\nimport csv\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.metrics import calinski_harabasz_score\nfrom itertools import product\nfrom time import time\nfrom core import CPFcluster  \nfrom tqdm import tqdm\n\ndef main():\n    # Define datasets and parameter ranges\n    datasets = [\"seeds\", \"glass\", \"vertebral\", \"ecoli\", \"dermatology\"]\n\n    # Parameter ranges for grid search\n    min_samples_range = [3, 5, 10]\n    rho_range = [0.1, 0.4, 0.7]\n    alpha_range = [0.5, 1, 1.5]\n    cutoff_range = [1, 2, 3]\n    merge_threshold_range = [0.3, 0.5, 0.7]\n    density_ratio_threshold_range = [0.05, 0.1, 0.2]\n\n    # Directory for saving results\n    results_dir = \"Results\"\n    if not os.path.exists(results_dir):\n        os.makedirs(results_dir)\n\n    results_path = os.path.join(results_dir, \"CPFcluster_grid_search_results.csv\")\n    \n    # Write header to results CSV\n    with open(results_path, 'w', newline='') as fd:\n        writer = csv.writer(fd)\n        writer.writerow([\"Dataset\", \"min_samples\", \"rho\", \"alpha\", \"cutoff\", \n                         \"merge_threshold\", \"density_ratio_threshold\", \n                         \"CH_Index\", \"Time\"])\n\n    # Iterate through datasets\n    for dataset in datasets:\n        # Load dataset\n        Data = np.load(f\"Data/{dataset}.npy\")  \n        X = Data[:, :-1]\n        y = Data[:, -1]\n        \n        # Normalize dataset for easier parameter selection\n        X = StandardScaler().fit_transform(X)\n\n        # Parameter grid\n        param_grid = product(min_samples_range, rho_range, alpha_range, \n                             cutoff_range, merge_threshold_range, \n                             density_ratio_threshold_range)\n\n        for params in tqdm(param_grid, desc=f\"Processing {dataset}\"):\n            min_samples, rho, alpha, cutoff, merge_threshold, density_ratio_threshold = params\n            start_time = time()\n\n            # Initialize CPFcluster with current parameters\n            model = CPFcluster(\n                min_samples=min_samples,\n                rho=rho,\n                alpha=alpha,\n                cutoff=cutoff,\n                merge=True,\n                merge_threshold=merge_threshold,\n                density_ratio_threshold=density_ratio_threshold,\n                n_jobs=-1\n            )\n            \n            # Fit model\n            model.fit(X)\n            labels = model.labels\n\n            # Skip iteration if all points are assigned to one cluster\n            if len(np.unique(labels)) < 2:\n                continue\n\n            # Compute the adjusted Calinski-Harabasz index\n            ch_index = calinski_harabasz_score(X, labels)\n            elapsed_time = time() - start_time\n\n            # Write results to CSV\n            with open(results_path, 'a', newline='') as fd:\n                writer = csv.writer(fd)\n                writer.writerow([dataset, min_samples, rho, alpha, cutoff, \n                                 merge_threshold, density_ratio_threshold, \n                                 ch_index, elapsed_time])\n\nif __name__ == \"__main__\":\n    main()\n\n```\n\n\n---\n\n## Data\n\nThe data folder contains the following datasets for testing and benchmarking different clustering methods. The following datasets are from the UCI Machine Learning Repository:\n\n- **Dermatology**: The Dermatology dataset for classifying erythemato-squamous diseases.\n\n- **Ecoli**: This dataset predicts the localization sites of proteins within E. coli cells.\n\n- **Glass Identification**: This dataset is used for classifying glass types for forensic analysis.\n\n- **HTRU2**: HTRU2 distinguishes pulsar candidates from non-pulsar stars based on astronomical data.\n\n- **Letter Recognition**: The dataset is designed to classify uppercase English alphabet letters.\n\n- **MAGIC Gamma Telescope**: This dataset predicts whether detected particles are high-energy gamma rays.\n\n- **Optical Recognition of Handwritten Digits**: This dataset is used for classifying handwritten digit images.\n\n- **Page Blocks Classification**: Classify blocks of text in scanned documents into various categories.\n\n- **Pen-Based Recognition of Handwritten Digits**: The dataset is for digit classification based on pen movement data.\n\n- **Seeds**: This dataset classifies wheat seed varieties based on geometric properties.\n\n- **Vertebral Column**: The dataset is used to classify conditions of vertebral disks as normal or abnormal.\n\nThe following two datasets are from Kaggle: **Fraud Detection Bank** and **Paris Housing Classification**. The **Phoneme** dataset is from R (https://r-packages.io/datasets/phoneme).  \n\n\n---\n\n## Key Features\n\n\n1. **Outlier Detection**:  \n   The algorithm identifies and excludes small connected components as outliers based on the `cutoff` parameter.\n\n2. **Cluster Merging**:  \n   Merges similar clusters by evaluating proximity (`merge_threshold`) and density similarity (`density_ratio_threshold`).\n\n4. **Scalability**:  \n   Supports parallel processing with the `n_jobs` parameter, enabling efficient computation for large datasets.\n\n5. **Customizable Metrics**:  \n   Offers flexibility through the `distance_metric` parameter, which supports multiple distance measures (e.g., `'euclidean'`, `'manhattan'`, `'cosine'`).\n\n6. **Visualization Support**:  \n   Includes built-in options for t-SNE, PCA, and UMAP visualizations to enhance interpretability of clustering results.\n\n\n## Limitations\n\n1. **Parameter Sensitivity**:  \n   The results can be sensitive to the choice of `rho`, `alpha`, and `cutoff`, which require careful tuning for optimal results.\n\n2. **Computational Overhead**:  \n   Computing the nearest-neighbor graph for very large datasets can demand significant memory and processing resources.\n\n \n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "An Implementation of Component-wise Peak Finding Clustering Method",
    "version": "3.0",
    "project_urls": {
        "Download": "https://github.com/tobinjo96/CPFcluster"
    },
    "split_keywords": [
        "density-peak-clustering",
        " clustering",
        " machine-learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e305acf1decaabbcc4e43791ad66d3ec64ad787190d798ee0deb14ea741afa99",
                "md5": "2dc616cb165c3578d7c5730e26e074ca",
                "sha256": "b8735aa05d187d9df1e65722cb98da509627707283f2b7e12a54905d29de5bef"
            },
            "downloads": -1,
            "filename": "CPFcluster-3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2dc616cb165c3578d7c5730e26e074ca",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 12923,
            "upload_time": "2024-12-06T23:16:46",
            "upload_time_iso_8601": "2024-12-06T23:16:46.749645Z",
            "url": "https://files.pythonhosted.org/packages/e3/05/acf1decaabbcc4e43791ad66d3ec64ad787190d798ee0deb14ea741afa99/CPFcluster-3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1b2cca9423d939f6223a2415015b8ee844e7eb36fab78d6d6b2f270e39bb833e",
                "md5": "95ae72213d2e8df5b1f928c59004e8ce",
                "sha256": "d95a4d469b6c0bf50c9c23ff26d67e8fb88f0cc6b5d380bb7e670b725e8365a4"
            },
            "downloads": -1,
            "filename": "CPFcluster-3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "95ae72213d2e8df5b1f928c59004e8ce",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 12573,
            "upload_time": "2024-12-06T23:16:49",
            "upload_time_iso_8601": "2024-12-06T23:16:49.028171Z",
            "url": "https://files.pythonhosted.org/packages/1b/2c/ca9423d939f6223a2415015b8ee844e7eb36fab78d6d6b2f270e39bb833e/CPFcluster-3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-06 23:16:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "tobinjo96",
    "github_project": "CPFcluster",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "scipy",
            "specs": []
        },
        {
            "name": "scikit-learn",
            "specs": []
        },
        {
            "name": "matplotlib",
            "specs": []
        },
        {
            "name": "umap-learn",
            "specs": []
        }
    ],
    "lcname": "cpfcluster"
}

Joshua Tobin, Mimi Zhang