# CPFcluster: the Component-wise Peak-Finding algorithm
To install the CPFcluster package:
```sh
$ pip install CPFcluster
```
If you used this package in your research, please cite it:
```latex
@ARTICLE{10296014,
author={Tobin, Joshua and Zhang, Mimi},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm},
year={2024},
volume={46},
number={2},
pages={1109-1120},
doi={10.1109/TPAMI.2023.3327471}}
```
---
## Class: CPFcluster
**CPFcluster** is a scalable and flexible density-based clustering method that integrates the strengths of density-level set and mode-seeking approaches. This combination offers several advantages, including: (1) the ability to detect outliers, (2) effective identification of clusters with varying densities and overlapping regions, and (3) robustness against spurious density maxima. The `CPFcluster` class is designed to manage outliers, merge clusters seamlessly, and visualize clustering results using techniques like PCA, UMAP and t-SNE.
```python
CPFcluster(
min_samples=5, # minimum number of neighbors to consider for connectivity #
rho=0.4, # parameter that controls the number of clusters for each component set #
alpha=1, # parameter for edge-cutoff in cluster detection #
n_jobs=1, # number of parallel jobs for computation #
cutoff=1, # threshold for filtering out small connected components as outliers #
merge=False, # whether to merge similar clusters based on thresholds #
merge_threshold=0.5, # distance threshold for merging clusters #
density_ratio_threshold=0.1, # density ratio threshold for merging clusters #
distance_metric='euclidean', # metric for distance computation (e.g., 'euclidean', 'manhattan', 'cosine') #
remove_duplicates=False, # whether to remove duplicate data points before clustering #
plot_umap=False, # whether to plot UMAP visualization after clustering #
plot_pca=False, # whether to plot PCA visualization after clustering #
plot_tsne=False # whether to plot t-SNE visualization after clustering #
)
```
### Parameters
- **`min_samples`** *(int)*:
Number of nearest-neighbors used to create connected components from the dataset and compute the density. This parameter is used in the `build_CCgraph` function to construct the k-NN graph and extract the component sets.
*Default*: `5`
- **`rho`** *(float)*:
The `rho` parameter in Definition 10 of the paper "A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm". Varying the parameter `rho` determines the number of clusters for each component set.
*Default*: `0.4`
- **`alpha`** *(float)*:
An optional parameter used to set the threshold for edge weights during center selection, not discussed in the paper.
*Default*: `1`
- **`n_jobs`** *(int)*:
Number of parallel jobs for computation. Specify `n_jobs=-1` (and include the `__name__ == "__main__":` line in your script) to use all cores.
*Default*: `1`
- **`cutoff`** *(int)*:
In the mutual k-NN graph, vertices with a number of edges less than or equal to the specified `cutoff` value are identified as outliers.
*Default*: `1`
- **`merge`** *(bool)*:
Specifies whether to merge clusters that are similar based on distance and density-ratio thresholds. Two clusters will be merged only if the distance between their centroids is less than the `merge_threshold` AND the density ratio exceeds the `density_ratio_threshold`.
*Default*: `False`
- **`merge_threshold`** *(float)*:
The distance threshold that determines whether two clusters should be merged. Clusters will be merged if the distance between their centroids is less than the `merge_threshold`. This parameter helps to combine clusters that are close in the feature space, potentially reducing over-segmentation. A range of 0.1–1.0 works well across diverse datasets (after standardization).
*Default*: `0.5`
- **`density_ratio_threshold`** *(float)*:
The density ratio threshold that determines whether two clusters should be merged. Clusters are merged if the ratio of densities between two clusters (lower density/higher density) exceeds the `density_ratio_threshold`, ensuring that only clusters with comparable densities are merged. A range of 0.1–0.5 is observed to work well across various datasets (after standardization).
*Default*: `0.1`
- **`distance_metric`** *(str)*:
Metric to use for distance computation. Options include:
- `'euclidean'`: Euclidean distance (default).
- `'manhattan'`: sum of absolute differences.
- `'cosine'`: cosine similarity-based distance.
- `'chebyshev'`: maximum difference along any dimension.
- `'minkowski'`: generalized distance metric requiring a parameter \(p\) (e.g., \(p=1\) for Manhattan, \(p=2\) for Euclidean).
- `'hamming'`: fraction of differing attributes between samples (useful for binary data).
- `'jaccard'`: used for binary attributes to measure similarity based on set intersection and union.
- **`remove_duplicates`** *(bool)*:
Whether to remove duplicate data points before clustering.
*Default*: `False`
- **`plot_clusters_umap`** *(bool)*:
Whether to visualize the clusters via UMAP.
*Default*: `False`
- **`plot_clusters_pca`** *(bool)*:
Whether to visualize the clusters via PCA.
*Default*: `False`
- **`plot_clusters_tsne`** *(bool)*:
Whether to visualize the clusters via t-SNE.
*Default*: `False`
### Methods
- `fit(X)`: Apply the CPF method to the input data X. <br>
- `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
- **Returns**:
- None. Update the instance attributes with identified cluster labels. Outliers are labeled as `-1`.
- `calculate_centroids_and_densities(X, labels)`: Calculates the centroid and average density of each cluster.<br>
- `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
- `labels` *(np.ndarray)*: cluster labels of the input data.
- **Returns**:
- `centroids` *(np.ndarray)*: centroids of the clusters.
- `densities` *(np.ndarray)*: average density of each cluster.
- `merge_clusters(X, centroids, densities, labels)`: Merges similar clusters based on the distance and density-ratio thresholds.<br>
- `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
- `centroids` *(np.ndarray)*: centroids of the clusters.
- `densities` *(np.ndarray)*: average density of each cluster.
- `labels` *(np.ndarray)*: cluster label for each sample.
- **Returns**:
- `labels` *(np.ndarray)*: updated cluster labels after merging.
---
## Helper Functions
### `build_CCgraph(X, min_samples, cutoff, n_jobs, distance_metric='euclidean')`
Construct the k-NN graph and extract the connected components.
- **Parameters**:
- **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
- ...
- **Returns**:
- **`components`** *(np.ndarray)*: connected component for each sample. If a sample is an outlier, the value will be NaN.
- **`CCmat`** *(scipy.sparse.csr_matrix)*: an n-by-n sparse matrix representation of the k-NN graph.
- **`knn_radius`** *(np.ndarray)*: distance to the min_samples-th (i.e., k-th) neighbor for each sample.
### `get_density_dists_bb(X, k, components, knn_radius, n_jobs)`
Identify the "big brother" (nearest neighbor of higher density) for each sample.
- **Parameters**:
- **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.
- **`k`** *(int)*: identical to `min_samples`, the neighborhood parameter \(k\) in the mutual k-NN graph.
- ...
- **Returns**:
- **`best_distance`** *(np.ndarray)*: best distance for each sample.
- **`big_brother`** *(np.ndarray)*: index of the "big brother" for each sample.
### `get_y(CCmat, components, knn_radius, best_distance, big_brother, rho, alpha, d)`
Assigns cluster labels to data points based on density and connectivity properties.
- **Parameters**:
- ...
- **`d`** *(int)*: dimension of the input data `X`.
- **Returns**:
- **`y_pred`** *(np.ndarray)*: identified cluster labels for the input data.
---
## Code Example 1: Clustering with the Dermatology Dataset
The script below demonstrates how to use CPFcluster with the Dermatology dataset, available in the Data folder.
```python
import os
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
from core import CPFcluster
# Define the main function to utilize Python's multiprocessing unit (in Windows OS).
def main():
# Step 1: Load the Data
# Load the Dermatology dataset
Data = np.load("Data/dermatology.npy")
X = Data[:,:-1] # Feature data
y = Data[:,-1] # True labels (used here for evaluation, not clustering)
# Normalize dataset for easier hyperparameter tuning
X = StandardScaler().fit_transform(X)
# Step 2: Initialize CPFcluster
cpf = CPFcluster(
min_samples=10,
rho=0.5,
alpha=1.0,
merge=True,
merge_threshold=0.6,
n_jobs=-1,
plot_tsne=True,
plot_pca=True,
plot_umap=True
)
# Step 3: Fit the Model
cpf.fit(X)
# access the cluster labels
print("Cluster labels:", cpf.labels)
# Step 4: Calculate Cluster Validity Indices
ari = adjusted_rand_score(y, cpf.labels)
print(f"Adjusted Rand Index (ARI): {ari:.2f}")
if __name__ == "__main__":
main()
```
### Visualize Results
If `plot_tsne`, `plot_pca`, or `plot_umap` was set to `True`, clustering results will be visualized automatically after fitting the model. Here are the PCA, UMAP, and t-SNE visualizations for the Dermatology dataset.
## Code Example 2: Grid Search for Optimal Parameter Configuration.
The script below demonstrates how to tune the parameters to obtain the highest Calinski-Harabasz score.
```python
import os
import numpy as np
import csv
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import calinski_harabasz_score
from itertools import product
from time import time
from core import CPFcluster
from tqdm import tqdm
def main():
# Define datasets and parameter ranges
datasets = ["seeds", "glass", "vertebral", "ecoli", "dermatology"]
# Parameter ranges for grid search
min_samples_range = [3, 5, 10]
rho_range = [0.1, 0.4, 0.7]
alpha_range = [0.5, 1, 1.5]
cutoff_range = [1, 2, 3]
merge_threshold_range = [0.3, 0.5, 0.7]
density_ratio_threshold_range = [0.05, 0.1, 0.2]
# Directory for saving results
results_dir = "Results"
if not os.path.exists(results_dir):
os.makedirs(results_dir)
results_path = os.path.join(results_dir, "CPFcluster_grid_search_results.csv")
# Write header to results CSV
with open(results_path, 'w', newline='') as fd:
writer = csv.writer(fd)
writer.writerow(["Dataset", "min_samples", "rho", "alpha", "cutoff",
"merge_threshold", "density_ratio_threshold",
"CH_Index", "Time"])
# Iterate through datasets
for dataset in datasets:
# Load dataset
Data = np.load(f"Data/{dataset}.npy")
X = Data[:, :-1]
y = Data[:, -1]
# Normalize dataset for easier parameter selection
X = StandardScaler().fit_transform(X)
# Parameter grid
param_grid = product(min_samples_range, rho_range, alpha_range,
cutoff_range, merge_threshold_range,
density_ratio_threshold_range)
for params in tqdm(param_grid, desc=f"Processing {dataset}"):
min_samples, rho, alpha, cutoff, merge_threshold, density_ratio_threshold = params
start_time = time()
# Initialize CPFcluster with current parameters
model = CPFcluster(
min_samples=min_samples,
rho=rho,
alpha=alpha,
cutoff=cutoff,
merge=True,
merge_threshold=merge_threshold,
density_ratio_threshold=density_ratio_threshold,
n_jobs=-1
)
# Fit model
model.fit(X)
labels = model.labels
# Skip iteration if all points are assigned to one cluster
if len(np.unique(labels)) < 2:
continue
# Compute the adjusted Calinski-Harabasz index
ch_index = calinski_harabasz_score(X, labels)
elapsed_time = time() - start_time
# Write results to CSV
with open(results_path, 'a', newline='') as fd:
writer = csv.writer(fd)
writer.writerow([dataset, min_samples, rho, alpha, cutoff,
merge_threshold, density_ratio_threshold,
ch_index, elapsed_time])
if __name__ == "__main__":
main()
```
---
## Data
The data folder contains the following datasets for testing and benchmarking different clustering methods. The following datasets are from the UCI Machine Learning Repository:
- **Dermatology**: The Dermatology dataset for classifying erythemato-squamous diseases.
- **Ecoli**: This dataset predicts the localization sites of proteins within E. coli cells.
- **Glass Identification**: This dataset is used for classifying glass types for forensic analysis.
- **HTRU2**: HTRU2 distinguishes pulsar candidates from non-pulsar stars based on astronomical data.
- **Letter Recognition**: The dataset is designed to classify uppercase English alphabet letters.
- **MAGIC Gamma Telescope**: This dataset predicts whether detected particles are high-energy gamma rays.
- **Optical Recognition of Handwritten Digits**: This dataset is used for classifying handwritten digit images.
- **Page Blocks Classification**: Classify blocks of text in scanned documents into various categories.
- **Pen-Based Recognition of Handwritten Digits**: The dataset is for digit classification based on pen movement data.
- **Seeds**: This dataset classifies wheat seed varieties based on geometric properties.
- **Vertebral Column**: The dataset is used to classify conditions of vertebral disks as normal or abnormal.
The following two datasets are from Kaggle: **Fraud Detection Bank** and **Paris Housing Classification**. The **Phoneme** dataset is from R (https://r-packages.io/datasets/phoneme).
---
## Key Features
1. **Outlier Detection**:
The algorithm identifies and excludes small connected components as outliers based on the `cutoff` parameter.
2. **Cluster Merging**:
Merges similar clusters by evaluating proximity (`merge_threshold`) and density similarity (`density_ratio_threshold`).
4. **Scalability**:
Supports parallel processing with the `n_jobs` parameter, enabling efficient computation for large datasets.
5. **Customizable Metrics**:
Offers flexibility through the `distance_metric` parameter, which supports multiple distance measures (e.g., `'euclidean'`, `'manhattan'`, `'cosine'`).
6. **Visualization Support**:
Includes built-in options for t-SNE, PCA, and UMAP visualizations to enhance interpretability of clustering results.
## Limitations
1. **Parameter Sensitivity**:
The results can be sensitive to the choice of `rho`, `alpha`, and `cutoff`, which require careful tuning for optimal results.
2. **Computational Overhead**:
Computing the nearest-neighbor graph for very large datasets can demand significant memory and processing resources.
Raw data
{
"_id": null,
"home_page": null,
"name": "CPFcluster",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "density-peak-clustering, clustering, machine-learning",
"author": "Joshua Tobin, Mimi Zhang",
"author_email": "tobinjo@tcd.ie",
"download_url": "https://files.pythonhosted.org/packages/1b/2c/ca9423d939f6223a2415015b8ee844e7eb36fab78d6d6b2f270e39bb833e/CPFcluster-3.0.tar.gz",
"platform": null,
"description": "# CPFcluster: the Component-wise Peak-Finding algorithm\n\nTo install the CPFcluster package: \n\n```sh\n$ pip install CPFcluster\n```\n\n\nIf you used this package in your research, please cite it:\n```latex\n@ARTICLE{10296014,\n author={Tobin, Joshua and Zhang, Mimi},\n journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, \n title={A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm}, \n year={2024},\n volume={46},\n number={2},\n pages={1109-1120},\n doi={10.1109/TPAMI.2023.3327471}}\n```\n\n\n---\n\n\n## Class: CPFcluster\n\n**CPFcluster** is a scalable and flexible density-based clustering method that integrates the strengths of density-level set and mode-seeking approaches. This combination offers several advantages, including: (1) the ability to detect outliers, (2) effective identification of clusters with varying densities and overlapping regions, and (3) robustness against spurious density maxima. The `CPFcluster` class is designed to manage outliers, merge clusters seamlessly, and visualize clustering results using techniques like PCA, UMAP and t-SNE.\n```python\nCPFcluster(\n min_samples=5, # minimum number of neighbors to consider for connectivity #\n rho=0.4, # parameter that controls the number of clusters for each component set #\n alpha=1, # parameter for edge-cutoff in cluster detection #\n n_jobs=1, # number of parallel jobs for computation #\n cutoff=1, # threshold for filtering out small connected components as outliers #\n merge=False, # whether to merge similar clusters based on thresholds #\n merge_threshold=0.5, # distance threshold for merging clusters #\n density_ratio_threshold=0.1, # density ratio threshold for merging clusters #\n distance_metric='euclidean', # metric for distance computation (e.g., 'euclidean', 'manhattan', 'cosine') #\n remove_duplicates=False, # whether to remove duplicate data points before clustering #\n plot_umap=False, # whether to plot UMAP visualization after clustering #\n plot_pca=False, # whether to plot PCA visualization after clustering #\n plot_tsne=False # whether to plot t-SNE visualization after clustering #\n)\n```\n\n### Parameters\n\n- **`min_samples`** *(int)*: \nNumber of nearest-neighbors used to create connected components from the dataset and compute the density. This parameter is used in the `build_CCgraph` function to construct the k-NN graph and extract the component sets. \n *Default*: `5`\n\n- **`rho`** *(float)*: \n The `rho` parameter in Definition 10 of the paper \"A Theoretical Analysis of Density Peaks Clustering and the Component-Wise Peak-Finding Algorithm\". Varying the parameter `rho` determines the number of clusters for each component set. \n *Default*: `0.4`\n\n- **`alpha`** *(float)*: \n An optional parameter used to set the threshold for edge weights during center selection, not discussed in the paper. \n *Default*: `1`\n\n- **`n_jobs`** *(int)*: \n Number of parallel jobs for computation. Specify `n_jobs=-1` (and include the `__name__ == \"__main__\":` line in your script) to use all cores. \n *Default*: `1`\n \n- **`cutoff`** *(int)*: \n In the mutual k-NN graph, vertices with a number of edges less than or equal to the specified `cutoff` value are identified as outliers. \n *Default*: `1` \n \n- **`merge`** *(bool)*: \n Specifies whether to merge clusters that are similar based on distance and density-ratio thresholds. Two clusters will be merged only if the distance between their centroids is less than the `merge_threshold` AND the density ratio exceeds the `density_ratio_threshold`. \n *Default*: `False` \n \n- **`merge_threshold`** *(float)*: \n The distance threshold that determines whether two clusters should be merged. Clusters will be merged if the distance between their centroids is less than the `merge_threshold`. This parameter helps to combine clusters that are close in the feature space, potentially reducing over-segmentation. A range of 0.1\u20131.0 works well across diverse datasets (after standardization). \n *Default*: `0.5` \n\n- **`density_ratio_threshold`** *(float)*: \n The density ratio threshold that determines whether two clusters should be merged. Clusters are merged if the ratio of densities between two clusters (lower density/higher density) exceeds the `density_ratio_threshold`, ensuring that only clusters with comparable densities are merged. A range of 0.1\u20130.5 is observed to work well across various datasets (after standardization). \n *Default*: `0.1` \n \n- **`distance_metric`** *(str)*: \n Metric to use for distance computation. Options include: \n - `'euclidean'`: Euclidean distance (default). \n - `'manhattan'`: sum of absolute differences. \n - `'cosine'`: cosine similarity-based distance. \n - `'chebyshev'`: maximum difference along any dimension. \n - `'minkowski'`: generalized distance metric requiring a parameter \\(p\\) (e.g., \\(p=1\\) for Manhattan, \\(p=2\\) for Euclidean). \n - `'hamming'`: fraction of differing attributes between samples (useful for binary data). \n - `'jaccard'`: used for binary attributes to measure similarity based on set intersection and union. \n\n- **`remove_duplicates`** *(bool)*: \n Whether to remove duplicate data points before clustering. \n *Default*: `False`\n\n- **`plot_clusters_umap`** *(bool)*: \n Whether to visualize the clusters via UMAP. \n *Default*: `False`\n\n- **`plot_clusters_pca`** *(bool)*: \n Whether to visualize the clusters via PCA. \n *Default*: `False`\n \n- **`plot_clusters_tsne`** *(bool)*: \n Whether to visualize the clusters via t-SNE. \n *Default*: `False`\n\n\n### Methods\n\n- `fit(X)`: Apply the CPF method to the input data X. <br> \n - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n - **Returns**:\n - None. Update the instance attributes with identified cluster labels. Outliers are labeled as `-1`.\n\n- `calculate_centroids_and_densities(X, labels)`: Calculates the centroid and average density of each cluster.<br> \n - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n - `labels` *(np.ndarray)*: cluster labels of the input data.\n - **Returns**:\n - `centroids` *(np.ndarray)*: centroids of the clusters.\n - `densities` *(np.ndarray)*: average density of each cluster.\n\n- `merge_clusters(X, centroids, densities, labels)`: Merges similar clusters based on the distance and density-ratio thresholds.<br> \n - `X` *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n - `centroids` *(np.ndarray)*: centroids of the clusters.\n - `densities` *(np.ndarray)*: average density of each cluster.\n - `labels` *(np.ndarray)*: cluster label for each sample. \n - **Returns**: \n - `labels` *(np.ndarray)*: updated cluster labels after merging.\n \n\n---\n\n## Helper Functions\n\n### `build_CCgraph(X, min_samples, cutoff, n_jobs, distance_metric='euclidean')`\nConstruct the k-NN graph and extract the connected components.\n\n- **Parameters**:\n - **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n - ...\n\n- **Returns**:\n - **`components`** *(np.ndarray)*: connected component for each sample. If a sample is an outlier, the value will be NaN.\n - **`CCmat`** *(scipy.sparse.csr_matrix)*: an n-by-n sparse matrix representation of the k-NN graph.\n - **`knn_radius`** *(np.ndarray)*: distance to the min_samples-th (i.e., k-th) neighbor for each sample.\n\n\n### `get_density_dists_bb(X, k, components, knn_radius, n_jobs)`\nIdentify the \"big brother\" (nearest neighbor of higher density) for each sample.\n\n- **Parameters**:\n - **`X`** *(np.ndarray)*: input data of shape `(n_samples, n_features)`.\n - **`k`** *(int)*: identical to `min_samples`, the neighborhood parameter \\(k\\) in the mutual k-NN graph.\n - ...\n\n- **Returns**:\n - **`best_distance`** *(np.ndarray)*: best distance for each sample.\n - **`big_brother`** *(np.ndarray)*: index of the \"big brother\" for each sample.\n\n\n### `get_y(CCmat, components, knn_radius, best_distance, big_brother, rho, alpha, d)`\nAssigns cluster labels to data points based on density and connectivity properties.\n\n- **Parameters**:\n - ...\n - **`d`** *(int)*: dimension of the input data `X`.\n\n- **Returns**:\n - **`y_pred`** *(np.ndarray)*: identified cluster labels for the input data.\n\n---\n\n## Code Example 1: Clustering with the Dermatology Dataset\nThe script below demonstrates how to use CPFcluster with the Dermatology dataset, available in the Data folder. \n\n```python\n\nimport os\nimport numpy as np\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.metrics import adjusted_rand_score\nfrom core import CPFcluster\n\n\n# Define the main function to utilize Python's multiprocessing unit (in Windows OS).\ndef main():\n # Step 1: Load the Data\n \n # Load the Dermatology dataset\n Data = np.load(\"Data/dermatology.npy\")\n X = Data[:,:-1] # Feature data\n y = Data[:,-1] # True labels (used here for evaluation, not clustering)\n\n # Normalize dataset for easier hyperparameter tuning\n X = StandardScaler().fit_transform(X)\n \n \n # Step 2: Initialize CPFcluster\n cpf = CPFcluster(\n min_samples=10,\n rho=0.5,\n alpha=1.0,\n merge=True,\n merge_threshold=0.6,\n n_jobs=-1,\n plot_tsne=True,\n plot_pca=True,\n plot_umap=True\n )\n \n \n # Step 3: Fit the Model\n cpf.fit(X)\n\n # access the cluster labels\n print(\"Cluster labels:\", cpf.labels)\n \n \n # Step 4: Calculate Cluster Validity Indices\n ari = adjusted_rand_score(y, cpf.labels)\n print(f\"Adjusted Rand Index (ARI): {ari:.2f}\")\n\n\nif __name__ == \"__main__\":\n main()\n\n\n```\n\n\n### Visualize Results\n\nIf `plot_tsne`, `plot_pca`, or `plot_umap` was set to `True`, clustering results will be visualized automatically after fitting the model. Here are the PCA, UMAP, and t-SNE visualizations for the Dermatology dataset.\n\n\n## Code Example 2: Grid Search for Optimal Parameter Configuration.\nThe script below demonstrates how to tune the parameters to obtain the highest Calinski-Harabasz score. \n\n```python\n\nimport os\nimport numpy as np\nimport csv\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.metrics import calinski_harabasz_score\nfrom itertools import product\nfrom time import time\nfrom core import CPFcluster \nfrom tqdm import tqdm\n\ndef main():\n # Define datasets and parameter ranges\n datasets = [\"seeds\", \"glass\", \"vertebral\", \"ecoli\", \"dermatology\"]\n\n # Parameter ranges for grid search\n min_samples_range = [3, 5, 10]\n rho_range = [0.1, 0.4, 0.7]\n alpha_range = [0.5, 1, 1.5]\n cutoff_range = [1, 2, 3]\n merge_threshold_range = [0.3, 0.5, 0.7]\n density_ratio_threshold_range = [0.05, 0.1, 0.2]\n\n # Directory for saving results\n results_dir = \"Results\"\n if not os.path.exists(results_dir):\n os.makedirs(results_dir)\n\n results_path = os.path.join(results_dir, \"CPFcluster_grid_search_results.csv\")\n \n # Write header to results CSV\n with open(results_path, 'w', newline='') as fd:\n writer = csv.writer(fd)\n writer.writerow([\"Dataset\", \"min_samples\", \"rho\", \"alpha\", \"cutoff\", \n \"merge_threshold\", \"density_ratio_threshold\", \n \"CH_Index\", \"Time\"])\n\n # Iterate through datasets\n for dataset in datasets:\n # Load dataset\n Data = np.load(f\"Data/{dataset}.npy\") \n X = Data[:, :-1]\n y = Data[:, -1]\n \n # Normalize dataset for easier parameter selection\n X = StandardScaler().fit_transform(X)\n\n # Parameter grid\n param_grid = product(min_samples_range, rho_range, alpha_range, \n cutoff_range, merge_threshold_range, \n density_ratio_threshold_range)\n\n for params in tqdm(param_grid, desc=f\"Processing {dataset}\"):\n min_samples, rho, alpha, cutoff, merge_threshold, density_ratio_threshold = params\n start_time = time()\n\n # Initialize CPFcluster with current parameters\n model = CPFcluster(\n min_samples=min_samples,\n rho=rho,\n alpha=alpha,\n cutoff=cutoff,\n merge=True,\n merge_threshold=merge_threshold,\n density_ratio_threshold=density_ratio_threshold,\n n_jobs=-1\n )\n \n # Fit model\n model.fit(X)\n labels = model.labels\n\n # Skip iteration if all points are assigned to one cluster\n if len(np.unique(labels)) < 2:\n continue\n\n # Compute the adjusted Calinski-Harabasz index\n ch_index = calinski_harabasz_score(X, labels)\n elapsed_time = time() - start_time\n\n # Write results to CSV\n with open(results_path, 'a', newline='') as fd:\n writer = csv.writer(fd)\n writer.writerow([dataset, min_samples, rho, alpha, cutoff, \n merge_threshold, density_ratio_threshold, \n ch_index, elapsed_time])\n\nif __name__ == \"__main__\":\n main()\n\n```\n\n\n---\n\n## Data\n\nThe data folder contains the following datasets for testing and benchmarking different clustering methods. The following datasets are from the UCI Machine Learning Repository:\n\n- **Dermatology**: The Dermatology dataset for classifying erythemato-squamous diseases.\n\n- **Ecoli**: This dataset predicts the localization sites of proteins within E. coli cells.\n\n- **Glass Identification**: This dataset is used for classifying glass types for forensic analysis.\n\n- **HTRU2**: HTRU2 distinguishes pulsar candidates from non-pulsar stars based on astronomical data.\n\n- **Letter Recognition**: The dataset is designed to classify uppercase English alphabet letters.\n\n- **MAGIC Gamma Telescope**: This dataset predicts whether detected particles are high-energy gamma rays.\n\n- **Optical Recognition of Handwritten Digits**: This dataset is used for classifying handwritten digit images.\n\n- **Page Blocks Classification**: Classify blocks of text in scanned documents into various categories.\n\n- **Pen-Based Recognition of Handwritten Digits**: The dataset is for digit classification based on pen movement data.\n\n- **Seeds**: This dataset classifies wheat seed varieties based on geometric properties.\n\n- **Vertebral Column**: The dataset is used to classify conditions of vertebral disks as normal or abnormal.\n\nThe following two datasets are from Kaggle: **Fraud Detection Bank** and **Paris Housing Classification**. The **Phoneme** dataset is from R (https://r-packages.io/datasets/phoneme). \n\n\n---\n\n## Key Features\n\n\n1. **Outlier Detection**: \n The algorithm identifies and excludes small connected components as outliers based on the `cutoff` parameter.\n\n2. **Cluster Merging**: \n Merges similar clusters by evaluating proximity (`merge_threshold`) and density similarity (`density_ratio_threshold`).\n\n4. **Scalability**: \n Supports parallel processing with the `n_jobs` parameter, enabling efficient computation for large datasets.\n\n5. **Customizable Metrics**: \n Offers flexibility through the `distance_metric` parameter, which supports multiple distance measures (e.g., `'euclidean'`, `'manhattan'`, `'cosine'`).\n\n6. **Visualization Support**: \n Includes built-in options for t-SNE, PCA, and UMAP visualizations to enhance interpretability of clustering results.\n\n\n## Limitations\n\n1. **Parameter Sensitivity**: \n The results can be sensitive to the choice of `rho`, `alpha`, and `cutoff`, which require careful tuning for optimal results.\n\n2. **Computational Overhead**: \n Computing the nearest-neighbor graph for very large datasets can demand significant memory and processing resources.\n\n \n",
"bugtrack_url": null,
"license": "MIT",
"summary": "An Implementation of Component-wise Peak Finding Clustering Method",
"version": "3.0",
"project_urls": {
"Download": "https://github.com/tobinjo96/CPFcluster"
},
"split_keywords": [
"density-peak-clustering",
" clustering",
" machine-learning"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e305acf1decaabbcc4e43791ad66d3ec64ad787190d798ee0deb14ea741afa99",
"md5": "2dc616cb165c3578d7c5730e26e074ca",
"sha256": "b8735aa05d187d9df1e65722cb98da509627707283f2b7e12a54905d29de5bef"
},
"downloads": -1,
"filename": "CPFcluster-3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2dc616cb165c3578d7c5730e26e074ca",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 12923,
"upload_time": "2024-12-06T23:16:46",
"upload_time_iso_8601": "2024-12-06T23:16:46.749645Z",
"url": "https://files.pythonhosted.org/packages/e3/05/acf1decaabbcc4e43791ad66d3ec64ad787190d798ee0deb14ea741afa99/CPFcluster-3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1b2cca9423d939f6223a2415015b8ee844e7eb36fab78d6d6b2f270e39bb833e",
"md5": "95ae72213d2e8df5b1f928c59004e8ce",
"sha256": "d95a4d469b6c0bf50c9c23ff26d67e8fb88f0cc6b5d380bb7e670b725e8365a4"
},
"downloads": -1,
"filename": "CPFcluster-3.0.tar.gz",
"has_sig": false,
"md5_digest": "95ae72213d2e8df5b1f928c59004e8ce",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 12573,
"upload_time": "2024-12-06T23:16:49",
"upload_time_iso_8601": "2024-12-06T23:16:49.028171Z",
"url": "https://files.pythonhosted.org/packages/1b/2c/ca9423d939f6223a2415015b8ee844e7eb36fab78d6d6b2f270e39bb833e/CPFcluster-3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-06 23:16:49",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tobinjo96",
"github_project": "CPFcluster",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "numpy",
"specs": []
},
{
"name": "scipy",
"specs": []
},
{
"name": "scikit-learn",
"specs": []
},
{
"name": "matplotlib",
"specs": []
},
{
"name": "umap-learn",
"specs": []
}
],
"lcname": "cpfcluster"
}