# PyTorch KMeans
## Introduction
`pt_kmeans` is a pure PyTorch implementation of the popular K-Means clustering algorithm, designed for seamless integration into PyTorch-based machine learning pipelines. It offers high performance on both CPU and GPU (CUDA), along with advanced features like K-Means++ initialization, hierarchical clustering, and cluster splitting, all while maintaining full PyTorch tensor compatibility.
Unlike K-Means implementations that require data transfers to NumPy or other libraries, `pt_kmeans` keeps your data on the PyTorch device (CPU or GPU) throughout the entire process, minimizing overhead and maximizing efficiency for large-scale datasets.
## Features
- **Pure PyTorch**: No external dependencies beyond PyTorch itself. All computations are performed using PyTorch tensors, making it ideal for integration with deep learning workflows.
- **Self-Contained & Portable**: The entire implementation resides in a single file, allowing for easy integration by simply copying the file into your project or an existing module.
- **CPU & GPU Support**: Runs on CPU and on CUDA GPUs. Computation happens on whichever device holds the input tensor.
- **K-Means++ Initialization**: Intelligent seeding of initial centroids for faster convergence and better clustering results.
- **L2 and Cosine Distance**: Supports the standard Euclidean (L2) distance and Cosine distance for various data types and applications (e.g., embeddings).
- **Chunked Processing**: Efficiently handles very large datasets by processing assignments in memory-friendly chunks, preventing Out-Of-Memory (OOM) errors.
- **Reproducibility**: Full control over randomness via `random_seed` for consistent results.
- **Hierarchical K-Means**: Implements a bottom-up hierarchical clustering approach, useful for creating multi-level cluster structures.
- **Cluster Splitting**: Provides a utility to refine existing clusters by splitting a single cluster into multiple sub-clusters.
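
For readers unfamiliar with K-Means++ seeding, the idea can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the code `pt_kmeans` runs internally: the function name `kmeans_pp_init` and its parameters are hypothetical, and only squared L2 distance is handled.

```python
import torch


def kmeans_pp_init(x: torch.Tensor, n_clusters: int, seed: int = 0) -> torch.Tensor:
    """Hypothetical helper: pick n_clusters rows of x as initial centers."""
    gen = torch.Generator(device="cpu").manual_seed(seed)
    n = x.shape[0]

    # First center: chosen uniformly at random.
    first = int(torch.randint(n, (1,), generator=gen).item())
    centers = [x[first]]

    # Squared L2 distance from every point to its nearest chosen center.
    min_sq_dist = ((x - centers[0]) ** 2).sum(dim=1)

    for _ in range(1, n_clusters):
        # Sample the next center with probability proportional to that distance,
        # so points far from all current centers are favored.
        idx = int(torch.multinomial(min_sq_dist.cpu(), 1, generator=gen).item())
        centers.append(x[idx])
        new_sq_dist = ((x - centers[-1]) ** 2).sum(dim=1)
        min_sq_dist = torch.minimum(min_sq_dist, new_sq_dist)

    return torch.stack(centers)


torch.manual_seed(0)
points = torch.cat([torch.randn(100, 2), torch.randn(100, 2) + 10.0])
init_centers = kmeans_pp_init(points, n_clusters=3)
print(init_centers.shape)  # torch.Size([3, 2])
```

Points that were already chosen have distance zero and therefore zero sampling weight, so they are not picked twice. The sketch does not guard against the degenerate case where every remaining point coincides with a chosen center.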
## Installation
`pt_kmeans` requires Python 3.11 or later and PyTorch (`torch>=2.4.0` recommended).
First, ensure you have PyTorch installed (refer to the [official PyTorch website](https://pytorch.org/get-started/locally/) for installation instructions specific to your system and CUDA version).
Then, install `pt_kmeans` directly from PyPI:
```bash
pip install pt_kmeans
```
## Quick Start & Usage Examples
Here's how to get started with `pt_kmeans`.
```python
import torch
import matplotlib.pyplot as plt  # Optional: only needed for the visualization below
from pt_kmeans import hierarchical_kmeans
from pt_kmeans import kmeans
from pt_kmeans import split_cluster
```
### Basic K-Means Clustering
```python
# 1. Generate some synthetic data for demonstration
# Three distinct clusters
data = torch.cat([
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 0.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([5.0, 5.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 5.0]),
])
# Move data to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data = data.to(device)
n_clusters = 3
random_seed = 0
# 2. Run K-Means
print(f"Running K-Means on {device}...")
(centers, labels) = kmeans(
    data,
    n_clusters=n_clusters,
    max_iters=100,
    tol=1e-4,
    distance_metric="l2",  # or "cosine"
    init_method="kmeans++",  # or "random"
    chunk_size=None,  # Process all at once
    random_seed=random_seed,
)
print("\nK-Means Results:")
print(f"Final Centers Shape: {centers.shape}")
print(f"First 5 Labels: {labels[:5]}")
print(f"Unique Labels: {torch.unique(labels)}")
# 3. (Optional) Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0].cpu(), data[:, 1].cpu(), c=labels.cpu(), cmap="viridis", s=10, alpha=0.7)
plt.scatter(centers[:, 0].cpu(), centers[:, 1].cpu(), c="red", marker="X", s=200, label="Centers")
plt.title("K-Means Clustering Result")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()
```
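A quick sanity check on a clustering result is to look at cluster sizes and the total within-cluster squared distance (often called inertia). The sketch below assumes `labels` is a 1-D integer tensor of cluster indices and `centers` has one row per cluster; `cluster_summary` is a hypothetical helper, not part of `pt_kmeans`, and the tiny hand-written tensors stand in for real output.

```python
import torch


def cluster_summary(x: torch.Tensor, centers: torch.Tensor, labels: torch.Tensor):
    """Hypothetical helper: return (points per cluster, total squared L2 distance)."""
    sizes = torch.bincount(labels, minlength=centers.shape[0])
    # centers[labels] gathers, for each point, the center it is assigned to.
    sq_dist = ((x - centers[labels]) ** 2).sum(dim=1)
    return sizes, sq_dist.sum()


# Tiny hand-made example standing in for the output of a clustering run.
x = torch.tensor([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = torch.tensor([[0.0, 1.0], [10.0, 0.0]])
labels = torch.tensor([0, 0, 1])

sizes, inertia = cluster_summary(x, centers, labels)
print(sizes.tolist())  # [2, 1]
print(inertia.item())  # 2.0
```

An empty or very small cluster in `sizes` is usually a sign that `n_clusters` is too high for the data or that the initialization was unlucky.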
### Hierarchical K-Means
Build a multi-level clustering structure.
```python
# Use the 'data' generated in the previous example
n_clusters_levels = [15, 5, 3] # Define number of clusters for each level
print(f"Running Hierarchical K-Means on {device}...")
results = hierarchical_kmeans(
    data,
    n_clusters=n_clusters_levels,
    max_iters=100,
    tol=1e-4,
    distance_metric="l2",
    init_method="kmeans++",
    random_seed=random_seed,
)
print("\nHierarchical K-Means Results:")
for i, level_result in enumerate(results):
    print(f"Level {i} (n_clusters={n_clusters_levels[i]}):")
    print(f" Centers Shape: {level_result['centers'].shape}")
    print(f" Assignment Shape (original data): {level_result['assignment'].shape}")
    print(f" Unique Assignments: {torch.unique(level_result['assignment'])}")
```
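In a bottom-up scheme, each coarser level groups the clusters of the level beneath it, so a per-point assignment at a coarse level can be obtained by composing two lookups. The sketch below shows only that indexing step with hand-written tensors; the variable names are hypothetical, and it is an assumption that `hierarchical_kmeans` builds its per-level assignments in a comparable way.

```python
import torch

# Level 0: six points assigned to four fine clusters.
fine_labels = torch.tensor([0, 1, 2, 3, 0, 1])

# Level 1: the four fine clusters grouped into two coarse clusters
# (entry i is the coarse cluster that fine cluster i belongs to).
coarse_of_fine = torch.tensor([0, 0, 1, 1])

# Composing the two maps gives a coarse label for every original point.
coarse_labels = coarse_of_fine[fine_labels]
print(coarse_labels.tolist())  # [0, 0, 1, 1, 0, 0]
```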
### Splitting an Existing Cluster
Refine a specific cluster by breaking it down into sub-clusters.
```python
# First, run a basic K-Means to get initial labels and centers
(initial_centers, initial_labels) = kmeans(
    data, n_clusters=3, random_seed=random_seed, show_progress=False
)
cluster_to_split_id = 0 # Choose a cluster to split
num_sub_clusters = 2
print(f"Splitting Cluster {cluster_to_split_id} into {num_sub_clusters} sub-clusters on {device}...")
(new_sub_centers, updated_labels) = split_cluster(
    data,
    initial_labels,
    cluster_id=cluster_to_split_id,
    n_clusters=num_sub_clusters,
    max_iters=50,
    distance_metric="l2",
    random_seed=random_seed + 1,
)
print("\nCluster Splitting Results:")
print(f"New Sub-Centers Shape: {new_sub_centers.shape}")
print(f"Updated Labels Shape: {updated_labels.shape}")
print(f"Unique Labels in updated set: {torch.unique(updated_labels).tolist()}")
# Compare the label sets before and after the split to see which IDs were reused and which are new
print(f"Original unique labels: {torch.unique(initial_labels).tolist()}")
print(f"Updated unique labels: {torch.unique(updated_labels).tolist()}")
```
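To make the relabeling step concrete, here is one possible convention for merging sub-cluster labels back into the full label tensor: sub-cluster 0 keeps the original ID and the remaining sub-clusters receive fresh IDs after the current maximum. This is an assumption chosen for illustration, and `split_cluster` may number the new clusters differently; `relabel_after_split` is a hypothetical helper.

```python
import torch


def relabel_after_split(
    labels: torch.Tensor, cluster_id: int, sub_labels: torch.Tensor
) -> torch.Tensor:
    """Hypothetical helper: merge sub-cluster labels into the full label tensor.

    sub_labels holds one entry (0..k-1) per point currently in cluster_id.
    """
    mask = labels == cluster_id
    next_id = int(labels.max()) + 1

    # Sub-cluster 0 keeps cluster_id; sub-cluster j > 0 becomes next_id + j - 1.
    mapped = torch.where(
        sub_labels == 0,
        torch.full_like(sub_labels, cluster_id),
        sub_labels + next_id - 1,
    )

    new_labels = labels.clone()
    new_labels[mask] = mapped
    return new_labels


labels = torch.tensor([0, 1, 0, 2, 0])
sub_labels = torch.tensor([0, 1, 1])  # for the three points in cluster 0
print(relabel_after_split(labels, 0, sub_labels).tolist())  # [0, 1, 3, 2, 3]
```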
### GPU Usage
To use your GPU, make sure the input tensor is on a CUDA device:
```python
x_gpu = torch.randn(1_000_000, 128, device="cuda") # Create data directly on GPU
n_clusters_gpu = 100
(centers_gpu, labels_gpu) = kmeans(
    x_gpu,
    n_clusters=n_clusters_gpu,
    distance_metric="cosine",  # Often used for embeddings
    chunk_size=64000,  # Process assignments in chunks to limit peak GPU memory
    show_progress=True,
)
print(f"GPU K-Means finished. Centers on: {centers_gpu.device}, Labels on: {labels_gpu.device}")
```
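The effect of `chunk_size` can be illustrated with a standalone sketch of chunked nearest-center assignment under cosine distance. It is not the library's implementation; `assign_cosine_chunked` is a hypothetical function, and the memory figures assume `float32` distances. Under that assumption, a full 1,000,000 x 100 distance matrix takes about 400 MB, while a 64,000 x 100 chunk takes about 25.6 MB.

```python
import torch
import torch.nn.functional as F


def assign_cosine_chunked(
    x: torch.Tensor, centers: torch.Tensor, chunk_size: int
) -> torch.Tensor:
    """Hypothetical helper: index of the nearest center (cosine distance) per row of x."""
    centers_n = F.normalize(centers, dim=1)
    labels = torch.empty(x.shape[0], dtype=torch.long, device=x.device)

    for start in range(0, x.shape[0], chunk_size):
        chunk = F.normalize(x[start:start + chunk_size], dim=1)
        # Cosine distance = 1 - cosine similarity. Only a (chunk, k) matrix
        # exists at any time, never the full (n, k) matrix.
        dist = 1.0 - chunk @ centers_n.T
        labels[start:start + chunk_size] = dist.argmin(dim=1)

    return labels


x = torch.tensor([[1.0, 0.0], [0.0, 2.0], [3.0, 0.1], [-0.1, 5.0]])
centers = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
print(assign_cosine_chunked(x, centers, chunk_size=3).tolist())  # [0, 1, 0, 1]
```

The chunk size changes peak memory and the number of loop iterations, not the resulting labels.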
## Contributing
Contributions are very welcome! If you find a bug, have a feature request, or want to contribute code, please feel free to:
1. Open an issue on the [GitLab Issues page](https://gitlab.com/hassonofer/pt_kmeans/issues).
2. Submit a Merge Request.
Please ensure your code adheres to the existing style (Black, isort) and passes all tests.
## License
This project is licensed under the Apache-2.0 License - see the [LICENSE](https://gitlab.com/hassonofer/pt_kmeans/blob/main/LICENSE) file for details.