pt-kmeans


| Field | Value |
| --- | --- |
| Name | pt-kmeans |
| Version | 0.1.1 |
| home_page | None |
| Summary | K-Means and Hierarchical K-Means implementation in PyTorch |
| upload_time | 2025-07-28 17:23:49 |
| maintainer | None |
| docs_url | None |
| author | Ofer Hasson |
| requires_python | >=3.11 |
| license | None |
| keywords | pytorch, kmeans, clustering, machine-learning, unsupervised-learning, hierarchical-clustering |

# PyTorch KMeans

## Introduction

`pt_kmeans` is a pure PyTorch implementation of the popular K-Means clustering algorithm, designed for seamless integration into PyTorch-based machine learning pipelines. It offers high performance on both CPU and GPU (CUDA), along with advanced features like K-Means++ initialization, hierarchical clustering, and cluster splitting, all while maintaining full PyTorch tensor compatibility.

Unlike K-Means implementations that require data transfers to NumPy or other libraries, `pt_kmeans` keeps your data on the PyTorch device (CPU or GPU) throughout the entire process, minimizing overhead and maximizing efficiency for large-scale datasets.

## Features

- **Pure PyTorch**: No external dependencies beyond PyTorch itself. All computations are performed using PyTorch tensors, making it ideal for integration with deep learning workflows.
- **Self-Contained & Portable**: The entire implementation resides in a single file, allowing for easy integration by simply copying the file into your project or an existing module.
- **CPU & GPU Support**: Runs on CPU and on CUDA GPUs. Optimized for CPU performance and efficient on GPUs.
- **K-Means++ Initialization**: Intelligent seeding of initial centroids for faster convergence and better clustering results.
- **L2 and Cosine Distance**: Supports the standard Euclidean (L2) distance and Cosine distance for various data types and applications (e.g., embeddings).
- **Chunked Processing**: Handles very large datasets by computing assignments in memory-friendly chunks, which helps avoid out-of-memory (OOM) errors.
- **Reproducibility**: Full control over randomness via `random_seed` for consistent results.
- **Hierarchical K-Means**: Implements a bottom-up hierarchical clustering approach, useful for creating multi-level cluster structures.
- **Cluster Splitting**: Provides a utility to refine existing clusters by splitting a single cluster into multiple sub-clusters.
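
To make the chunked-processing idea concrete, here is a minimal sketch of a chunked nearest-center assignment in plain PyTorch. It is an illustration only, not the code `pt_kmeans` uses: the helper name `assign_in_chunks` is hypothetical, and it handles L2 distance only.

```python
import torch


def assign_in_chunks(x: torch.Tensor, centers: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Return the index of the nearest center (L2) for every row of x.

    Hypothetical helper for illustration; not part of the pt_kmeans API.
    """
    labels = torch.empty(x.shape[0], dtype=torch.long, device=x.device)
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start : start + chunk_size]
        # Only a (len(chunk), n_centers) distance matrix exists at any one time
        distances = torch.cdist(chunk, centers)
        labels[start : start + chunk.shape[0]] = distances.argmin(dim=1)
    return labels


torch.manual_seed(0)
x = torch.randn(1000, 8)
centers = torch.randn(5, 8)

chunked = assign_in_chunks(x, centers, chunk_size=128)
full = torch.cdist(x, centers).argmin(dim=1)

# Fraction of points that get the same label with and without chunking
print((chunked == full).float().mean().item())
```

Chunking trades a Python-level loop for a smaller temporary distance matrix, so the labels should agree with the unchunked computation up to floating-point ties.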

## Installation

`pt_kmeans` requires PyTorch (`torch>=2.4.0` recommended).

First, ensure you have PyTorch installed (refer to the [official PyTorch website](https://pytorch.org/get-started/locally/) for installation instructions specific to your system and CUDA version).

Then, install `pt_kmeans` directly from PyPI:

```bash
pip install pt_kmeans
```

## Quick Start & Usage Examples

Here's how to get started with `pt_kmeans`.

```python
import torch
import matplotlib.pyplot as plt  # Optional: only needed for the visualization step

from pt_kmeans import hierarchical_kmeans
from pt_kmeans import kmeans
from pt_kmeans import split_cluster
```

### Basic K-Means Clustering

```python
# 1. Generate some synthetic data for demonstration
# Three distinct clusters
data = torch.cat([
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 0.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([5.0, 5.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 5.0]),
])

# Move data to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data = data.to(device)

n_clusters = 3
random_seed = 0

# 2. Run K-Means
print(f"Running K-Means on {device}...")
(centers, labels) = kmeans(
    data,
    n_clusters=n_clusters,
    max_iters=100,
    tol=1e-4,
    distance_metric="l2",      # or "cosine"
    init_method="kmeans++",    # or "random"
    chunk_size=None,           # Process all at once
    random_seed=random_seed,
)

print("\nK-Means Results:")
print(f"Final Centers Shape: {centers.shape}")
print(f"First 5 Labels: {labels[:5]}")
print(f"Unique Labels: {torch.unique(labels)}")

# 3. (Optional) Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0].cpu(), data[:, 1].cpu(), c=labels.cpu(), cmap="viridis", s=10, alpha=0.7)
plt.scatter(centers[:, 0].cpu(), centers[:, 1].cpu(), c="red", marker="X", s=200, label="Centers")
plt.title("K-Means Clustering Result")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()
```
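
For readers curious what `init_method="kmeans++"` refers to, the following is a minimal sketch of standard K-Means++ seeding (D² sampling) in plain PyTorch. It is not taken from the `pt_kmeans` source; the function name `kmeans_plus_plus_init` and its `seed` parameter are assumptions made for this example.

```python
import torch


def kmeans_plus_plus_init(x: torch.Tensor, n_clusters: int, seed: int = 0) -> torch.Tensor:
    """Pick n_clusters rows of x as initial centers using K-Means++ sampling.

    Illustrative helper; the name and signature are not part of pt_kmeans.
    """
    generator = torch.Generator(device=x.device).manual_seed(seed)

    # First center: one data point chosen uniformly at random
    first = torch.randint(x.shape[0], (1,), generator=generator, device=x.device)
    centers = [x[first.item()]]

    # Squared distance from every point to its closest chosen center so far
    closest_sq = (x - centers[0]).pow(2).sum(dim=1)

    for _ in range(1, n_clusters):
        # Next center: sampled with probability proportional to closest_sq,
        # so points far from all current centers are favored
        idx = torch.multinomial(closest_sq, 1, generator=generator)
        centers.append(x[idx.item()])
        new_sq = (x - centers[-1]).pow(2).sum(dim=1)
        closest_sq = torch.minimum(closest_sq, new_sq)

    return torch.stack(centers)


torch.manual_seed(0)
points = torch.cat([
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 0.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([5.0, 5.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 5.0]),
])
init_centers = kmeans_plus_plus_init(points, n_clusters=3)
print(init_centers.shape)  # torch.Size([3, 2])
```

Because already-chosen points have zero squared distance to themselves, they have zero probability of being sampled again, which spreads the seeds across the data.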

### Hierarchical K-Means

Build a multi-level clustering structure.

```python
# Use the 'data' generated in the previous example
n_clusters_levels = [15, 5, 3]  # Number of clusters at each level

print(f"Running Hierarchical K-Means on {device}...")
results = hierarchical_kmeans(
    data,
    n_clusters=n_clusters_levels,
    max_iters=100,
    tol=1e-4,
    distance_metric="l2",
    init_method="kmeans++",
    random_seed=random_seed,
)

print("\nHierarchical K-Means Results:")
for i, level_result in enumerate(results):
    print(f"Level {i} (n_clusters={n_clusters_levels[i]}):")
    print(f"  Centers Shape: {level_result['centers'].shape}")
    print(f"  Assignment Shape (original data): {level_result['assignment'].shape}")
    print(f"  Unique Assignments: {torch.unique(level_result['assignment'])}")
```
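
Each level reports an assignment for the original data points. One way to think about this in a bottom-up hierarchy is as a composition of mappings: point to fine cluster, then fine cluster to coarse cluster. The sketch below uses hand-written tensors to show that composition; it is an assumption about how levels relate, not a description of the library's internals.

```python
import torch

# Hypothetical fine level: 8 points assigned to 4 fine clusters
fine_assignment = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])

# Hypothetical coarse level: each of the 4 fine clusters assigned to 1 of 2 coarse clusters
fine_to_coarse = torch.tensor([0, 0, 1, 1])

# Tensor indexing composes the two mappings: point -> fine cluster -> coarse cluster
coarse_assignment = fine_to_coarse[fine_assignment]
print(coarse_assignment.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1]
```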

### Splitting an Existing Cluster

Refine a specific cluster by breaking it down into sub-clusters.

```python
# First, run a basic K-Means to get initial labels and centers
(initial_centers, initial_labels) = kmeans(
    data, n_clusters=3, random_seed=random_seed, show_progress=False
)

cluster_to_split_id = 0  # Choose a cluster to split
num_sub_clusters = 2

print(f"Splitting Cluster {cluster_to_split_id} into {num_sub_clusters} sub-clusters on {device}...")
(new_sub_centers, updated_labels) = split_cluster(
    data,
    initial_labels,
    cluster_id=cluster_to_split_id,
    n_clusters=num_sub_clusters,
    max_iters=50,
    distance_metric="l2",
    random_seed=random_seed + 1,
)

print("\nCluster Splitting Results:")
print(f"New Sub-Centers Shape: {new_sub_centers.shape}")
print(f"Updated Labels Shape: {updated_labels.shape}")
print(f"Unique Labels in updated set: {torch.unique(updated_labels).tolist()}")

# Compare the label sets before and after the split: the original cluster_id is either kept or replaced, and new label IDs appear for the sub-clusters
print(f"Original unique labels: {torch.unique(initial_labels).tolist()}")
print(f"Updated unique labels: {torch.unique(updated_labels).tolist()}")
```
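
To illustrate how a split can be folded back into a flat label vector, here is a small sketch of one plausible relabeling scheme: the first sub-cluster keeps the original ID and the remaining sub-clusters receive fresh IDs. This numbering is an assumption for illustration; `split_cluster` may assign IDs differently, so check the returned labels rather than relying on it.

```python
import torch

labels = torch.tensor([0, 0, 0, 0, 1, 1, 2, 2])
cluster_id = 0

# Hypothetical sub-cluster labels for the four points of cluster 0,
# as any clustering of that subset might produce
sub_labels = torch.tensor([0, 1, 0, 1])

mask = labels == cluster_id
next_free_id = int(labels.max()) + 1  # first label ID not yet in use (3 here)

updated = labels.clone()
# Sub-cluster 0 keeps cluster_id; sub-cluster k > 0 gets next_free_id + k - 1
updated[mask] = torch.where(
    sub_labels == 0,
    torch.tensor(cluster_id),
    next_free_id + sub_labels - 1,
)
print(updated.tolist())  # [0, 3, 0, 3, 1, 1, 2, 2]
```

Points outside the split cluster keep their labels, so only the masked positions change.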

### GPU Usage

To use your GPU, simply ensure your input tensor `x` is on a CUDA device:

```python
x_gpu = torch.randn(1_000_000, 128, device="cuda")  # Create data directly on GPU
n_clusters_gpu = 100

(centers_gpu, labels_gpu) = kmeans(
    x_gpu,
    n_clusters=n_clusters_gpu,
    distance_metric="cosine",  # Common choice for embedding vectors
    chunk_size=64_000,         # Bounds peak memory on large datasets by processing rows in chunks
    show_progress=True,
)

print(f"GPU K-Means finished. Centers on: {centers_gpu.device}, Labels on: {labels_gpu.device}")
```
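
A rough back-of-the-envelope calculation shows why `chunk_size` matters. The sketch below estimates the size of a point-to-center distance matrix for the numbers used above, assuming float32 values and that this matrix is the dominant temporary; actual peak memory depends on the implementation and on other intermediate buffers. The helper `distance_matrix_mib` is hypothetical.

```python
def distance_matrix_mib(n_rows: int, n_clusters: int, bytes_per_value: int = 4) -> float:
    """Size in MiB of an (n_rows, n_clusters) matrix; float32 by default."""
    return n_rows * n_clusters * bytes_per_value / 2**20


full = distance_matrix_mib(1_000_000, 100)
chunked = distance_matrix_mib(64_000, 100)

print(f"All rows at once:     {full:.1f} MiB")     # 381.5 MiB
print(f"Chunk of 64,000 rows: {chunked:.1f} MiB")  # 24.4 MiB
```

The gap widens linearly with the number of clusters, so chunking becomes more important as `n_clusters` grows.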

## Contributing

Contributions are very welcome! If you find a bug, have a feature request, or want to contribute code, please feel free to:

1. Open an issue on the [GitLab Issues page](https://gitlab.com/hassonofer/pt_kmeans/-/issues).
2. Submit a Merge Request.

Please ensure your code adheres to the existing style (Black, isort) and passes all tests.

## License

This project is licensed under the Apache-2.0 License - see the [LICENSE](https://gitlab.com/hassonofer/pt_kmeans/blob/main/LICENSE) file for details.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pt-kmeans",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "pytorch, kmeans, clustering, machine-learning, unsupervised-learning, hierarchical-clustering",
    "author": "Ofer Hasson",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/a9/2a/1e7487538fa4a59bb944e5d1c2539a7dccf82e3cc1add4505335c2464c01/pt_kmeans-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "K-Means and Hierarchical K-Means implementation in PyTorch",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://gitlab.com/hassonofer/pt_kmeans",
        "Issues": "https://gitlab.com/hassonofer/pt_kmeans/-/issues"
    },
    "split_keywords": [
        "pytorch",
        " kmeans",
        " clustering",
        " machine-learning",
        " unsupervised-learning",
        " hierarchical-clustering"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e946aab8a189b9b8488f341911a70b1fccae054bb103a3979c68d828f8ecc803",
                "md5": "d228e4940e9f5a069926c0d567d36055",
                "sha256": "b4a31c44f7604feca9a78dbf4a5899fddd3a1c7d38f2889f69261531fc0a0cfd"
            },
            "downloads": -1,
            "filename": "pt_kmeans-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d228e4940e9f5a069926c0d567d36055",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 12313,
            "upload_time": "2025-07-28T17:23:48",
            "upload_time_iso_8601": "2025-07-28T17:23:48.446917Z",
            "url": "https://files.pythonhosted.org/packages/e9/46/aab8a189b9b8488f341911a70b1fccae054bb103a3979c68d828f8ecc803/pt_kmeans-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a92a1e7487538fa4a59bb944e5d1c2539a7dccf82e3cc1add4505335c2464c01",
                "md5": "56f059f149a688d8209951f02c390f3a",
                "sha256": "d31a44e36bde61d4258046081d2e58787df5c5600ae0965aaca95f0cf451082d"
            },
            "downloads": -1,
            "filename": "pt_kmeans-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "56f059f149a688d8209951f02c390f3a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 14603,
            "upload_time": "2025-07-28T17:23:49",
            "upload_time_iso_8601": "2025-07-28T17:23:49.273335Z",
            "url": "https://files.pythonhosted.org/packages/a9/2a/1e7487538fa4a59bb944e5d1c2539a7dccf82e3cc1add4505335c2464c01/pt_kmeans-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-28 17:23:49",
    "github": false,
    "gitlab": true,
    "bitbucket": false,
    "codeberg": false,
    "gitlab_user": "hassonofer",
    "gitlab_project": "pt_kmeans",
    "lcname": "pt-kmeans"
}
        