pt-kmeans

Name	pt-kmeans
Version	0.8.3
home_page	None
Summary	K-Means and Hierarchical K-Means implementation in PyTorch
upload_time	2025-09-07 16:11:29
maintainer	None
docs_url	None
author	Ofer Hasson
requires_python	>=3.11
license	None
keywords	pytorch kmeans clustering machine-learning unsupervised-learning hierarchical-clustering

# PyTorch KMeans

## Introduction

`pt_kmeans` is a pure PyTorch implementation of the popular K-Means clustering algorithm, designed for seamless integration into PyTorch-based machine learning pipelines.
It offers high performance on both CPU and GPU (CUDA), along with advanced features like K-Means++ initialization, hierarchical clustering, and cluster splitting, all while maintaining full PyTorch tensor compatibility.

A core design principle of `pt_kmeans` is **efficient memory management for large datasets**.
While you can pass data that is already on a GPU, the library is optimized to let your main input data (`x`) reside in **CPU memory (typically more abundant)**.
Computations are then performed on a specified `device` (e.g., CUDA GPU) by moving only necessary data chunks or tensors, maximizing utilization of faster hardware without exceeding its memory limits.
Final results (cluster centers and assignments) are consistently returned on the CPU for ease of post-processing, visualization, or saving.
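
The chunked, device-aware pattern described above can be sketched in plain PyTorch. This is a minimal illustration, not the library's code: the helper name `assign_in_chunks` and its signature are hypothetical, and it only covers nearest-center assignment with L2 distance.

```python
import torch


def assign_in_chunks(x, centers, chunk_size, device):
    """Return the index of the nearest center (L2) for every row of x.

    Hypothetical helper for illustration; not part of pt_kmeans.
    """
    centers_dev = centers.to(device)
    labels = torch.empty(x.shape[0], dtype=torch.long)  # result stays on the CPU
    for start in range(0, x.shape[0], chunk_size):
        chunk = x[start:start + chunk_size].to(device)  # only this slice is moved
        dists = torch.cdist(chunk, centers_dev)         # (chunk rows, K) distances
        labels[start:start + chunk_size] = dists.argmin(dim=1).cpu()
    return labels


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(10_000, 16)   # input stays in CPU memory
centers = torch.randn(8, 16)
labels = assign_in_chunks(x, centers, chunk_size=2_048, device=device)
print(labels.shape, labels.device)
```

Peak device memory here scales with `chunk_size` rather than with the number of rows in `x`, which is the point of the design.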

## Features

- **Pure PyTorch**: No external dependencies beyond PyTorch itself. All computations are performed using PyTorch tensors, making it ideal for integration with deep learning workflows.
- **Self-Contained & Portable**: The entire implementation resides in a single file, allowing for easy integration by simply copying the file into your project or an existing module.
- **CPU & GPU Support**: Explicitly control computation device (CPU or CUDA) for optimal performance and memory usage.
- **K-Means++ Initialization**: Intelligent seeding of initial centroids for faster convergence and better clustering results.
- **L2 and Cosine Distance**: Supports the standard Euclidean (L2) distance and Cosine distance for various data types and applications (e.g., embeddings).
- **Chunked Distance Computations**: Improves memory efficiency by processing distance calculations in chunks directly within the `compute_distance` function. Both cluster assignment (`_assign_clusters`) and K-Means++ initialization (`_kmeans_plusplus_init`) use this mechanism, which makes large datasets tractable and helps avoid Out-Of-Memory (OOM) errors on memory-constrained devices.
- **Reproducibility**: Full control over randomness via `random_seed` for consistent results.
- **Hierarchical K-Means**: Implements a bottom-up hierarchical clustering approach, useful for creating multi-level cluster structures.
- **Resampling for Hierarchical K-Means**: Refine hierarchical cluster centers through resampling, which can lead to more robust estimations for certain datasets. Configurable with `method="resampled"`, `n_samples`, and `n_resamples` parameters.
- **Cluster Splitting**: Provides a utility to refine existing clusters by splitting a single cluster into multiple sub-clusters.
- **Automatic Caching**: Resume interrupted computations with the `cache_dir` parameter. Useful for long-running jobs on large datasets: it caches initial centers and final results, so an interrupted run can recover automatically.
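
To make the K-Means++ and reproducibility bullets concrete, here is a minimal sketch of standard K-Means++ seeding (D² sampling) driven by a seeded `torch.Generator`. The function name `kmeans_plusplus_seed` is hypothetical, and the library's own `_kmeans_plusplus_init` may differ in its details (for example, chunking or cosine distance).

```python
import torch


def kmeans_plusplus_seed(x, k, generator):
    """Pick k rows of x as initial centers using D^2 sampling (illustrative only)."""
    n = x.shape[0]
    first = torch.randint(n, (1,), generator=generator).item()
    chosen = [first]
    # Squared distance from every point to its nearest chosen center so far
    closest_sq = ((x - x[first]) ** 2).sum(dim=1)
    for _ in range(k - 1):
        # Sample the next center with probability proportional to closest_sq
        idx = torch.multinomial(closest_sq, 1, generator=generator).item()
        chosen.append(idx)
        new_sq = ((x - x[idx]) ** 2).sum(dim=1)
        closest_sq = torch.minimum(closest_sq, new_sq)
    return x[chosen]


x = torch.randn(500, 2)
seeds_a = kmeans_plusplus_seed(x, 4, torch.Generator().manual_seed(0))
seeds_b = kmeans_plusplus_seed(x, 4, torch.Generator().manual_seed(0))
print(torch.equal(seeds_a, seeds_b))  # same seed, same initial centers
```

Points that are already chosen have zero weight, so they cannot be drawn again; points far from every current center are the most likely to be picked next.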

### Example: Hierarchical K-Means

![Hierarchical K-Means](docs/img/hierarchical_kmeans.png)

### Example: K-Means with Resampling

This implementation follows the method described in
[Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach](https://arxiv.org/abs/2405.15613).

![Voronoi K-Means](docs/img/voronoi_kmeans.png)

## Installation

`pt_kmeans` requires PyTorch (`torch>=2.4.0` recommended).

First, ensure you have PyTorch installed (refer to the [official PyTorch website](https://pytorch.org/get-started/locally/) for installation instructions specific to your system and CUDA version).

Then, install `pt_kmeans` directly from PyPI:

```bash
pip install pt-kmeans
```

## Quick Start & Usage Examples

Here's how to get started with `pt_kmeans`.

```python
import torch
import matplotlib.pyplot as plt

from pt_kmeans import hierarchical_kmeans
from pt_kmeans import kmeans
from pt_kmeans import predict
from pt_kmeans import split_cluster
```

### Basic K-Means Clustering

```python
# 1. Generate some synthetic data for demonstration
# Three distinct clusters
data = torch.concat([
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 0.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([5.0, 5.0]),
    torch.randn(100, 2) * 0.5 + torch.tensor([0.0, 5.0]),
])

# Define the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n_clusters = 3
random_seed = 0

# 2. Run K-Means
print(f"Running K-Means on {device}...")
(centers, labels) = kmeans(
    data,
    n_clusters=n_clusters,
    max_iters=100,
    tol=1e-4,
    distance_metric="l2",      # or "cosine"
    init_method="kmeans++",    # or "random"
    chunk_size=None,           # Process all at once
    random_seed=random_seed,
    device=device,
)

print("\nK-Means Results:")
print(f"Final Centers Shape: {centers.shape}")
print(f"First 5 Labels: {labels[:5]}")
print(f"Unique Labels: {torch.unique(labels)}")

# 3. (Optional) Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0].cpu(), data[:, 1].cpu(), c=labels.cpu(), cmap="viridis", s=10, alpha=0.7)
plt.scatter(centers[:, 0].cpu(), centers[:, 1].cpu(), c="red", marker="X", s=200, label="Centers")
plt.title("K-Means Clustering Result")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()
```
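
Once you have centers and labels, two quick sanity checks are the cluster sizes and the inertia (the sum of squared distances from each point to its assigned center). The sketch below uses toy tensors in place of the `data`, `centers`, and `labels` from the example above; the helper name is hypothetical, and it assumes the labels are integer (`torch.long`) indices into the centers.

```python
import torch


def cluster_sizes_and_inertia(x, centers, labels):
    """Count points per cluster and sum squared point-to-center distances."""
    sizes = torch.bincount(labels, minlength=centers.shape[0])
    sq_dist = ((x - centers[labels]) ** 2).sum(dim=1)
    return sizes, sq_dist.sum().item()


# Toy inputs standing in for the outputs of a clustering run
x = torch.tensor([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = torch.tensor([[0.0, 1.0], [10.0, 0.0]])
labels = torch.tensor([0, 0, 1])

sizes, inertia = cluster_sizes_and_inertia(x, centers, labels)
print(sizes.tolist(), inertia)  # [2, 1] 2.0
```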

### Assigning New Data with `predict`

After training, assign new data points to the learned clusters.

```python
# Use the 'centers' obtained from the basic K-Means example
# Generate some new data
new_data = torch.concat([
    torch.randn(10, 2) * 0.5 + torch.tensor([0.2, 0.2]),
    torch.randn(10, 2) * 0.5 + torch.tensor([5.2, 5.2]),
])

print(f"\nAssigning new data points using 'predict' on {device}...")
new_labels = predict(
    new_data,
    centers, # Use the centers from the previous kmeans run
    distance_metric="l2",
    device=device,
)

print(f"New Data Shape: {new_data.shape}")
print(f"Labels for new data: {new_labels.tolist()}")
print(f"Unique labels for new data: {torch.unique(new_labels).tolist()}")

# (Optional) Visualize new data with existing clusters
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0].cpu(), data[:, 1].cpu(), c=labels.cpu(), cmap="viridis", s=10, alpha=0.3, label="Training Data")
plt.scatter(centers[:, 0].cpu(), centers[:, 1].cpu(), c="red", marker="X", s=200, label="Centers")
plt.scatter(
    new_data[:, 0].cpu(),
    new_data[:, 1].cpu(),
    c=new_labels.cpu(),
    marker="o",
    edgecolors="black",
    s=100,
    linewidth=1.5,
    cmap="viridis",
    label="New Data",
)
plt.title("Prediction on New Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()
```
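
The choice of `distance_metric` can change which center a point is assigned to. The following sketch shows why, using a hypothetical `nearest_center` helper written in plain PyTorch; it illustrates the two metrics in general and is not the library's `predict` implementation.

```python
import torch
import torch.nn.functional as F


def nearest_center(x, centers, metric="l2"):
    """Index of the closest center per row of x (illustrative helper)."""
    if metric == "l2":
        dists = torch.cdist(x, centers)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity of the unit-normalized vectors
        dists = 1.0 - F.normalize(x, dim=1) @ F.normalize(centers, dim=1).T
    else:
        raise ValueError(f"unknown metric: {metric}")
    return dists.argmin(dim=1)


centers = torch.tensor([[1.0, 0.0], [0.0, 10.0]])
point = torch.tensor([[0.0, 2.0]])

print(nearest_center(point, centers, "l2").item())      # 0: closer in Euclidean terms
print(nearest_center(point, centers, "cosine").item())  # 1: same direction as center 1
```

Cosine distance ignores vector length, which is usually what you want for embeddings; L2 is sensitive to magnitude.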

### Hierarchical K-Means

Build a multi-level clustering structure.

```python
# Use the 'data' generated in the previous example
n_clusters_levels = [15, 5, 3] # Define number of clusters for each level

print(f"Running Hierarchical K-Means on {device}...")
results = hierarchical_kmeans(
    data,
    n_clusters=n_clusters_levels,
    max_iters=100,
    tol=1e-4,
    distance_metric="l2",
    init_method="kmeans++",
    random_seed=random_seed,
    device=device
)

print("\nHierarchical K-Means Results:")
for i, level_result in enumerate(results):
    print(f"Level {i} (n_clusters={n_clusters_levels[i]}):")
    print(f"  Centers Shape: {level_result['centers'].shape}")
    print(f"  Assignment Shape (original data): {level_result['assignment'].shape}")
    print(f"  Unique Assignments: {torch.unique(level_result['assignment'])}")
```
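
Each level reports an `assignment` with one entry per original data point. One way to picture how a coarser level relates to a finer one is index composition: if every fine cluster has a parent coarse cluster, indexing the parent map with the fine assignment gives the coarse assignment per point. The tensors below are made up for illustration, and whether `pt_kmeans` computes its assignments exactly this way is an assumption.

```python
import torch

# Level 0: each of 6 points belongs to one of 4 fine clusters
level0_assignment = torch.tensor([0, 0, 1, 2, 3, 3])
# Level 1: each of the 4 fine clusters belongs to one of 2 coarse clusters
level1_of_level0 = torch.tensor([0, 0, 1, 1])

# Indexing composes the two maps: point -> fine cluster -> coarse cluster
level1_assignment = level1_of_level0[level0_assignment]
print(level1_assignment.tolist())  # [0, 0, 0, 1, 1, 1]
```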

### Hierarchical K-Means Using Resampling

Build a multi-level clustering structure with enhanced center refinement using resampling.

```python
print(f"Running Hierarchical K-Means (resampled method) on {device}...")
results_resampled_method = hierarchical_kmeans(
    data,
    n_clusters=n_clusters_levels,
    max_iters=100,
    tol=1e-4,
    distance_metric="l2",
    init_method="kmeans++",
    random_seed=random_seed,
    device=device,
    method="resampled",
    n_samples=[5, 3, 0],
)

print("\nHierarchical K-Means Results (resampled method):")
for i, level_result in enumerate(results_resampled_method):
    print(f"Level {i} (n_clusters={n_clusters_levels[i]}):")
    print(f"  Centers Shape: {level_result['centers'].shape}")
    print(f"  Assignment Shape (original data): {level_result['assignment'].shape}")
    print(f"  Unique Assignments: {torch.unique(level_result['assignment'])}")
```

### Splitting an Existing Cluster

Refine a specific cluster by breaking it down into sub-clusters.

```python
# First, run a basic K-Means to get initial labels and centers
(initial_centers, initial_labels) = kmeans(
    data, n_clusters=3, random_seed=random_seed, show_progress=False, device=device
)

cluster_to_split_id = 0  # Choose a cluster to split
num_sub_clusters = 2

print(f"Splitting Cluster {cluster_to_split_id} into {num_sub_clusters} sub-clusters, computations on {device}...")
(new_sub_centers, updated_labels) = split_cluster(
    data,
    initial_labels,
    cluster_id=cluster_to_split_id,
    n_clusters=num_sub_clusters,
    max_iters=50,
    distance_metric="l2",
    random_seed=random_seed + 1,
    device=device
)

print("\nCluster Splitting Results:")
print(f"New Sub-Centers Shape: {new_sub_centers.shape}")
print(f"Updated Labels Shape: {updated_labels.shape}")
print(f"Unique Labels in updated set: {torch.unique(updated_labels).tolist()}")

# Compare the label sets before and after the split to see which IDs were kept and which were added
print(f"Original unique labels: {torch.unique(initial_labels).tolist()}")
print(f"Updated unique labels: {torch.unique(updated_labels).tolist()}")
```
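
To see what "updated labels" can look like after a split, here is a sketch of one possible relabeling convention: the first sub-cluster keeps the original ID and the remaining sub-clusters get fresh IDs after the existing ones. This convention and the helper `relabel_after_split` are assumptions for illustration; check the output of `split_cluster` for the convention the library actually uses.

```python
import torch


def relabel_after_split(labels, cluster_id, sub_labels, n_existing):
    """Merge sub-cluster labels back into the full label vector.

    sub_labels holds a sub-cluster index (0..m-1) for each point that
    currently has label cluster_id, in the same order as they appear.
    """
    updated = labels.clone()
    mask = labels == cluster_id
    # Sub-cluster 0 keeps the original ID; others get IDs n_existing, n_existing + 1, ...
    kept = torch.full_like(sub_labels, cluster_id)
    fresh = n_existing + sub_labels - 1
    updated[mask] = torch.where(sub_labels == 0, kept, fresh)
    return updated


labels = torch.tensor([0, 1, 0, 2, 0])
sub_labels = torch.tensor([0, 1, 1])  # for the three points labeled 0
updated = relabel_after_split(labels, cluster_id=0, sub_labels=sub_labels, n_existing=3)
print(updated.tolist())  # [0, 1, 3, 2, 3]
```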

### GPU Usage

`pt_kmeans` is designed to be memory-efficient, allowing you to process datasets larger than your GPU's VRAM.
The general strategy is:

1. Keep your primary dataset `x` on the CPU (if very large).
2. Specify a `device` (e.g., `"cuda"`) for computations. `pt_kmeans` will move chunks of `x` or the relevant centers to this device as needed.

Here's an example demonstrating this, emphasizing `chunk_size` and the `device` parameter:

```python
# Example 1: Large dataset on CPU, compute on GPU with chunking
large_data_cpu = torch.randn(1_000_000, 128, device=torch.device("cpu"))
n_clusters_large = 1000

(centers_large, labels_large) = kmeans(
    large_data_cpu,
    n_clusters=n_clusters_large,
    distance_metric="cosine",
    chunk_size=64000,               # Important for larger datasets on GPU to manage memory
    show_progress=True,
    device=torch.device("cuda"),    # Tell kmeans to use the GPU for calculations
)

print(f"GPU K-Means finished. Centers on: {centers_large.device}, Labels on: {labels_large.device}")

# Example 2: Data already on GPU, compute on GPU (chunking still applies for iterations)
# Here, 'x' is already on GPU. By default, 'kmeans' will use 'x.device' for computation.
x_gpu = torch.randn(1_000_000, 128, device=torch.device("cuda"))
n_clusters_gpu = 100

(centers_gpu, labels_gpu) = kmeans(
    x_gpu,
    n_clusters=n_clusters_gpu,
    distance_metric="cosine",
    init_method="kmeans++",
    chunk_size=64000,              # Still important for iterative steps, especially for very large N and K
    show_progress=True,
    # device=torch.device("cuda"), # Not mandatory to pass 'device' here, as it defaults to x.device
)
print(f"GPU K-Means finished. Centers on: {centers_gpu.device}, Labels on: {labels_gpu.device}")
```
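
A rough way to pick `chunk_size` is to estimate the size of the point-to-center distance matrix, which has one value per (row, center) pair. The arithmetic below assumes `float32` values (4 bytes each) and counts only that matrix, not the data chunk or any temporaries, so treat it as a lower bound.

```python
def distance_matrix_bytes(n_rows, n_clusters, bytes_per_value=4):
    """Memory for an (n_rows x n_clusters) distance matrix, float32 by default."""
    return n_rows * n_clusters * bytes_per_value


# Numbers from Example 1: 1,000,000 points, 1000 clusters, chunk_size=64000
full = distance_matrix_bytes(1_000_000, 1000)
chunk = distance_matrix_bytes(64_000, 1000)

print(f"all rows at once: {full / 1e9:.2f} GB")       # 4.00 GB
print(f"one 64000-row chunk: {chunk / 1e6:.0f} MB")   # 256 MB
```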

### Memory-Efficient Processing for Large Datasets

`pt_kmeans` is specifically designed to handle datasets that are too large to fit entirely into RAM, leveraging `numpy.memmap` to keep data on disk while processing.
By using `torch.from_numpy()` with a `memmap` array and specifying a `chunk_size`, `pt_kmeans` will only load chunks of data into memory (and to the specified `device`) as needed, allowing you to cluster truly massive datasets.

For example, `pt_kmeans` will accept input such as:

```python
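import numpy as np
import torch

dataset_path = "features.npy"  # placeholder: path to your own .npy file

# mmap_mode="r+" keeps the array on disk and writable, which torch.from_numpy expects
x_memmap = torch.from_numpy(np.load(dataset_path, mmap_mode="r+"))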

# Now, you can pass x_memmap to kmeans:
# (centers, labels) = kmeans(x_memmap, n_clusters=..., chunk_size=..., device=...)
```
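
If your features are produced in batches, you can write them straight into a `.npy` file without holding the full array in RAM, then open it as a memmap. The sketch below uses NumPy's `np.lib.format.open_memmap`; the file location, sizes, and random data are placeholders.

```python
import os
import tempfile

import numpy as np
import torch

dataset_path = os.path.join(tempfile.mkdtemp(), "features.npy")  # placeholder location
n_rows, n_dims, rows_per_write = 10_000, 32, 2_500

# Create the .npy file on disk and fill it block by block
out = np.lib.format.open_memmap(
    dataset_path, mode="w+", dtype=np.float32, shape=(n_rows, n_dims)
)
for start in range(0, n_rows, rows_per_write):
    # Stand-in for a real batch of features
    out[start:start + rows_per_write] = np.random.rand(rows_per_write, n_dims)
out.flush()
del out

# Re-open as a disk-backed tensor; rows are read from disk only when accessed
x_memmap = torch.from_numpy(np.load(dataset_path, mmap_mode="r+"))
print(x_memmap.shape, x_memmap.dtype)
```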

### Caching for Long-Running Computations

For large datasets where K-Means can take hours or days, use the `cache_dir` parameter to enable automatic checkpointing:

```python
# `large_data` is your (N, D) input tensor and `device` the compute device defined earlier
# Basic K-Means with caching
(centers, labels) = kmeans(
    large_data,
    n_clusters=10000,
    cache_dir="./kmeans_cache",  # Will resume from here if interrupted
    device=device,
)

# Hierarchical K-Means with caching
# Each level gets its own cache subdirectory
results = hierarchical_kmeans(
    large_data,
    n_clusters=[10000, 1000, 100],
    method="resampled",
    n_samples=[50, 20, 0],
    cache_dir="./hierarchical_cache",  # Creates level_0/, level_1/, etc.
    device=device,
)
```
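
The general idea behind this kind of caching is "load the result if it is already on disk, otherwise compute and save it". The sketch below shows that pattern with `torch.save` and `torch.load`; it says nothing about the file names or layout `pt_kmeans` uses inside `cache_dir`, and the helper `cached` is hypothetical.

```python
import os
import tempfile

import torch


def cached(path, compute):
    """Load a tensor from path if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        return torch.load(path)
    result = compute()
    torch.save(result, path)
    return result


cache_file = os.path.join(tempfile.mkdtemp(), "centers.pt")  # placeholder file name
calls = []


def expensive_step():
    calls.append(1)  # track how often the computation really runs
    return torch.arange(6, dtype=torch.float32).reshape(3, 2)


first = cached(cache_file, expensive_step)   # computes and writes the file
second = cached(cache_file, expensive_step)  # reads the file, no recomputation
print(len(calls), torch.equal(first, second))  # 1 True
```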

## Contributing

Contributions are very welcome! If you find a bug, have a feature request, or want to contribute code, please feel free to:

1. Open an issue on the [GitLab Issues page](https://gitlab.com/hassonofer/pt_kmeans/-/issues).
2. Submit a Merge Request (GitLab's equivalent of a pull request).

Please ensure your code adheres to the existing style (Black, isort) and passes all tests.

## License

This project is licensed under the Apache-2.0 License - see the [LICENSE](https://gitlab.com/hassonofer/pt_kmeans/blob/main/LICENSE) file for details.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pt-kmeans",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "pytorch, kmeans, clustering, machine-learning, unsupervised-learning, hierarchical-clustering",
    "author": "Ofer Hasson",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/54/7f/a16b4c096541cfde422ba7feba102e44fcb2b02402327a540675540c7f2f/pt_kmeans-0.8.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "K-Means and Hierarchical K-Means implementation in PyTorch",
    "version": "0.8.3",
    "project_urls": {
        "Homepage": "https://gitlab.com/hassonofer/pt_kmeans",
        "Issues": "https://gitlab.com/hassonofer/pt_kmeans/-/issues"
    },
    "split_keywords": [
        "pytorch",
        " kmeans",
        " clustering",
        " machine-learning",
        " unsupervised-learning",
        " hierarchical-clustering"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "77e63ce903c4df0a3c19c7ef8d8a23d0779336d00c3e45fb3dbcfbfa75a9a3a4",
                "md5": "960953f3818b28b538485385982a51e6",
                "sha256": "ae995a50059df7cd5cea40cf2efb2c5f3cb0c2fa29f9c77333ed845398622835"
            },
            "downloads": -1,
            "filename": "pt_kmeans-0.8.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "960953f3818b28b538485385982a51e6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 18969,
            "upload_time": "2025-09-07T16:11:28",
            "upload_time_iso_8601": "2025-09-07T16:11:28.620982Z",
            "url": "https://files.pythonhosted.org/packages/77/e6/3ce903c4df0a3c19c7ef8d8a23d0779336d00c3e45fb3dbcfbfa75a9a3a4/pt_kmeans-0.8.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "547fa16b4c096541cfde422ba7feba102e44fcb2b02402327a540675540c7f2f",
                "md5": "af8a56fa1db68c7b885123b24c726cdd",
                "sha256": "31964673c4e89404a92c5d3c34aade9808127ceb9623dc50217fc02f09e953a8"
            },
            "downloads": -1,
            "filename": "pt_kmeans-0.8.3.tar.gz",
            "has_sig": false,
            "md5_digest": "af8a56fa1db68c7b885123b24c726cdd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 28084,
            "upload_time": "2025-09-07T16:11:29",
            "upload_time_iso_8601": "2025-09-07T16:11:29.507701Z",
            "url": "https://files.pythonhosted.org/packages/54/7f/a16b4c096541cfde422ba7feba102e44fcb2b02402327a540675540c7f2f/pt_kmeans-0.8.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-07 16:11:29",
    "github": false,
    "gitlab": true,
    "bitbucket": false,
    "codeberg": false,
    "gitlab_user": "hassonofer",
    "gitlab_project": "pt_kmeans",
    "lcname": "pt-kmeans"
}
