kmeans-seeding


Namekmeans-seeding JSON
Version 0.2.1 PyPI version JSON
download
home_pageNone
SummaryFast k-means++ seeding algorithms with C++ implementation and Python bindings
upload_time2025-11-03 10:46:37
maintainerNone
docs_urlNone
authorNone
requires_python>=3.9
licenseMIT
keywords kmeans clustering machine-learning initialization seeding rejection-sampling mcmc
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # kmeans-seeding: Fast k-means++ Initialization Algorithms

[![PyPI version](https://img.shields.io/pypi/v/kmeans-seeding.svg)](https://pypi.org/project/kmeans-seeding/)
[![Python versions](https://img.shields.io/pypi/pyversions/kmeans-seeding.svg)](https://pypi.org/project/kmeans-seeding/)
[![License](https://img.shields.io/pypi/l/kmeans-seeding.svg)](https://github.com/pcshah2004/kmeans-seeding/blob/main/LICENSE)
[![Documentation](https://readthedocs.org/projects/kmeans-seeding/badge/?version=latest)](https://kmeans-seeding.readthedocs.io/)

**Fast, state-of-the-art k-means initialization algorithms implemented in C++ with Python bindings.**

## Features

🚀 **Fast**: C++ implementation with OpenMP parallelization
🎯 **Accurate**: State-of-the-art algorithms with theoretical guarantees
🔌 **Compatible**: Drop-in replacement for sklearn's k-means++ initialization
📦 **Easy to use**: Simple Python API, works with NumPy arrays
🛠️ **Flexible**: Multiple algorithms to choose from

## Algorithms Included

1. **RS-k-means++** (Rejection Sampling) - *Our contribution*
   - Fast approximate D² sampling using rejection sampling
   - Supports FAISS for approximate nearest neighbors
   - Best for large datasets (n > 10,000)

2. **AFK-MC²** (Adaptive Fast k-MC²)
   - MCMC-based sampling without computing all distances
   - Good balance of speed and quality

3. **Fast-LSH k-means++** (Google 2020)
   - Tree embedding with LSH for fast sampling
   - Excellent for high-dimensional data

4. **Standard k-means++**
   - Classic D² sampling algorithm
   - Baseline for comparison

## Installation

### From PyPI (recommended)

```bash
pip install kmeans-seeding
```

### With FAISS support (recommended for large datasets)

```bash
# CPU version
conda install -c pytorch faiss-cpu
pip install kmeans-seeding

# GPU version
conda install -c pytorch faiss-gpu
pip install kmeans-seeding
```

### From source

```bash
git clone https://github.com/pcshah2004/kmeans-seeding.git
cd kmeans-seeding
pip install -e .
```

## Quick Start

```python
from kmeans_seeding import rejection_sampling
from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
X = np.random.randn(10000, 50)

# Get initial centers using RS-k-means++
centers = rejection_sampling(X, n_clusters=100, index_type='LSH')

# Use with sklearn
kmeans = KMeans(n_clusters=100, init=centers, n_init=1)
kmeans.fit(X)
```

## Usage Examples

### RS-k-means++ (Rejection Sampling)

```python
from kmeans_seeding import rejection_sampling

centers = rejection_sampling(
    X,
    n_clusters=100,
    max_iter=50,           # Max rejection sampling iterations
    index_type='LSH',      # FAISS index type: 'Flat', 'LSH', 'IVFFlat', 'HNSW'
    random_state=42
)
```

### AFK-MC² (MCMC Sampling)

```python
from kmeans_seeding import afkmc2

centers = afkmc2(
    X,
    n_clusters=100,
    chain_length=200,      # Markov chain length
    random_state=42
)
```

### Fast-LSH k-means++

```python
from kmeans_seeding import fast_lsh

centers = fast_lsh(
    X,
    n_clusters=100,
    n_trees=4,             # Number of trees for embedding
    random_state=42
)
```

### Standard k-means++

```python
from kmeans_seeding import kmeanspp

centers = kmeanspp(X, n_clusters=100, random_state=42)
```

## Benchmarks

Performance comparison on various datasets:

| Dataset | n | d | Algorithm | Time (s) | Cost Ratio* |
|---------|---|---|-----------|----------|-------------|
| MNIST | 60K | 784 | k-means++ | 45.2 | 1.00 |
| | | | RS-k-means++ (LSH) | **2.1** | 1.02 |
| | | | AFK-MC² | 8.3 | 1.05 |
| CIFAR-10 | 50K | 512 | k-means++ | 38.7 | 1.00 |
| | | | RS-k-means++ (LSH) | **1.8** | 1.01 |
| | | | AFK-MC² | 6.9 | 1.04 |

*Cost ratio: Final k-means cost compared to standard k-means++

## Documentation

Full documentation available at: https://kmeans-seeding.readthedocs.io

- [Installation Guide](https://kmeans-seeding.readthedocs.io/en/latest/installation.html)
- [API Reference](https://kmeans-seeding.readthedocs.io/en/latest/api.html)
- [Algorithm Details](https://kmeans-seeding.readthedocs.io/en/latest/algorithms.html)
- [Benchmarks](https://kmeans-seeding.readthedocs.io/en/latest/benchmarks.html)

## Requirements

- Python 3.9+
- NumPy >= 1.20.0
- (Optional) FAISS >= 1.7.0 for fast approximate nearest neighbors
- (Optional) scikit-learn for full k-means clustering

## Citation

If you use this library in your research, please cite:

```bibtex
@article{shah2025rejection,
  title={A New Rejection Sampling Approach to k-means++ With Improved Trade-Offs},
  author={Shah, Poojan and Agrawal, Shashwat and Jaiswal, Ragesh},
  journal={arXiv preprint arXiv:2502.02085},
  year={2025}
}
```

For AFK-MC²:
```bibtex
@inproceedings{bachem2016approximate,
  title={Approximate k-means++ in sublinear time},
  author={Bachem, Olivier and Lucic, Mario and Hassani, Hamed and Krause, Andreas},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2016}
}
```

For Fast-LSH k-means++:
```bibtex
@inproceedings{cohen2020fast,
  title={Fast and accurate k-means++ via rejection sampling},
  author={Cohen-Addad, Vincent and Lattanzi, Silvio and Mitrovi{\'c}, Slobodan and Norouzi-Fard, Ashkan and Parotsidis, Nikos and Tarnawski, Jakub},
  booktitle={Advances in Neural Information Processing Systems},
  year={2020}
}
```

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- FAISS library by Facebook AI Research
- scikit-learn for the k-means clustering API design
- Research supported by the Department of Computer Science, IIT Delhi

## Contact

- **Poojan Shah**: cs1221594@cse.iitd.ac.in
- **Issues**: https://github.com/pcshah2004/kmeans-seeding/issues
- **Discussions**: https://github.com/pcshah2004/kmeans-seeding/discussions

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "kmeans-seeding",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Poojan Shah <cs1221594@cse.iitd.ac.in>",
    "keywords": "kmeans, clustering, machine-learning, initialization, seeding, rejection-sampling, mcmc",
    "author": null,
    "author_email": "Poojan Shah <cs1221594@cse.iitd.ac.in>",
    "download_url": "https://files.pythonhosted.org/packages/aa/84/b51092d7673c34d5f6f719e7c289d74e255d041f9064eb8d994ea1584ad4/kmeans_seeding-0.2.1.tar.gz",
    "platform": null,
    "description": "# kmeans-seeding: Fast k-means++ Initialization Algorithms\n\n[![PyPI version](https://img.shields.io/pypi/v/kmeans-seeding.svg)](https://pypi.org/project/kmeans-seeding/)\n[![Python versions](https://img.shields.io/pypi/pyversions/kmeans-seeding.svg)](https://pypi.org/project/kmeans-seeding/)\n[![License](https://img.shields.io/pypi/l/kmeans-seeding.svg)](https://github.com/pcshah2004/kmeans-seeding/blob/main/LICENSE)\n[![Documentation](https://readthedocs.org/projects/kmeans-seeding/badge/?version=latest)](https://kmeans-seeding.readthedocs.io/)\n\n**Fast, state-of-the-art k-means initialization algorithms implemented in C++ with Python bindings.**\n\n## Features\n\n\ud83d\ude80 **Fast**: C++ implementation with OpenMP parallelization\n\ud83c\udfaf **Accurate**: State-of-the-art algorithms with theoretical guarantees\n\ud83d\udd0c **Compatible**: Drop-in replacement for sklearn's k-means++ initialization\n\ud83d\udce6 **Easy to use**: Simple Python API, works with NumPy arrays\n\ud83d\udee0\ufe0f **Flexible**: Multiple algorithms to choose from\n\n## Algorithms Included\n\n1. **RS-k-means++** (Rejection Sampling) - *Our contribution*\n   - Fast approximate D\u00b2 sampling using rejection sampling\n   - Supports FAISS for approximate nearest neighbors\n   - Best for large datasets (n > 10,000)\n\n2. **AFK-MC\u00b2** (Adaptive Fast k-MC\u00b2)\n   - MCMC-based sampling without computing all distances\n   - Good balance of speed and quality\n\n3. **Fast-LSH k-means++** (Google 2020)\n   - Tree embedding with LSH for fast sampling\n   - Excellent for high-dimensional data\n\n4. **Standard k-means++**\n   - Classic D\u00b2 sampling algorithm\n   - Baseline for comparison\n\n## Installation\n\n### From PyPI (recommended)\n\n```bash\npip install kmeans-seeding\n```\n\n### With FAISS support (recommended for large datasets)\n\n```bash\n# CPU version\nconda install -c pytorch faiss-cpu\npip install kmeans-seeding\n\n# GPU version\nconda install -c pytorch faiss-gpu\npip install kmeans-seeding\n```\n\n### From source\n\n```bash\ngit clone https://github.com/pcshah2004/kmeans-seeding.git\ncd kmeans-seeding\npip install -e .\n```\n\n## Quick Start\n\n```python\nfrom kmeans_seeding import rejection_sampling\nfrom sklearn.cluster import KMeans\nimport numpy as np\n\n# Generate sample data\nX = np.random.randn(10000, 50)\n\n# Get initial centers using RS-k-means++\ncenters = rejection_sampling(X, n_clusters=100, index_type='LSH')\n\n# Use with sklearn\nkmeans = KMeans(n_clusters=100, init=centers, n_init=1)\nkmeans.fit(X)\n```\n\n## Usage Examples\n\n### RS-k-means++ (Rejection Sampling)\n\n```python\nfrom kmeans_seeding import rejection_sampling\n\ncenters = rejection_sampling(\n    X,\n    n_clusters=100,\n    max_iter=50,           # Max rejection sampling iterations\n    index_type='LSH',      # FAISS index type: 'Flat', 'LSH', 'IVFFlat', 'HNSW'\n    random_state=42\n)\n```\n\n### AFK-MC\u00b2 (MCMC Sampling)\n\n```python\nfrom kmeans_seeding import afkmc2\n\ncenters = afkmc2(\n    X,\n    n_clusters=100,\n    chain_length=200,      # Markov chain length\n    random_state=42\n)\n```\n\n### Fast-LSH k-means++\n\n```python\nfrom kmeans_seeding import fast_lsh\n\ncenters = fast_lsh(\n    X,\n    n_clusters=100,\n    n_trees=4,             # Number of trees for embedding\n    random_state=42\n)\n```\n\n### Standard k-means++\n\n```python\nfrom kmeans_seeding import kmeanspp\n\ncenters = kmeanspp(X, n_clusters=100, random_state=42)\n```\n\n## Benchmarks\n\nPerformance comparison on various datasets:\n\n| Dataset | n | d | Algorithm | Time (s) | Cost Ratio* |\n|---------|---|---|-----------|----------|-------------|\n| MNIST | 60K | 784 | k-means++ | 45.2 | 1.00 |\n| | | | RS-k-means++ (LSH) | **2.1** | 1.02 |\n| | | | AFK-MC\u00b2 | 8.3 | 1.05 |\n| CIFAR-10 | 50K | 512 | k-means++ | 38.7 | 1.00 |\n| | | | RS-k-means++ (LSH) | **1.8** | 1.01 |\n| | | | AFK-MC\u00b2 | 6.9 | 1.04 |\n\n*Cost ratio: Final k-means cost compared to standard k-means++\n\n## Documentation\n\nFull documentation available at: https://kmeans-seeding.readthedocs.io\n\n- [Installation Guide](https://kmeans-seeding.readthedocs.io/en/latest/installation.html)\n- [API Reference](https://kmeans-seeding.readthedocs.io/en/latest/api.html)\n- [Algorithm Details](https://kmeans-seeding.readthedocs.io/en/latest/algorithms.html)\n- [Benchmarks](https://kmeans-seeding.readthedocs.io/en/latest/benchmarks.html)\n\n## Requirements\n\n- Python 3.9+\n- NumPy >= 1.20.0\n- (Optional) FAISS >= 1.7.0 for fast approximate nearest neighbors\n- (Optional) scikit-learn for full k-means clustering\n\n## Citation\n\nIf you use this library in your research, please cite:\n\n```bibtex\n@article{shah2025rejection,\n  title={A New Rejection Sampling Approach to k-means++ With Improved Trade-Offs},\n  author={Shah, Poojan and Agrawal, Shashwat and Jaiswal, Ragesh},\n  journal={arXiv preprint arXiv:2502.02085},\n  year={2025}\n}\n```\n\nFor AFK-MC\u00b2:\n```bibtex\n@inproceedings{bachem2016approximate,\n  title={Approximate k-means++ in sublinear time},\n  author={Bachem, Olivier and Lucic, Mario and Hassani, Hamed and Krause, Andreas},\n  booktitle={AAAI Conference on Artificial Intelligence},\n  year={2016}\n}\n```\n\nFor Fast-LSH k-means++:\n```bibtex\n@inproceedings{cohen2020fast,\n  title={Fast and accurate k-means++ via rejection sampling},\n  author={Cohen-Addad, Vincent and Lattanzi, Silvio and Mitrovi{\\'c}, Slobodan and Norouzi-Fard, Ashkan and Parotsidis, Nikos and Tarnawski, Jakub},\n  booktitle={Advances in Neural Information Processing Systems},\n  year={2020}\n}\n```\n\n## Contributing\n\nContributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\n- FAISS library by Facebook AI Research\n- scikit-learn for the k-means clustering API design\n- Research supported by the Department of Computer Science, IIT Delhi\n\n## Contact\n\n- **Poojan Shah**: cs1221594@cse.iitd.ac.in\n- **Issues**: https://github.com/pcshah2004/kmeans-seeding/issues\n- **Discussions**: https://github.com/pcshah2004/kmeans-seeding/discussions\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Fast k-means++ seeding algorithms with C++ implementation and Python bindings",
    "version": "0.2.1",
    "project_urls": {
        "Documentation": "https://kmeans-seeding.readthedocs.io",
        "Homepage": "https://github.com/pcshah2004/kmeans-seeding",
        "Issues": "https://github.com/pcshah2004/kmeans-seeding/issues",
        "Repository": "https://github.com/pcshah2004/kmeans-seeding"
    },
    "split_keywords": [
        "kmeans",
        " clustering",
        " machine-learning",
        " initialization",
        " seeding",
        " rejection-sampling",
        " mcmc"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0d014ec1180832c2dc0488bca00f8b1baaaa26aa5d44d8f5857bbad6a061b96e",
                "md5": "4b88ea5bc7386eb7e5fc3ad406bb3eca",
                "sha256": "3b09c989a3bae680b589bccfff83d345c0548734ef27233a7fceeed71ec566fa"
            },
            "downloads": -1,
            "filename": "kmeans_seeding-0.2.1-cp313-cp313-macosx_14_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "4b88ea5bc7386eb7e5fc3ad406bb3eca",
            "packagetype": "bdist_wheel",
            "python_version": "cp313",
            "requires_python": ">=3.9",
            "size": 138166,
            "upload_time": "2025-11-03T10:46:36",
            "upload_time_iso_8601": "2025-11-03T10:46:36.332512Z",
            "url": "https://files.pythonhosted.org/packages/0d/01/4ec1180832c2dc0488bca00f8b1baaaa26aa5d44d8f5857bbad6a061b96e/kmeans_seeding-0.2.1-cp313-cp313-macosx_14_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "aa84b51092d7673c34d5f6f719e7c289d74e255d041f9064eb8d994ea1584ad4",
                "md5": "b8e42b42888a2aadedd485db2621a18b",
                "sha256": "8c96a855104e51afceb3f8cf97e603ac05f577c3bf9c4923dd2fb00d51499438"
            },
            "downloads": -1,
            "filename": "kmeans_seeding-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "b8e42b42888a2aadedd485db2621a18b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 44775,
            "upload_time": "2025-11-03T10:46:37",
            "upload_time_iso_8601": "2025-11-03T10:46:37.805146Z",
            "url": "https://files.pythonhosted.org/packages/aa/84/b51092d7673c34d5f6f719e7c289d74e255d041f9064eb8d994ea1584ad4/kmeans_seeding-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-11-03 10:46:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "pcshah2004",
    "github_project": "kmeans-seeding",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "kmeans-seeding"
}
        
Elapsed time: 1.45670s