# kmeans-seeding: Fast k-means++ Initialization Algorithms
[](https://pypi.org/project/kmeans-seeding/)
[](https://pypi.org/project/kmeans-seeding/)
[](https://github.com/pcshah2004/kmeans-seeding/blob/main/LICENSE)
[](https://kmeans-seeding.readthedocs.io/)
**Fast, state-of-the-art k-means initialization algorithms implemented in C++ with Python bindings.**
## Features
🚀 **Fast**: C++ implementation with OpenMP parallelization
🎯 **Accurate**: State-of-the-art algorithms with theoretical guarantees
🔌 **Compatible**: Drop-in replacement for sklearn's k-means++ initialization
📦 **Easy to use**: Simple Python API, works with NumPy arrays
🛠️ **Flexible**: Multiple algorithms to choose from
## Algorithms Included
1. **RS-k-means++** (Rejection Sampling) - *Our contribution*
- Fast approximate D² sampling using rejection sampling
- Supports FAISS for approximate nearest neighbors
- Best for large datasets (n > 10,000)
2. **AFK-MC²** (Adaptive Fast k-MC²)
- MCMC-based sampling without computing all distances
- Good balance of speed and quality
3. **Fast-LSH k-means++** (Google 2020)
- Tree embedding with LSH for fast sampling
- Excellent for high-dimensional data
4. **Standard k-means++**
- Classic D² sampling algorithm
- Baseline for comparison
## Installation
### From PyPI (recommended)
```bash
pip install kmeans-seeding
```
### With FAISS support (recommended for large datasets)
```bash
# CPU version
conda install -c pytorch faiss-cpu
pip install kmeans-seeding
# GPU version
conda install -c pytorch faiss-gpu
pip install kmeans-seeding
```
### From source
```bash
git clone https://github.com/pcshah2004/kmeans-seeding.git
cd kmeans-seeding
pip install -e .
```
## Quick Start
```python
from kmeans_seeding import rejection_sampling
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data
X = np.random.randn(10000, 50)
# Get initial centers using RS-k-means++
centers = rejection_sampling(X, n_clusters=100, index_type='LSH')
# Use with sklearn
kmeans = KMeans(n_clusters=100, init=centers, n_init=1)
kmeans.fit(X)
```
## Usage Examples
### RS-k-means++ (Rejection Sampling)
```python
from kmeans_seeding import rejection_sampling
centers = rejection_sampling(
X,
n_clusters=100,
max_iter=50, # Max rejection sampling iterations
index_type='LSH', # FAISS index type: 'Flat', 'LSH', 'IVFFlat', 'HNSW'
random_state=42
)
```
### AFK-MC² (MCMC Sampling)
```python
from kmeans_seeding import afkmc2
centers = afkmc2(
X,
n_clusters=100,
chain_length=200, # Markov chain length
random_state=42
)
```
### Fast-LSH k-means++
```python
from kmeans_seeding import fast_lsh
centers = fast_lsh(
X,
n_clusters=100,
n_trees=4, # Number of trees for embedding
random_state=42
)
```
### Standard k-means++
```python
from kmeans_seeding import kmeanspp
centers = kmeanspp(X, n_clusters=100, random_state=42)
```
## Benchmarks
Performance comparison on various datasets:
| Dataset | n | d | Algorithm | Time (s) | Cost Ratio* |
|---------|---|---|-----------|----------|-------------|
| MNIST | 60K | 784 | k-means++ | 45.2 | 1.00 |
| | | | RS-k-means++ (LSH) | **2.1** | 1.02 |
| | | | AFK-MC² | 8.3 | 1.05 |
| CIFAR-10 | 50K | 512 | k-means++ | 38.7 | 1.00 |
| | | | RS-k-means++ (LSH) | **1.8** | 1.01 |
| | | | AFK-MC² | 6.9 | 1.04 |
*Cost ratio: Final k-means cost compared to standard k-means++
## Documentation
Full documentation available at: https://kmeans-seeding.readthedocs.io
- [Installation Guide](https://kmeans-seeding.readthedocs.io/en/latest/installation.html)
- [API Reference](https://kmeans-seeding.readthedocs.io/en/latest/api.html)
- [Algorithm Details](https://kmeans-seeding.readthedocs.io/en/latest/algorithms.html)
- [Benchmarks](https://kmeans-seeding.readthedocs.io/en/latest/benchmarks.html)
## Requirements
- Python 3.9+
- NumPy >= 1.20.0
- (Optional) FAISS >= 1.7.0 for fast approximate nearest neighbors
- (Optional) scikit-learn for full k-means clustering
## Citation
If you use this library in your research, please cite:
```bibtex
@article{shah2025rejection,
title={A New Rejection Sampling Approach to k-means++ With Improved Trade-Offs},
author={Shah, Poojan and Agrawal, Shashwat and Jaiswal, Ragesh},
journal={arXiv preprint arXiv:2502.02085},
year={2025}
}
```
For AFK-MC²:
```bibtex
@inproceedings{bachem2016approximate,
title={Approximate k-means++ in sublinear time},
author={Bachem, Olivier and Lucic, Mario and Hassani, Hamed and Krause, Andreas},
booktitle={AAAI Conference on Artificial Intelligence},
year={2016}
}
```
For Fast-LSH k-means++:
```bibtex
@inproceedings{cohen2020fast,
title={Fast and accurate k-means++ via rejection sampling},
author={Cohen-Addad, Vincent and Lattanzi, Silvio and Mitrovi{\'c}, Slobodan and Norouzi-Fard, Ashkan and Parotsidis, Nikos and Tarnawski, Jakub},
booktitle={Advances in Neural Information Processing Systems},
year={2020}
}
```
## Contributing
Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- FAISS library by Facebook AI Research
- scikit-learn for the k-means clustering API design
- Research supported by the Department of Computer Science, IIT Delhi
## Contact
- **Poojan Shah**: cs1221594@cse.iitd.ac.in
- **Issues**: https://github.com/pcshah2004/kmeans-seeding/issues
- **Discussions**: https://github.com/pcshah2004/kmeans-seeding/discussions
Raw data
{
"_id": null,
"home_page": null,
"name": "kmeans-seeding",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Poojan Shah <cs1221594@cse.iitd.ac.in>",
"keywords": "kmeans, clustering, machine-learning, initialization, seeding, rejection-sampling, mcmc",
"author": null,
"author_email": "Poojan Shah <cs1221594@cse.iitd.ac.in>",
"download_url": "https://files.pythonhosted.org/packages/aa/84/b51092d7673c34d5f6f719e7c289d74e255d041f9064eb8d994ea1584ad4/kmeans_seeding-0.2.1.tar.gz",
"platform": null,
"description": "# kmeans-seeding: Fast k-means++ Initialization Algorithms\n\n[](https://pypi.org/project/kmeans-seeding/)\n[](https://pypi.org/project/kmeans-seeding/)\n[](https://github.com/pcshah2004/kmeans-seeding/blob/main/LICENSE)\n[](https://kmeans-seeding.readthedocs.io/)\n\n**Fast, state-of-the-art k-means initialization algorithms implemented in C++ with Python bindings.**\n\n## Features\n\n\ud83d\ude80 **Fast**: C++ implementation with OpenMP parallelization\n\ud83c\udfaf **Accurate**: State-of-the-art algorithms with theoretical guarantees\n\ud83d\udd0c **Compatible**: Drop-in replacement for sklearn's k-means++ initialization\n\ud83d\udce6 **Easy to use**: Simple Python API, works with NumPy arrays\n\ud83d\udee0\ufe0f **Flexible**: Multiple algorithms to choose from\n\n## Algorithms Included\n\n1. **RS-k-means++** (Rejection Sampling) - *Our contribution*\n - Fast approximate D\u00b2 sampling using rejection sampling\n - Supports FAISS for approximate nearest neighbors\n - Best for large datasets (n > 10,000)\n\n2. **AFK-MC\u00b2** (Adaptive Fast k-MC\u00b2)\n - MCMC-based sampling without computing all distances\n - Good balance of speed and quality\n\n3. **Fast-LSH k-means++** (Google 2020)\n - Tree embedding with LSH for fast sampling\n - Excellent for high-dimensional data\n\n4. **Standard k-means++**\n - Classic D\u00b2 sampling algorithm\n - Baseline for comparison\n\n## Installation\n\n### From PyPI (recommended)\n\n```bash\npip install kmeans-seeding\n```\n\n### With FAISS support (recommended for large datasets)\n\n```bash\n# CPU version\nconda install -c pytorch faiss-cpu\npip install kmeans-seeding\n\n# GPU version\nconda install -c pytorch faiss-gpu\npip install kmeans-seeding\n```\n\n### From source\n\n```bash\ngit clone https://github.com/pcshah2004/kmeans-seeding.git\ncd kmeans-seeding\npip install -e .\n```\n\n## Quick Start\n\n```python\nfrom kmeans_seeding import rejection_sampling\nfrom sklearn.cluster import KMeans\nimport numpy as np\n\n# Generate sample data\nX = np.random.randn(10000, 50)\n\n# Get initial centers using RS-k-means++\ncenters = rejection_sampling(X, n_clusters=100, index_type='LSH')\n\n# Use with sklearn\nkmeans = KMeans(n_clusters=100, init=centers, n_init=1)\nkmeans.fit(X)\n```\n\n## Usage Examples\n\n### RS-k-means++ (Rejection Sampling)\n\n```python\nfrom kmeans_seeding import rejection_sampling\n\ncenters = rejection_sampling(\n X,\n n_clusters=100,\n max_iter=50, # Max rejection sampling iterations\n index_type='LSH', # FAISS index type: 'Flat', 'LSH', 'IVFFlat', 'HNSW'\n random_state=42\n)\n```\n\n### AFK-MC\u00b2 (MCMC Sampling)\n\n```python\nfrom kmeans_seeding import afkmc2\n\ncenters = afkmc2(\n X,\n n_clusters=100,\n chain_length=200, # Markov chain length\n random_state=42\n)\n```\n\n### Fast-LSH k-means++\n\n```python\nfrom kmeans_seeding import fast_lsh\n\ncenters = fast_lsh(\n X,\n n_clusters=100,\n n_trees=4, # Number of trees for embedding\n random_state=42\n)\n```\n\n### Standard k-means++\n\n```python\nfrom kmeans_seeding import kmeanspp\n\ncenters = kmeanspp(X, n_clusters=100, random_state=42)\n```\n\n## Benchmarks\n\nPerformance comparison on various datasets:\n\n| Dataset | n | d | Algorithm | Time (s) | Cost Ratio* |\n|---------|---|---|-----------|----------|-------------|\n| MNIST | 60K | 784 | k-means++ | 45.2 | 1.00 |\n| | | | RS-k-means++ (LSH) | **2.1** | 1.02 |\n| | | | AFK-MC\u00b2 | 8.3 | 1.05 |\n| CIFAR-10 | 50K | 512 | k-means++ | 38.7 | 1.00 |\n| | | | RS-k-means++ (LSH) | **1.8** | 1.01 |\n| | | | AFK-MC\u00b2 | 6.9 | 1.04 |\n\n*Cost ratio: Final k-means cost compared to standard k-means++\n\n## Documentation\n\nFull documentation available at: https://kmeans-seeding.readthedocs.io\n\n- [Installation Guide](https://kmeans-seeding.readthedocs.io/en/latest/installation.html)\n- [API Reference](https://kmeans-seeding.readthedocs.io/en/latest/api.html)\n- [Algorithm Details](https://kmeans-seeding.readthedocs.io/en/latest/algorithms.html)\n- [Benchmarks](https://kmeans-seeding.readthedocs.io/en/latest/benchmarks.html)\n\n## Requirements\n\n- Python 3.9+\n- NumPy >= 1.20.0\n- (Optional) FAISS >= 1.7.0 for fast approximate nearest neighbors\n- (Optional) scikit-learn for full k-means clustering\n\n## Citation\n\nIf you use this library in your research, please cite:\n\n```bibtex\n@article{shah2025rejection,\n title={A New Rejection Sampling Approach to k-means++ With Improved Trade-Offs},\n author={Shah, Poojan and Agrawal, Shashwat and Jaiswal, Ragesh},\n journal={arXiv preprint arXiv:2502.02085},\n year={2025}\n}\n```\n\nFor AFK-MC\u00b2:\n```bibtex\n@inproceedings{bachem2016approximate,\n title={Approximate k-means++ in sublinear time},\n author={Bachem, Olivier and Lucic, Mario and Hassani, Hamed and Krause, Andreas},\n booktitle={AAAI Conference on Artificial Intelligence},\n year={2016}\n}\n```\n\nFor Fast-LSH k-means++:\n```bibtex\n@inproceedings{cohen2020fast,\n title={Fast and accurate k-means++ via rejection sampling},\n author={Cohen-Addad, Vincent and Lattanzi, Silvio and Mitrovi{\\'c}, Slobodan and Norouzi-Fard, Ashkan and Parotsidis, Nikos and Tarnawski, Jakub},\n booktitle={Advances in Neural Information Processing Systems},\n year={2020}\n}\n```\n\n## Contributing\n\nContributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\n- FAISS library by Facebook AI Research\n- scikit-learn for the k-means clustering API design\n- Research supported by the Department of Computer Science, IIT Delhi\n\n## Contact\n\n- **Poojan Shah**: cs1221594@cse.iitd.ac.in\n- **Issues**: https://github.com/pcshah2004/kmeans-seeding/issues\n- **Discussions**: https://github.com/pcshah2004/kmeans-seeding/discussions\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Fast k-means++ seeding algorithms with C++ implementation and Python bindings",
"version": "0.2.1",
"project_urls": {
"Documentation": "https://kmeans-seeding.readthedocs.io",
"Homepage": "https://github.com/pcshah2004/kmeans-seeding",
"Issues": "https://github.com/pcshah2004/kmeans-seeding/issues",
"Repository": "https://github.com/pcshah2004/kmeans-seeding"
},
"split_keywords": [
"kmeans",
" clustering",
" machine-learning",
" initialization",
" seeding",
" rejection-sampling",
" mcmc"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "0d014ec1180832c2dc0488bca00f8b1baaaa26aa5d44d8f5857bbad6a061b96e",
"md5": "4b88ea5bc7386eb7e5fc3ad406bb3eca",
"sha256": "3b09c989a3bae680b589bccfff83d345c0548734ef27233a7fceeed71ec566fa"
},
"downloads": -1,
"filename": "kmeans_seeding-0.2.1-cp313-cp313-macosx_14_0_arm64.whl",
"has_sig": false,
"md5_digest": "4b88ea5bc7386eb7e5fc3ad406bb3eca",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.9",
"size": 138166,
"upload_time": "2025-11-03T10:46:36",
"upload_time_iso_8601": "2025-11-03T10:46:36.332512Z",
"url": "https://files.pythonhosted.org/packages/0d/01/4ec1180832c2dc0488bca00f8b1baaaa26aa5d44d8f5857bbad6a061b96e/kmeans_seeding-0.2.1-cp313-cp313-macosx_14_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "aa84b51092d7673c34d5f6f719e7c289d74e255d041f9064eb8d994ea1584ad4",
"md5": "b8e42b42888a2aadedd485db2621a18b",
"sha256": "8c96a855104e51afceb3f8cf97e603ac05f577c3bf9c4923dd2fb00d51499438"
},
"downloads": -1,
"filename": "kmeans_seeding-0.2.1.tar.gz",
"has_sig": false,
"md5_digest": "b8e42b42888a2aadedd485db2621a18b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 44775,
"upload_time": "2025-11-03T10:46:37",
"upload_time_iso_8601": "2025-11-03T10:46:37.805146Z",
"url": "https://files.pythonhosted.org/packages/aa/84/b51092d7673c34d5f6f719e7c289d74e255d041f9064eb8d994ea1584ad4/kmeans_seeding-0.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-11-03 10:46:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pcshah2004",
"github_project": "kmeans-seeding",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "kmeans-seeding"
}