# ClusterTK
[PyPI](https://pypi.org/project/clustertk/) · [Python 3.8+](https://www.python.org/downloads/) · [Tests](https://github.com/alexeiveselov92/clustertk/actions/workflows/tests.yml) · [Coverage](https://codecov.io/gh/alexeiveselov92/clustertk) · [MIT License](https://opensource.org/licenses/MIT)
**A comprehensive Python toolkit for cluster analysis with full pipeline support.**
ClusterTK provides a complete, sklearn-style pipeline for clustering: from raw data preprocessing to cluster interpretation and export. Perfect for data analysts who want powerful clustering without writing hundreds of lines of code.
## Features
- 🔄 **Complete Pipeline** - One-line solution from raw data to insights
- 📊 **Multiple Algorithms** - K-Means, GMM, Hierarchical, DBSCAN, HDBSCAN
- 🎯 **Auto-Optimization** - Automatic optimal cluster number selection
- 🎨 **Rich Visualization** - Beautiful plots (optional dependency)
- 📁 **Export & Reports** - CSV, JSON, HTML reports with embedded plots
- 💾 **Save/Load** - Persist and reload fitted pipelines
- 🔍 **Interpretation** - Profiling, naming, and feature importance analysis
## Quick Start
### Installation
```bash
# Core functionality
pip install clustertk

# With visualization
pip install clustertk[viz]
```
### Basic Usage
```python
import pandas as pd
from clustertk import ClusterAnalysisPipeline

# Load data
df = pd.read_csv('your_data.csv')

# Create and fit pipeline
pipeline = ClusterAnalysisPipeline(
    handle_missing='median',
    correlation_threshold=0.85,
    n_clusters=None,  # Auto-detect optimal number
    verbose=True
)

pipeline.fit(df, feature_columns=['feature1', 'feature2', 'feature3'])

# Get results
labels = pipeline.labels_
profiles = pipeline.cluster_profiles_
metrics = pipeline.metrics_

print(f"Found {pipeline.n_clusters_} clusters")
print(f"Silhouette score: {metrics['silhouette']:.3f}")

# Export
pipeline.export_results('results.csv')
pipeline.export_report('report.html')

# Visualize (requires clustertk[viz])
pipeline.plot_clusters_2d()
pipeline.plot_cluster_heatmap()
```
## Documentation
- **[Installation Guide](docs/installation.md)** - Detailed installation instructions
- **[Quick Start](docs/quickstart.md)** - Get started in 5 minutes
- **[User Guide](docs/user_guide/README.md)** - Complete component documentation
  - [Preprocessing](docs/user_guide/preprocessing.md)
  - [Feature Selection](docs/user_guide/feature_selection.md)
  - [Clustering](docs/user_guide/clustering.md)
  - [Evaluation](docs/user_guide/evaluation.md)
  - [Interpretation](docs/user_guide/interpretation.md) - Profiles, naming, feature importance
  - [Visualization](docs/user_guide/visualization.md)
  - [Export](docs/user_guide/export.md)
- **[Examples](docs/examples.md)** - Real-world use cases
- **[FAQ](docs/faq.md)** - Common questions
## Pipeline Workflow
```
Raw Data → Preprocessing → Feature Selection → Dimensionality Reduction
→ Clustering → Evaluation → Interpretation → Export
```
Each step is configurable through pipeline parameters or can be run independently.
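For readers familiar with scikit-learn, the workflow above can be sketched as a plain sklearn `Pipeline` (a rough analogue for orientation only; ClusterTK's internals and defaults may differ):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[0, 0] = np.nan  # a missing value for the imputer to handle

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # Preprocessing
    ("scale", RobustScaler()),                     # Scaling
    ("reduce", PCA(n_components=2)),               # Dimensionality reduction
    ("cluster", KMeans(n_clusters=3, n_init=10, random_state=0)),  # Clustering
])
labels = pipe.fit_predict(X)
print(sorted(set(labels)))  # three cluster labels
```

The evaluation, interpretation, and export steps have no one-line sklearn equivalent; that is the gap ClusterTK fills.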
## Key Capabilities
### Preprocessing
- Missing value handling (median/mean/drop)
- Outlier detection and treatment
- Automatic scaling (robust/standard/minmax)
- Skewness transformation
### Clustering Algorithms
- **K-Means** - Fast, spherical clusters
- **GMM** - Probabilistic, elliptical clusters
- **Hierarchical** - Dendrograms, hierarchical structure
- **DBSCAN** - Density-based, arbitrary shapes
- **HDBSCAN** - Advanced density-based, varying densities (v0.8.0+)
### Evaluation & Interpretation
- Silhouette score, Calinski-Harabasz, Davies-Bouldin metrics
- Automatic optimal k selection
- Cluster profiling and automatic naming
- **Feature importance analysis** (v0.9.0+)
- Permutation importance
- Feature contribution (variance ratio)
- SHAP values (optional)
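Automatic k selection is typically a scan: fit each candidate k, score it, keep the best. A minimal sketch of the idea with scikit-learn's metrics (ClusterTK's exact selection procedure may combine metrics differently):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score, calinski_harabasz_score, davies_bouldin_score
)

# Toy data with a known number of groups
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=0)

# Score each candidate k by silhouette (higher is better)
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # recovers the planted structure on this toy dataset

labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print(calinski_harabasz_score(X, labels))  # higher is better
print(davies_bouldin_score(X, labels))     # lower is better
```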
### Export & Reports
- CSV export (data + labels)
- JSON export (metadata + profiles)
- HTML reports with embedded visualizations
- Pipeline serialization (save/load)
## Examples
### Feature Importance Analysis
```python
# Understand which features drive your clustering
results = pipeline.analyze_feature_importance(method='all')

# View permutation importance
print(results['permutation'].head())

# View feature contribution (variance ratio)
print(results['contribution'].head())

# Use top features for focused analysis
top_features = results['permutation'].head(5)['feature'].tolist()
```
### Algorithm Comparison
```python
# Compare multiple algorithms automatically
results = pipeline.compare_algorithms(
    X=df,
    feature_columns=['feature1', 'feature2', 'feature3'],
    algorithms=['kmeans', 'gmm', 'hierarchical', 'dbscan'],
    n_clusters_range=(2, 8)
)

print(results['comparison'])  # DataFrame with metrics
print(f"Best algorithm: {results['best_algorithm']}")

# Visualize comparison
pipeline.plot_algorithm_comparison(results)
```
### Customer Segmentation
```python
pipeline = ClusterAnalysisPipeline(
    n_clusters=None,  # Auto-detect
    auto_name_clusters=True
)

pipeline.fit(customers_df,
             feature_columns=['age', 'income', 'purchases'],
             category_mapping={
                 'demographics': ['age', 'income'],
                 'behavior': ['purchases']
             })

pipeline.export_report('customer_segments.html')
```
### Anomaly Detection
```python
pipeline = ClusterAnalysisPipeline(
    clustering_algorithm='dbscan'
)

pipeline.fit(transactions_df)
anomalies = transactions_df[pipeline.labels_ == -1]
```
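The `-1` filter above works because DBSCAN assigns the label `-1` to points that fall outside every dense region. A self-contained illustration with scikit-learn:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(200, 2))  # one dense cluster
outliers = np.array([[5.0, 5.0], [-6.0, 4.0], [7.0, -7.0]])
X = np.vstack([dense, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(int((labels == -1).sum()))  # the isolated points get label -1
```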
More examples: [docs/examples.md](docs/examples.md)
## Requirements
- Python 3.8+
- numpy >= 1.20.0
- pandas >= 1.3.0
- scikit-learn >= 1.0.0
- scipy >= 1.7.0
- joblib >= 1.0.0
Optional (for visualization):
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
## Contributing
Contributions are welcome! Please check:
- [GitHub Issues](https://github.com/alexeiveselov92/clustertk/issues) - Report bugs
- [GitHub Discussions](https://github.com/alexeiveselov92/clustertk/discussions) - Questions
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Citation
If you use ClusterTK in your research, please cite:
```bibtex
@software{clustertk2024,
  author = {Veselov, Aleksey},
  title = {ClusterTK: A Comprehensive Python Toolkit for Cluster Analysis},
  year = {2024},
  url = {https://github.com/alexeiveselov92/clustertk}
}
```
## Links
- **PyPI**: https://pypi.org/project/clustertk/
- **GitHub**: https://github.com/alexeiveselov92/clustertk
- **Documentation**: [docs/](docs/)
- **Author**: Aleksey Veselov (alexei.veselov92@gmail.com)
---
Made with ❤️ for the data science community