clustertk


- **Name**: clustertk
- **Version**: 0.15.0
- **Home page**: https://github.com/alexeiveselov92/clustertk
- **Summary**: A comprehensive toolkit for cluster analysis with full pipeline support
- **Upload time**: 2025-10-30 19:58:11
- **Author**: Aleksey Veselov
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: clustering, machine-learning, data-analysis, pipeline, kmeans, pca, data-science
- **Requirements**: numpy, pandas, scikit-learn, scipy

# ClusterTK

[![PyPI version](https://badge.fury.io/py/clustertk.svg)](https://pypi.org/project/clustertk/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://github.com/alexeiveselov92/clustertk/workflows/Tests/badge.svg)](https://github.com/alexeiveselov92/clustertk/actions/workflows/tests.yml)
[![codecov](https://codecov.io/gh/alexeiveselov92/clustertk/branch/main/graph/badge.svg)](https://codecov.io/gh/alexeiveselov92/clustertk)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**A comprehensive Python toolkit for cluster analysis with full pipeline support.**

ClusterTK provides a complete, sklearn-style pipeline for clustering, from raw data preprocessing through cluster interpretation and export. It is aimed at data analysts who want full-featured clustering without writing hundreds of lines of code.

## Features

- 🔄 **Complete Pipeline** - One-line solution from raw data to insights
- 📊 **Multiple Algorithms** - K-Means, GMM, Hierarchical, DBSCAN, HDBSCAN
- 🎯 **Auto-Optimization** - Automatic optimal cluster number selection
- 🎨 **Rich Visualization** - Beautiful plots (optional dependency)
- 📁 **Export & Reports** - CSV, JSON, HTML reports with embedded plots
- 💾 **Save/Load** - Persist and reload fitted pipelines
- 🔍 **Interpretation** - Profiling, naming, and feature importance analysis

## Quick Start

### Installation

```bash
# Core functionality
pip install clustertk

# With visualization
pip install clustertk[viz]
```

### Basic Usage

```python
import pandas as pd
from clustertk import ClusterAnalysisPipeline

# Load data
df = pd.read_csv('your_data.csv')

# Create and fit pipeline
pipeline = ClusterAnalysisPipeline(
    handle_missing='median',
    correlation_threshold=0.85,
    n_clusters=None,  # Auto-detect optimal number
    verbose=True
)

pipeline.fit(df, feature_columns=['feature1', 'feature2', 'feature3'])

# Get results
labels = pipeline.labels_
profiles = pipeline.cluster_profiles_
metrics = pipeline.metrics_

print(f"Found {pipeline.n_clusters_} clusters")
print(f"Silhouette score: {metrics['silhouette']:.3f}")

# Export
pipeline.export_results('results.csv')
pipeline.export_report('report.html')

# Visualize (requires clustertk[viz])
pipeline.plot_clusters_2d()
pipeline.plot_cluster_heatmap()
```
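
The `correlation_threshold=0.85` argument suggests a correlation-based feature filter. The exact rule is not spelled out here, so the following is only a sketch of one common approach, written with pandas rather than ClusterTK: drop any column whose absolute Pearson correlation with an earlier column exceeds the threshold. The helper name `drop_correlated` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Drop each column that is highly correlated with an earlier column."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (assumption): "a_copy" is almost a linear copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

kept = drop_correlated(df, threshold=0.85)
print(kept.columns.tolist())
```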

## Documentation

- **[Installation Guide](docs/installation.md)** - Detailed installation instructions
- **[Quick Start](docs/quickstart.md)** - Get started in 5 minutes
- **[User Guide](docs/user_guide/README.md)** - Complete component documentation
  - [Preprocessing](docs/user_guide/preprocessing.md)
  - [Feature Selection](docs/user_guide/feature_selection.md)
  - [Clustering](docs/user_guide/clustering.md)
  - [Evaluation](docs/user_guide/evaluation.md)
  - [Interpretation](docs/user_guide/interpretation.md) - Profiles, naming, feature importance
  - [Visualization](docs/user_guide/visualization.md)
  - [Export](docs/user_guide/export.md)
- **[Examples](docs/examples.md)** - Real-world use cases
- **[FAQ](docs/faq.md)** - Common questions

## Pipeline Workflow

```
Raw Data → Preprocessing → Feature Selection → Dimensionality Reduction
→ Clustering → Evaluation → Interpretation → Export
```

Each step is configurable through pipeline parameters or can be run independently.
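
The stages above map naturally onto standard scikit-learn components. The following is a minimal sketch of that flow built directly on scikit-learn, not ClusterTK's internal implementation. The synthetic data, the choice of `RobustScaler`, two PCA components, and three clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data (assumption): three well-separated blobs in 4 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [8, 8, 8, 8], [-8, 8, -8, 8]], dtype=float)
X = np.vstack([c + rng.normal(size=(50, 4)) for c in centers])
X[::25, 0] = np.nan  # inject a few missing values

# Preprocessing -> dimensionality reduction, chained as one transformer.
prepare = make_pipeline(
    SimpleImputer(strategy="median"),
    RobustScaler(),
    PCA(n_components=2, random_state=0),
)
X_reduced = prepare.fit_transform(X)

# Clustering -> evaluation.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X_reduced)
score = silhouette_score(X_reduced, labels)
print(f"clusters: {len(set(labels))}, silhouette: {score:.3f}")
```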

## Key Capabilities

### Preprocessing
- Missing value handling (median/mean/drop)
- Outlier detection and treatment
- Automatic scaling (robust/standard/minmax)
- Skewness transformation
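
As a rough illustration of what these steps involve, here is a sketch using pandas and scikit-learn directly. It is not ClusterTK's code: the toy columns, the 1.5 × IQR clipping rule, and the choice of robust scaling are assumptions, and the skewness transformation is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy data (assumption): one missing value and one extreme outlier.
df = pd.DataFrame({
    "income": [30.0, 32.0, 35.0, np.nan, 31.0, 500.0],
    "age": [25.0, 40.0, 31.0, 58.0, 46.0, 37.0],
})

# 1. Missing values: fill with the column median.
filled = df.fillna(df.median())

# 2. Outliers: clip each column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
clipped = filled.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# 3. Scaling: center on the median and divide by the IQR.
scaled = pd.DataFrame(
    RobustScaler().fit_transform(clipped), columns=df.columns
)
print(scaled.round(2))
```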

### Clustering Algorithms
- **K-Means** - Fast, spherical clusters
- **GMM** - Probabilistic, elliptical clusters
- **Hierarchical** - Dendrograms, hierarchical structure
- **DBSCAN** - Density-based, arbitrary shapes
- **HDBSCAN** - Advanced density-based, varying densities (v0.8.0+)
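
The difference between "spherical clusters" and "arbitrary shapes" is easiest to see on data that is not blob-shaped. The sketch below fits scikit-learn's own estimators (not ClusterTK wrappers) to a two-moons dataset and compares each result to the known grouping with the adjusted Rand index. The dataset, `eps=0.2`, and `min_samples=5` are assumptions chosen for this toy case.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Two interleaved half-circles: clearly two groups, but not spherical ones.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "gmm": GaussianMixture(n_components=2, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=2),
    "dbscan": DBSCAN(eps=0.2, min_samples=5),
}

# Adjusted Rand index: 1.0 means the clustering matches the true grouping.
ari = {
    name: adjusted_rand_score(y_true, model.fit_predict(X))
    for name, model in models.items()
}
for name, value in ari.items():
    print(f"{name:>12}: ARI = {value:.2f}")
```

On data like this, a density-based method can follow the curved shapes, while centroid-based methods tend to cut straight across them.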

### Evaluation & Interpretation
- Silhouette score, Calinski-Harabasz, Davies-Bouldin metrics
- Automatic optimal k selection
- Cluster profiling and automatic naming
- **Feature importance analysis** (v0.9.0+)
  - Permutation importance
  - Feature contribution (variance ratio)
  - SHAP values (optional)
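
To make the metrics and the idea of automatic k selection concrete, here is a sketch that scans a range of k values with scikit-learn and picks the one with the highest silhouette score. This is one common selection rule; whether ClusterTK combines the metrics in the same way is not stated here. The blob centers and the k range are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Toy data (assumption): four tight blobs at the corners of a square.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),  # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }

best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(f"best k by silhouette: {best_k}")
```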

### Export & Reports
- CSV export (data + labels)
- JSON export (metadata + profiles)
- HTML reports with embedded visualizations
- Pipeline serialization (save/load)
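
For readers curious what these outputs amount to, the sketch below produces rough equivalents with pandas and joblib on a plain scikit-learn model. It does not use ClusterTK's export functions. The file names, the toy data, and the use of per-cluster means as a "profile" are assumptions, and HTML reporting is not shown.

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (assumption): two obvious groups.
df = pd.DataFrame({"x": [0.0, 0.1, 5.0, 5.1], "y": [0.0, 0.2, 5.0, 4.9]})
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df)

out_dir = Path(tempfile.mkdtemp())

# CSV: original rows plus a cluster label column.
df.assign(cluster=model.labels_).to_csv(out_dir / "results.csv", index=False)

# JSON: per-cluster feature means as a simple profile.
profiles = df.groupby(model.labels_).mean()
profiles.to_json(out_dir / "profiles.json", orient="index")

# Serialization: persist the fitted model and load it back.
joblib.dump(model, out_dir / "model.joblib")
restored = joblib.load(out_dir / "model.joblib")

print(json.loads((out_dir / "profiles.json").read_text()))
print((restored.predict(df) == model.labels_).all())
```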

## Examples

### Feature Importance Analysis

```python
# Understand which features drive your clustering
results = pipeline.analyze_feature_importance(method='all')

# View permutation importance
print(results['permutation'].head())

# View feature contribution (variance ratio)
print(results['contribution'].head())

# Use top features for focused analysis
top_features = results['permutation'].head(5)['feature'].tolist()
```
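
One common way to compute permutation importance for an unsupervised result is to train a surrogate classifier to predict the cluster labels, then measure how much its accuracy drops when each feature is shuffled. The sketch below shows that idea with scikit-learn. Whether ClusterTK uses the same surrogate approach is not stated here, so treat it as an illustration; the column names and the random forest are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 200
# Assumption: "signal" separates two groups, "noise" carries no structure.
signal = np.concatenate([rng.normal(0, 1, n), rng.normal(10, 1, n)])
noise = rng.normal(0, 1, 2 * n)
X = pd.DataFrame({"signal": signal, "noise": noise})

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Surrogate model: learn to predict the cluster label from the features.
surrogate = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
result = permutation_importance(surrogate, X, labels, n_repeats=5, random_state=0)

importance = (
    pd.DataFrame({"feature": X.columns, "importance": result.importances_mean})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```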

### Algorithm Comparison

```python
# Compare multiple algorithms automatically
results = pipeline.compare_algorithms(
    X=df,
    feature_columns=['feature1', 'feature2', 'feature3'],
    algorithms=['kmeans', 'gmm', 'hierarchical', 'dbscan'],
    n_clusters_range=(2, 8)
)

print(results['comparison'])  # DataFrame with metrics
print(f"Best algorithm: {results['best_algorithm']}")

# Visualize comparison
pipeline.plot_algorithm_comparison(results)
```

### Customer Segmentation

```python
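from clustertk import ClusterAnalysisPipeline

# customers_df: a pandas DataFrame with 'age', 'income', and 'purchases' columns
pipeline = ClusterAnalysisPipeline(
    n_clusters=None,  # Auto-detect
    auto_name_clusters=True
)

# category_mapping groups the features into named categories
pipeline.fit(
    customers_df,
    feature_columns=['age', 'income', 'purchases'],
    category_mapping={
        'demographics': ['age', 'income'],
        'behavior': ['purchases']
    }
)

pipeline.export_report('customer_segments.html')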
```

### Anomaly Detection

```python
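from clustertk import ClusterAnalysisPipeline

# transactions_df: a pandas DataFrame of numeric transaction features
pipeline = ClusterAnalysisPipeline(
    clustering_algorithm='dbscan'
)

pipeline.fit(transactions_df)

# Rows labeled -1 were not assigned to any cluster; treat them as anomalies
anomalies = transactions_df[pipeline.labels_ == -1]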
```
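
The `-1` label follows the DBSCAN convention of marking points that belong to no dense region as noise. The sketch below reproduces the idea with scikit-learn's `DBSCAN` on synthetic data. The column names, the two injected outliers, and the `eps` and `min_samples` values are assumptions chosen for this toy example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Assumption: 100 ordinary transactions plus two obvious outliers.
normal = rng.normal(loc=[50.0, 10.0], scale=[5.0, 1.0], size=(100, 2))
outliers = np.array([[500.0, 10.0], [50.0, 100.0]])
transactions = pd.DataFrame(
    np.vstack([normal, outliers]), columns=["amount", "items"]
)

# Scale first so that eps means the same thing in every dimension.
X = StandardScaler().fit_transform(transactions)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Noise points (label -1) are the anomaly candidates.
anomalies = transactions[labels == -1]
print(anomalies)
```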

More examples: [docs/examples.md](docs/examples.md)

## Requirements

- Python 3.8+
- numpy >= 1.20.0
- pandas >= 1.3.0
- scikit-learn >= 1.0.0
- scipy >= 1.7.0
- joblib >= 1.0.0

Optional (for visualization):
- matplotlib >= 3.4.0
- seaborn >= 0.11.0

## Contributing

Contributions are welcome! Please check:
- [GitHub Issues](https://github.com/alexeiveselov92/clustertk/issues) - Report bugs
- [GitHub Discussions](https://github.com/alexeiveselov92/clustertk/discussions) - Questions

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Citation

If you use ClusterTK in your research, please cite:

```bibtex
@software{clustertk2024,
  author = {Veselov, Aleksey},
  title = {ClusterTK: A Comprehensive Python Toolkit for Cluster Analysis},
  year = {2024},
  url = {https://github.com/alexeiveselov92/clustertk}
}
```

## Links

- **PyPI**: https://pypi.org/project/clustertk/
- **GitHub**: https://github.com/alexeiveselov92/clustertk
- **Documentation**: [docs/](docs/)
- **Author**: Aleksey Veselov (alexei.veselov92@gmail.com)

---

Made with ❤️ for the data science community

            

        