# skclust
A comprehensive clustering toolkit with advanced tree cutting, visualization, and network analysis capabilities.


⚠️ **Warning: This is a beta release and has not been thoroughly tested.**
## Features
- **Scikit-learn compatible** API for seamless integration
- **Multiple linkage methods** (Ward, Complete, Average, Single, etc.)
- **Advanced tree cutting** with dynamic, height-based, and max-cluster methods
- **Rich visualizations** with dendrograms and metadata tracks
- **Network analysis** with connectivity metrics and NetworkX integration
- **Cluster validation** using silhouette analysis
- **Eigenprofile calculation** for cluster characterization
- **Tree export** in Newick format for phylogenetic analysis
- **Distance matrix support** for precomputed distances
- **Metadata tracks** for biological and experimental annotations
## Installation
### Basic Installation
```bash
pip install skclust
```
### Development Installation
```bash
git clone https://github.com/jolespin/skclust.git
cd skclust
pip install -e .[all]
```
### Installation Options
```bash
# Basic functionality only
pip install skclust
# With fast clustering (fastcluster)
pip install skclust[fast]
# With tree operations (scikit-bio)
pip install skclust[tree]
# With all optional features
pip install skclust[all]
# Development installation
pip install skclust[dev]
```
## Dependencies
### Core Dependencies (Required)
- `numpy >= 1.19.0`
- `pandas >= 1.3.0`
- `scipy >= 1.7.0`
- `scikit-learn >= 1.0.0`
- `matplotlib >= 3.3.0`
- `seaborn >= 0.11.0`
- `networkx >= 2.6.0`
### Optional Dependencies (Enhanced Features)
- `fastcluster >= 1.2.0` - Faster linkage computations
- `scikit-bio >= 0.5.6` - Tree operations and Newick export
- `dynamicTreeCut >= 0.1.0` - Dynamic tree cutting algorithms
- `ensemble-networkx >= 0.1.0` - Enhanced network analysis
## Quick Start
```python
from skclust import HierarchicalClustering
import pandas as pd
import numpy as np
# Generate sample data
data = np.random.randn(100, 10)
df = pd.DataFrame(data, index=[f"Sample_{i}" for i in range(100)])
# Create and fit clusterer
clusterer = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=10
)
# Fit and get cluster labels
labels = clusterer.fit_transform(df)
# Plot dendrogram
fig, ax = clusterer.plot_dendrogram(figsize=(12, 6))
# Get summary
summary = clusterer.summary()
print(f"Found {clusterer.n_clusters_} clusters")
```
## Detailed Usage Examples
### 1. Basic Clustering with Different Methods
```python
from skclust import HierarchicalClustering
import pandas as pd
# `data` is assumed to be a samples-by-features pandas DataFrame,
# such as `df` from the Quick Start
# Ward clustering with dynamic cutting
clusterer = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=5
)
labels = clusterer.fit_transform(data)
# Complete linkage with height-based cutting
clusterer = HierarchicalClustering(
    method='complete',
    cut_method='height',
    cut_threshold=10.0
)
labels = clusterer.fit_transform(data)
# Average linkage with fixed number of clusters
clusterer = HierarchicalClustering(
    method='average',
    cut_method='maxclust',
    cut_threshold=4
)
labels = clusterer.fit_transform(data)
```
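To make the difference between height-based and max-cluster cutting concrete, here is a minimal sketch that uses SciPy directly rather than skclust. It assumes that `cut_method='height'` and `cut_method='maxclust'` correspond to SciPy's `distance` and `maxclust` criteria; that mapping is an assumption, not something this README confirms. The toy data and threshold values are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Toy data: two well-separated groups of 10 points each in 2-D
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(10, 2)),
])

# Linkage matrix with shape (n_samples - 1, 4)
Z = linkage(points, method="ward")

# Height-based cut: split wherever a merge happens above distance 10.0
labels_height = fcluster(Z, t=10.0, criterion="distance")

# Max-cluster cut: request at most 2 flat clusters
labels_maxclust = fcluster(Z, t=2, criterion="maxclust")

print(len(set(labels_height)), len(set(labels_maxclust)))
```

A height threshold depends on the scale of the data and on the linkage method, whereas a cluster count does not, which is why the two options usually need different tuning.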
### 2. Working with Distance Matrices
```python
from scipy.spatial.distance import pdist, squareform
# Compute custom distance matrix
distances = pdist(data.values, metric='correlation')
distance_matrix = pd.DataFrame(
    squareform(distances),
    index=data.index,
    columns=data.index
)
# Cluster using precomputed distances
clusterer = HierarchicalClustering(method='complete')
labels = clusterer.fit_transform(distance_matrix)
```
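Before passing a precomputed matrix, it can help to check that it is a valid distance matrix. The sketch below uses only NumPy and SciPy and shows the relationship between the condensed form returned by `pdist` and the square form; it does not show how skclust itself validates its input, which is not documented here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))           # 6 samples, 4 features

condensed = pdist(X, metric="correlation")   # length n * (n - 1) / 2
square = squareform(condensed)               # shape (n, n)

# A valid distance matrix is square, symmetric, and zero on the diagonal
is_square = square.shape == (6, 6)
is_symmetric = np.allclose(square, square.T)
has_zero_diagonal = np.allclose(np.diag(square), 0.0)

# SciPy's linkage takes the condensed form; squareform converts back
# and re-checks symmetry and the diagonal when checks=True
Z = linkage(squareform(square, checks=True), method="complete")
print(Z.shape)  # (5, 4)
```

Ward, centroid, and median linkage are only well defined for Euclidean distances, so a non-Euclidean precomputed matrix is better paired with complete, average, or single linkage.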
### 3. Adding Metadata Tracks
```python
# Add continuous metadata (e.g., age, expression levels)
age_data = np.random.normal(45, 15, len(data))
clusterer.add_track('Age', age_data, track_type='continuous', color='steelblue')
# Add categorical metadata (e.g., treatment groups)
treatment = ['Control'] * 30 + ['Treatment_A'] * 35 + ['Treatment_B'] * 35
clusterer.add_track(
    'Treatment',
    treatment,
    track_type='categorical',
    color={'Control': 'gray', 'Treatment_A': 'red', 'Treatment_B': 'blue'}
)
# Plot dendrogram with metadata tracks
fig, axes = clusterer.plot_dendrogram(
    figsize=(14, 10),
    show_tracks=True,
    track_height=1.0
)
```
### 4. Cluster Analysis and Validation
```python
# Calculate eigenprofiles (principal components for each cluster)
eigenprofiles = clusterer.eigenprofiles(data)
for cluster_id, profile in eigenprofiles.items():
    print(f"Cluster {cluster_id}: "
          f"Explained variance = {profile['explained_variance_ratio']:.3f}")
# Perform silhouette analysis
silhouette_results = clusterer.silhouette_analysis()
print(f"Overall silhouette score: {silhouette_results['overall_score']:.3f}")
# Calculate connectivity metrics
connectivity = clusterer.connectivity()
print("Connectivity analysis:", connectivity)
```
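As a rough illustration of what these two quantities measure, the sketch below computes them with NumPy and scikit-learn alone. It assumes an eigenprofile is summarized by the first principal component of each cluster's samples, which matches the comment above but may differ from skclust's actual computation; the helper name `first_pc_explained_variance` is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: two groups of 20 samples with 5 features each
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 5)),
    rng.normal(6.0, 1.0, size=(20, 5)),
])
labels = np.array([0] * 20 + [1] * 20)


def first_pc_explained_variance(block):
    """Fraction of a cluster's variance captured by its first principal component."""
    centered = block - block.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[0] / variances.sum()


# One explained-variance ratio per cluster
ratios = {
    int(k): float(first_pc_explained_variance(X[labels == k]))
    for k in np.unique(labels)
}

# Mean silhouette coefficient over all samples, in [-1, 1]
score = silhouette_score(X, labels)

print(ratios)
print(f"Silhouette score: {score:.3f}")
```

A ratio close to 1 means a single direction explains most of the variation within that cluster; a silhouette score close to 1 means samples sit much closer to their own cluster than to the nearest other cluster.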
### 5. Network Analysis
```python
# Convert to NetworkX graph
graph = clusterer.to_networkx(weight_threshold=0.3)
print(f"Graph: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")
# Visualize network (for small datasets)
import networkx as nx
import matplotlib.pyplot as plt
pos = nx.spring_layout(graph)
nx.draw(graph, pos, node_color=clusterer.labels_,
        node_size=50, cmap='viridis', alpha=0.7)
plt.title('Sample Network (colored by cluster)')
plt.show()
```
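The exact meaning of `weight_threshold` is not spelled out above. One plausible reading, sketched here with NetworkX and SciPy only, is that pairwise distances are rescaled to similarities in [0, 1] and an edge is kept when its similarity is at least the threshold. Both the rescaling and the comparison direction are assumptions made for illustration.

```python
import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
names = [f"Sample_{i}" for i in range(len(X))]

dist = squareform(pdist(X, metric="euclidean"))
# Assumption: turn distances into similarities in [0, 1]
similarity = 1.0 - dist / dist.max()

threshold = 0.3
graph = nx.Graph()
graph.add_nodes_from(names)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        # Keep only sufficiently similar pairs as weighted edges
        if similarity[i, j] >= threshold:
            graph.add_edge(names[i], names[j], weight=float(similarity[i, j]))

print(f"Graph: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")
```

Raising the threshold makes the graph sparser, which tends to make layouts such as `spring_layout` easier to read.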
### 6. Tree Export and Phylogenetic Analysis
```python
# Export tree in Newick format (requires scikit-bio)
try:
    newick_string = clusterer.to_newick()
    print("Newick tree:", newick_string[:100], "...")
    # Save to file
    clusterer.to_newick('my_tree.newick')
except ValueError as e:
    print("Tree export not available:", e)
```
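For readers unfamiliar with the format, the following sketch shows how a SciPy linkage matrix maps onto a Newick string. It is a standalone illustration built on `scipy.cluster.hierarchy.to_tree`, not the scikit-bio based export that skclust uses, and `linkage_to_newick` is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def linkage_to_newick(Z, leaf_names):
    """Convert a SciPy linkage matrix into a Newick string with branch lengths."""
    root = to_tree(Z)

    def build(node, parent_height):
        # Branch length is the height difference between parent and child
        branch = parent_height - node.dist
        if node.is_leaf():
            return f"{leaf_names[node.id]}:{branch:.4f}"
        left = build(node.get_left(), node.dist)
        right = build(node.get_right(), node.dist)
        return f"({left},{right}):{branch:.4f}"

    left = build(root.get_left(), root.dist)
    right = build(root.get_right(), root.dist)
    return f"({left},{right});"


# Three points on a line: A and B are close, C is far away
points = np.array([[0.0], [1.0], [5.0]])
Z = linkage(points, method="single")
newick = linkage_to_newick(Z, ["A", "B", "C"])
print(newick)
```

Linkage methods that can produce inversions, such as centroid and median, may yield negative branch lengths with this approach.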
### 7. Convenience Function
```python
from skclust import hierarchical_clustering
# Quick clustering with default parameters
clusterer = hierarchical_clustering(
    data,
    method='ward',
    min_cluster_size=10
)
print(f"Quick clustering: {clusterer.n_clusters_} clusters")
```
## Biological Data Example
```python
import pandas as pd
import numpy as np
from skclust import HierarchicalClustering
# Simulate gene expression data
np.random.seed(42)
n_samples, n_genes = 80, 1000
expression_data = np.random.randn(n_samples, n_genes)
# Add structure: 3 patient groups with different expression patterns
expression_data[:25, :100] += 2.0 # Group 1: high expression in genes 1-100
expression_data[25:50, 100:200] += 2.0 # Group 2: high expression in genes 101-200
expression_data[50:, 200:300] += 2.0 # Group 3: high expression in genes 201-300
# Create DataFrame with sample names
sample_names = [f"Patient_{i:02d}" for i in range(n_samples)]
gene_names = [f"Gene_{i:04d}" for i in range(n_genes)]
df_expression = pd.DataFrame(expression_data,
                             index=sample_names,
                             columns=gene_names)
# Perform hierarchical clustering
clusterer = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=8,
    name='Gene_Expression_Clustering'
)
labels = clusterer.fit_transform(df_expression)
# Add clinical metadata
age = np.random.normal(55, 12, n_samples)
gender = np.random.choice(['Male', 'Female'], n_samples)
stage = ['Stage_I'] * 20 + ['Stage_II'] * 30 + ['Stage_III'] * 30
clusterer.add_track('Age', age, track_type='continuous')
clusterer.add_track('Gender', gender, track_type='categorical')
clusterer.add_track('Disease_Stage', stage, track_type='categorical')
# Visualize results
fig, axes = clusterer.plot_dendrogram(figsize=(15, 10), show_tracks=True)
# Analyze cluster characteristics
eigenprofiles = clusterer.eigenprofiles(df_expression)
silhouette_results = clusterer.silhouette_analysis()
print(f"Identified {clusterer.n_clusters_} patient clusters")
print(f"Silhouette score: {silhouette_results['overall_score']:.3f}")
# Print cluster summary
clusterer.summary()
```
## Advanced Configuration
### Custom Linkage Methods
```python
# Supported linkage methods
methods = ['ward', 'complete', 'average', 'single', 'centroid', 'median', 'weighted']
for method in methods:
    clusterer = HierarchicalClustering(method=method)
    labels = clusterer.fit_transform(data)
    print(f"{method}: {clusterer.n_clusters_} clusters")
```
### Distance Metrics
```python
# Supported distance metrics (for raw data)
metrics = ['euclidean', 'manhattan', 'cosine', 'correlation']
for metric in metrics:
    clusterer = HierarchicalClustering(metric=metric)
    labels = clusterer.fit_transform(data)
    print(f"{metric}: {clusterer.n_clusters_} clusters")
```
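Metric names are not uniform across libraries. SciPy's `pdist` calls Manhattan distance `cityblock`, while scikit-learn accepts both `manhattan` and `cityblock`. Whether skclust translates between the two spellings is not stated here, so the sketch below only shows what the metrics compute when calling SciPy directly.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two points forming a 3-4-5 right triangle with the origin
X = np.array([[0.0, 0.0], [3.0, 4.0]])

euclidean = pdist(X, metric="euclidean")[0]   # straight-line distance
manhattan = pdist(X, metric="cityblock")[0]   # SciPy's name for Manhattan

# Correlation distance is 1 - Pearson r, so perfectly anti-correlated rows give 2
Y = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
correlation = pdist(Y, metric="correlation")[0]

print(euclidean, manhattan, correlation)
```

Cosine and correlation distances ignore magnitude, so they group samples by the shape of their profiles rather than by absolute level.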
### Dynamic Tree Cutting Parameters
```python
# Fine-tune dynamic tree cutting
clusterer = HierarchicalClustering(
    cut_method='dynamic',
    min_cluster_size=10,          # Minimum samples per cluster
    deep_split=2,                 # Sensitivity (0-4, higher = more clusters)
    dynamic_cut_method='hybrid'   # 'hybrid' or 'tree'
)
```
## Performance Tips
1. **Use fastcluster**: Install `fastcluster` for significantly faster linkage computation
2. **Distance matrices**: Precompute distance matrices for repeated analysis
3. **Data preprocessing**: Standardize/normalize data before clustering
4. **Memory management**: For large datasets (>1000 samples), consider subsampling
```python
# Example: Preprocessing pipeline
from sklearn.preprocessing import StandardScaler
# Standardize features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
df_scaled = pd.DataFrame(data_scaled, index=data.index, columns=data.columns)
# Cluster scaled data
clusterer = HierarchicalClustering(method='ward')
labels = clusterer.fit_transform(df_scaled)
```
## Troubleshooting
### Common Issues
1. **ImportError for optional dependencies**:
   ```bash
   pip install skclust[all]
   ```
2. **Memory issues with large datasets**:
   - Use data subsampling or dimensionality reduction
   - Consider approximate methods for >5000 samples
3. **Dynamic tree cutting not working**:
   - Install the `dynamicTreeCut` package
   - Without it, clustering falls back to height-based cutting automatically
4. **Tree export failing**:
   - Install the `scikit-bio` package
   - Check that clustering was successful
### Performance Benchmarks
| Dataset Size | Method | Time (seconds) | Memory (GB) |
|-------------|--------|----------------|-------------|
| 100 samples | Ward | 0.01 | < 0.1 |
| 500 samples | Ward | 0.1 | 0.2 |
| 1000 samples| Ward | 0.5 | 0.8 |
| 2000 samples| Ward | 2.0 | 3.2 |
## API Reference
### HierarchicalClustering Class
#### Parameters
- `method` (str): Linkage method ('ward', 'complete', 'average', 'single', 'centroid', 'median', 'weighted')
- `metric` (str): Distance metric ('euclidean', 'manhattan', 'cosine', etc.)
- `cut_method` (str): Tree cutting method ('dynamic', 'height', 'maxclust')
- `min_cluster_size` (int): Minimum cluster size for dynamic cutting
- `cut_threshold` (float): Cut height when `cut_method='height'`, or the number of clusters when `cut_method='maxclust'`
- `name` (str): Optional name for the clustering instance
#### Methods
- `fit(X)`: Fit clustering to data
- `transform()`: Return cluster labels
- `fit_transform(X)`: Fit and return labels
- `add_track(name, data, track_type)`: Add metadata track
- `plot_dendrogram(**kwargs)`: Plot dendrogram with optional tracks
- `eigenprofiles(data)`: Calculate cluster eigenprofiles
- `silhouette_analysis()`: Perform silhouette analysis
- `connectivity()`: Calculate network connectivity
- `to_networkx()`: Convert to NetworkX graph
- `to_newick()`: Export tree in Newick format
- `summary()`: Print clustering summary
#### Attributes (after fitting)
- `labels_`: Cluster labels for each sample
- `n_clusters_`: Number of clusters found
- `linkage_matrix_`: Hierarchical linkage matrix
- `distance_matrix_`: Distance matrix used
- `tree_`: Tree object (if available)
- `tracks_`: Dictionary of metadata tracks
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## Original Implementation
This package is based on the hierarchical clustering implementation originally developed in the [Soothsayer](https://github.com/jolespin/soothsayer) framework:
**Espinoza JL, Dupont CL, O'Rourke A, Beyhan S, Morales P, et al. (2021) Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach. PLOS Computational Biology 17(3): e1008857.** [https://doi.org/10.1371/journal.pcbi.1008857](https://doi.org/10.1371/journal.pcbi.1008857)
The original implementation provided the foundation for the hierarchical clustering algorithms, metadata track visualization, and eigenprofile analysis features in this package.
## Acknowledgments
- Built on top of scipy, scikit-learn, and networkx
- Original implementation developed in the [Soothsayer framework](https://github.com/jolespin/soothsayer)
- Inspired by WGCNA and other biological clustering tools
- Dynamic tree cutting algorithms from the dynamicTreeCut package
## Support
- **Documentation**: See the [project repository](https://github.com/jolespin/skclust)
- **Issues**: [GitHub Issues](https://github.com/jolespin/skclust/issues)
- **Discussions**: [GitHub Discussions](https://github.com/jolespin/skclust/discussions)
## Citation
If you use this package in your research, please cite:
```bibtex
@software{hierarchical_clustering,
  author = {Josh L. Espinoza},
  title = {HierarchicalClustering: A comprehensive hierarchical clustering toolkit},
  url = {https://github.com/jolespin/skclust},
  version = {2025.7.26},
  year = {2025}
}
```
**Original Soothsayer implementation:**
```bibtex
@article{espinoza2021predicting,
  title={Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach},
  author={Espinoza, Josh L and Dupont, Chris L and O'Rourke, Aubrie and Beyhan, Seherzada and Morales, Paula and others},
  journal={PLOS Computational Biology},
  volume={17},
  number={3},
  pages={e1008857},
  year={2021},
  publisher={Public Library of Science San Francisco, CA USA},
  doi={10.1371/journal.pcbi.1008857},
  url={https://doi.org/10.1371/journal.pcbi.1008857}
}
```