skclust


| Field | Value |
|-------|-------|
| Name | skclust |
| Version | 2025.7.26 |
| home_page | https://github.com/jolespin/skclust |
| Summary | A comprehensive clustering toolkit with advanced tree cutting and visualization |
| upload_time | 2025-08-02 02:05:24 |
| maintainer | None |
| docs_url | None |
| author | Josh L. Espinoza |
| requires_python | >=3.8 |
| license | MIT |
| keywords | clustering, hierarchical-clustering, dendrogram, tree-cutting, machine-learning, data-analysis, bioinformatics, network-analysis, visualization, scikit-learn |
| requirements | numpy, pandas, scipy, scikit-learn, matplotlib, seaborn, networkx |

# skclust
A comprehensive clustering toolkit with advanced tree cutting, visualization, and network analysis capabilities.

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![scikit-learn compatible](https://img.shields.io/badge/sklearn-compatible-orange.svg)](https://scikit-learn.org)
![Beta](https://img.shields.io/badge/status-beta-orange)
![Not Production Ready](https://img.shields.io/badge/production-not%20ready-red)

⚠️ **Warning: This is a beta release and has not been thoroughly tested.**

## Features

- **Scikit-learn compatible** API for seamless integration
- **Multiple linkage methods** (Ward, Complete, Average, Single, etc.)
- **Advanced tree cutting** with dynamic, height-based, and max-cluster methods
- **Rich visualizations** with dendrograms and metadata tracks
- **Network analysis** with connectivity metrics and NetworkX integration
- **Cluster validation** using silhouette analysis
- **Eigenprofile calculation** for cluster characterization
- **Tree export** in Newick format for phylogenetic analysis
- **Distance matrix support** for precomputed distances
- **Metadata tracks** for biological and experimental annotations

## Installation

### Basic Installation

```bash
pip install skclust
```

### Development Installation

```bash
git clone https://github.com/jolespin/skclust.git
cd skclust
pip install -e ".[all]"
```

### Installation Options

```bash
# Basic functionality only
pip install skclust

# With fast clustering (fastcluster)
# (quotes keep shells such as zsh from expanding the brackets)
pip install "skclust[fast]"

# With tree operations (scikit-bio)
pip install "skclust[tree]"

# With all optional features
pip install "skclust[all]"

# Development installation
pip install "skclust[dev]"
```

## Dependencies

### Core Dependencies (Required)
- `numpy >= 1.19.0`
- `pandas >= 1.3.0`
- `scipy >= 1.7.0`
- `scikit-learn >= 1.0.0`
- `matplotlib >= 3.3.0`
- `seaborn >= 0.11.0`
- `networkx >= 2.6.0`

### Optional Dependencies (Enhanced Features)
- `fastcluster >= 1.2.0` - Faster linkage computations
- `scikit-bio >= 0.5.6` - Tree operations and Newick export
- `dynamicTreeCut >= 0.1.0` - Dynamic tree cutting algorithms
- `ensemble-networkx >= 0.1.0` - Enhanced network analysis

## Quick Start

```python
from skclust import HierarchicalClustering
import pandas as pd
import numpy as np

# Generate sample data
data = np.random.randn(100, 10)
df = pd.DataFrame(data, index=[f"Sample_{i}" for i in range(100)])

# Create and fit clusterer
clusterer = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=10
)

# Fit and get cluster labels
labels = clusterer.fit_transform(df)

# Plot dendrogram
fig, ax = clusterer.plot_dendrogram(figsize=(12, 6))

# Get summary
summary = clusterer.summary()
print(f"Found {clusterer.n_clusters_} clusters")
```

## Detailed Usage Examples

### 1. Basic Clustering with Different Methods

```python
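from skclust import HierarchicalClustering
import numpy as np
import pandas as pd

# Example data as a DataFrame, so that later sections can use
# data.index, data.columns, and data.values
data = pd.DataFrame(
    np.random.randn(100, 10),
    index=[f"Sample_{i}" for i in range(100)]
)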

# Ward clustering with dynamic cutting
clusterer = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=5
)
labels = clusterer.fit_transform(data)

# Complete linkage with height-based cutting
clusterer = HierarchicalClustering(
    method='complete',
    cut_method='height',
    cut_threshold=10.0
)
labels = clusterer.fit_transform(data)

# Average linkage with fixed number of clusters
clusterer = HierarchicalClustering(
    method='average',
    cut_method='maxclust',
    cut_threshold=4
)
labels = clusterer.fit_transform(data)
```
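
To make the difference between the `'height'` and `'maxclust'` cuts concrete, here is a minimal sketch using SciPy directly. It is not skclust's implementation (the README does not show its internals). It only illustrates the two criteria with `scipy.cluster.hierarchy.fcluster`, on synthetic data chosen for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated blobs of 20 points each in 5 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(8.0, 1.0, (20, 5))])

Z = linkage(X, method="complete", metric="euclidean")

# Height-based cut: samples share a flat cluster only if they are
# joined at or below height 10.0 in the dendrogram.
labels_height = fcluster(Z, t=10.0, criterion="distance")

# Max-cluster cut: the tree is cut so that at most 4 flat clusters remain.
labels_maxclust = fcluster(Z, t=4, criterion="maxclust")

print("height cut:  ", len(set(labels_height)), "clusters")
print("maxclust cut:", len(set(labels_maxclust)), "clusters")
```

A height threshold depends on the scale of the distances, so it usually has to be re-tuned after standardizing the data, whereas a cluster count does not.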

### 2. Working with Distance Matrices

```python
from scipy.spatial.distance import pdist, squareform

# Compute custom distance matrix
distances = pdist(data.values, metric='correlation')
distance_matrix = pd.DataFrame(
    squareform(distances),
    index=data.index,
    columns=data.index
)

# Cluster using precomputed distances
clusterer = HierarchicalClustering(method='complete')
labels = clusterer.fit_transform(distance_matrix)
```
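
A precomputed matrix is only useful if it is a valid distance matrix. The following sketch shows the checks one might run before clustering, and the conversion to the condensed form that SciPy's `linkage` expects. `to_condensed` is a hypothetical helper written for this illustration, and whether skclust performs the same validation is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform


def to_condensed(square):
    """Validate a square distance matrix and return SciPy's condensed form."""
    square = np.asarray(square, dtype=float)
    if square.ndim != 2 or square.shape[0] != square.shape[1]:
        raise ValueError("distance matrix must be square")
    if not np.allclose(square, square.T):
        raise ValueError("distance matrix must be symmetric")
    if not np.allclose(np.diag(square), 0.0):
        raise ValueError("distance matrix must have a zero diagonal")
    # checks=False because the validation above has already run.
    return squareform(square, checks=False)


rng = np.random.default_rng(1)
X = rng.normal(size=(12, 6))
square = squareform(pdist(X, metric="correlation"))

condensed = to_condensed(square)
Z = linkage(condensed, method="complete")
print(condensed.shape, Z.shape)  # (66,) (11, 4)
```

The condensed form stores each pair once, so 12 samples give 12 * 11 / 2 = 66 distances, and the linkage matrix has one row per merge.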

### 3. Adding Metadata Tracks

```python
# Add continuous metadata (e.g., age, expression levels)
age_data = np.random.normal(45, 15, len(data))
clusterer.add_track('Age', age_data, track_type='continuous', color='steelblue')

# Add categorical metadata (e.g., treatment groups)
treatment = ['Control'] * 30 + ['Treatment_A'] * 35 + ['Treatment_B'] * 35
clusterer.add_track(
    'Treatment', 
    treatment, 
    track_type='categorical',
    color={'Control': 'gray', 'Treatment_A': 'red', 'Treatment_B': 'blue'}
)

# Plot dendrogram with metadata tracks
fig, axes = clusterer.plot_dendrogram(
    figsize=(14, 10),
    show_tracks=True,
    track_height=1.0
)
```

### 4. Cluster Analysis and Validation

```python
# Calculate eigenprofiles (principal components for each cluster)
eigenprofiles = clusterer.eigenprofiles(data)
for cluster_id, profile in eigenprofiles.items():
    print(f"Cluster {cluster_id}: "
          f"Explained variance = {profile['explained_variance_ratio']:.3f}")

# Perform silhouette analysis
silhouette_results = clusterer.silhouette_analysis()
print(f"Overall silhouette score: {silhouette_results['overall_score']:.3f}")

# Calculate connectivity metrics
connectivity = clusterer.connectivity()
print("Connectivity analysis:", connectivity)
```
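
The README describes eigenprofiles as principal components computed per cluster. One plausible reading, sketched below with NumPy only, takes the first principal component of each cluster's samples and reports the fraction of variance it explains. The function name `eigenprofile` and the exact definition are assumptions made for illustration and may differ from what skclust computes.

```python
import numpy as np


def eigenprofile(X_cluster):
    """First principal component of one cluster's samples (rows).

    Returns the per-sample scores on PC1 and the fraction of the
    cluster's variance that PC1 explains.
    """
    X_cluster = np.asarray(X_cluster, dtype=float)
    centered = X_cluster - X_cluster.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal axis.
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    total = np.sum(s**2)
    if total == 0:
        raise ValueError("cluster has no variance")
    scores = U[:, 0] * s[0]
    return scores, float(s[0] ** 2 / total)


rng = np.random.default_rng(2)
X = rng.normal(size=(30, 8))
labels = np.repeat([1, 2, 3], 10)  # stand-in for clusterer.labels_

for cluster_id in np.unique(labels):
    scores, ratio = eigenprofile(X[labels == cluster_id])
    print(f"Cluster {cluster_id}: PC1 explains {ratio:.3f} of the variance")
```

A ratio close to 1 means a single axis summarizes the cluster well, while a ratio near 1 / n_features suggests the cluster has no dominant direction.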

### 5. Network Analysis

```python
# Convert to NetworkX graph
graph = clusterer.to_networkx(weight_threshold=0.3)
print(f"Graph: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")

# Visualize network (for small datasets)
import networkx as nx
import matplotlib.pyplot as plt

pos = nx.spring_layout(graph)
nx.draw(graph, pos, node_color=clusterer.labels_, 
        node_size=50, cmap='viridis', alpha=0.7)
plt.title('Sample Network (colored by cluster)')
plt.show()
```
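
The README does not spell out how `weight_threshold` is applied. The sketch below shows one common convention: rescale distances to [0, 1], convert them to similarities, and keep only edges at or above the threshold. Both the conversion and the helper `similarity_graph` are assumptions for illustration, not skclust's documented behavior.

```python
import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform


def similarity_graph(distance_matrix, node_names, weight_threshold=0.3):
    """Build an undirected graph from a square distance matrix.

    Assumption: similarity = 1 - distance / max(distance), and only
    edges with similarity >= weight_threshold are kept.
    """
    D = np.asarray(distance_matrix, dtype=float)
    max_d = D.max()
    S = 1.0 - D / max_d if max_d > 0 else np.ones_like(D)

    graph = nx.Graph()
    graph.add_nodes_from(node_names)
    n = len(node_names)
    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] >= weight_threshold:
                graph.add_edge(node_names[i], node_names[j], weight=float(S[i, j]))
    return graph


rng = np.random.default_rng(5)
X = rng.normal(size=(10, 4))
names = [f"Sample_{i}" for i in range(10)]
D = squareform(pdist(X))

graph = similarity_graph(D, names, weight_threshold=0.3)
print(f"Graph: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")
```

Raising the threshold removes weak edges first, which tends to split the graph into components that resemble the flat clusters.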

### 6. Tree Export and Phylogenetic Analysis

```python
# Export tree in Newick format (requires scikit-bio)
try:
    newick_string = clusterer.to_newick()
    print("Newick tree:", newick_string[:100], "...")
    
    # Save to file
    clusterer.to_newick('my_tree.newick')
except (ImportError, ValueError) as e:
    print("Tree export not available:", e)
```
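
For readers without scikit-bio, a Newick string can also be produced from a SciPy linkage matrix. This is a minimal sketch of that idea using `scipy.cluster.hierarchy.to_tree`. `linkage_to_newick` is a hypothetical helper, not the skclust export, and its output format (four-decimal branch lengths) is an arbitrary choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def linkage_to_newick(Z, leaf_names):
    """Serialize a SciPy linkage matrix as a Newick string.

    A branch length is the parent's merge height minus the node's own
    merge height (leaves have height 0).
    """
    root = to_tree(Z)

    def build(node, parent_height):
        length = parent_height - node.dist
        if node.is_leaf():
            return f"{leaf_names[node.id]}:{length:.4f}"
        left = build(node.get_left(), node.dist)
        right = build(node.get_right(), node.dist)
        return f"({left},{right}):{length:.4f}"

    left = build(root.get_left(), root.dist)
    right = build(root.get_right(), root.dist)
    return f"({left},{right});"


rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))
names = [f"Sample_{i}" for i in range(6)]

newick = linkage_to_newick(linkage(X, method="average"), names)
print(newick)
```

The recursion depth equals the tree depth, so very unbalanced trees over thousands of samples may need an iterative traversal instead.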

### 7. Convenience Function

```python
from skclust import hierarchical_clustering

# Quick clustering with default parameters
clusterer = hierarchical_clustering(
    data, 
    method='ward', 
    min_cluster_size=10
)
print(f"Quick clustering: {clusterer.n_clusters_} clusters")
```

## Biological Data Example

```python
import pandas as pd
import numpy as np
from skclust import HierarchicalClustering

# Simulate gene expression data
np.random.seed(42)
n_samples, n_genes = 80, 1000
expression_data = np.random.randn(n_samples, n_genes)

# Add structure: 3 patient groups with different expression patterns
expression_data[:25, :100] += 2.0   # Group 1: high expression in genes 1-100
expression_data[25:50, 100:200] += 2.0  # Group 2: high expression in genes 101-200
expression_data[50:, 200:300] += 2.0     # Group 3: high expression in genes 201-300

# Create DataFrame with sample names
sample_names = [f"Patient_{i:02d}" for i in range(n_samples)]
gene_names = [f"Gene_{i:04d}" for i in range(n_genes)]
df_expression = pd.DataFrame(expression_data, 
                           index=sample_names, 
                           columns=gene_names)

# Perform hierarchical clustering
clusterer = HierarchicalClustering(
    method='ward',
    cut_method='dynamic',
    min_cluster_size=8,
    name='Gene_Expression_Clustering'
)

labels = clusterer.fit_transform(df_expression)

# Add clinical metadata
age = np.random.normal(55, 12, n_samples)
gender = np.random.choice(['Male', 'Female'], n_samples)
stage = ['Stage_I'] * 20 + ['Stage_II'] * 30 + ['Stage_III'] * 30

clusterer.add_track('Age', age, track_type='continuous')
clusterer.add_track('Gender', gender, track_type='categorical')
clusterer.add_track('Disease_Stage', stage, track_type='categorical')

# Visualize results
fig, axes = clusterer.plot_dendrogram(figsize=(15, 10), show_tracks=True)

# Analyze cluster characteristics
eigenprofiles = clusterer.eigenprofiles(df_expression)
silhouette_results = clusterer.silhouette_analysis()

print(f"Identified {clusterer.n_clusters_} patient clusters")
print(f"Silhouette score: {silhouette_results['overall_score']:.3f}")

# Print cluster summary
clusterer.summary()
```
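
Because the simulated data has three planted groups, the clustering can be checked against a known answer. The sketch below repeats the simulation with SciPy and scikit-learn only, so it runs without skclust. It uses a fixed three-cluster cut rather than dynamic tree cutting, which is a deliberate simplification.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(42)
n_samples, n_genes = 80, 1000
X = rng.normal(size=(n_samples, n_genes))

# Same planted structure as the example above.
X[:25, :100] += 2.0
X[25:50, 100:200] += 2.0
X[50:, 200:300] += 2.0
truth = np.array([0] * 25 + [1] * 25 + [2] * 30)

# Ward linkage, then cut the tree into at most 3 flat clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

ari = adjusted_rand_score(truth, labels)
sil = silhouette_score(X, labels, metric="euclidean")
print(f"Adjusted Rand index vs. planted groups: {ari:.3f}")
print(f"Silhouette score: {sil:.3f}")
```

The two numbers answer different questions: the adjusted Rand index measures agreement with the planted groups, while the silhouette score needs no ground truth. With 700 pure-noise genes out of 1000, a modest silhouette score is expected even when the groups are recovered well.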

## Advanced Configuration

### Custom Linkage Methods

```python
# Supported linkage methods
methods = ['ward', 'complete', 'average', 'single', 'centroid', 'median', 'weighted']

for method in methods:
    clusterer = HierarchicalClustering(method=method)
    labels = clusterer.fit_transform(data)
    print(f"{method}: {clusterer.n_clusters_} clusters")
```
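
Cluster counts alone say little about which linkage method suits a dataset. One standard diagnostic is the cophenetic correlation, which measures how faithfully a dendrogram preserves the original pairwise distances. The sketch below computes it with SciPy for the seven methods listed above. It is a supplementary check on synthetic data, not a feature of skclust.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))
Y = pdist(X, metric="euclidean")  # original pairwise distances

methods = ["ward", "complete", "average", "single", "centroid", "median", "weighted"]

cophenetic = {}
for method in methods:
    # In SciPy, ward, centroid, and median linkage are only correctly
    # defined for Euclidean distances, so Euclidean is used throughout.
    Z = linkage(X, method=method, metric="euclidean")
    c, _ = cophenet(Z, Y)
    cophenetic[method] = float(c)

for method, c in sorted(cophenetic.items(), key=lambda kv: -kv[1]):
    print(f"{method:9s} cophenetic correlation = {c:.3f}")
```

Values closer to 1 indicate that merge heights track the input distances more closely. This is a property of the tree, independent of where it is later cut.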

### Distance Metrics

```python
# Supported distance metrics (for raw data)
metrics = ['euclidean', 'manhattan', 'cosine', 'correlation']

for metric in metrics:
    clusterer = HierarchicalClustering(metric=metric)
    labels = clusterer.fit_transform(data)
    print(f"{metric}: {clusterer.n_clusters_} clusters")
```
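
Metric names differ between libraries: SciPy's `pdist` calls the Manhattan metric `'cityblock'`. The sketch below maps the names used in this README onto SciPy's, which is useful when precomputing a distance matrix by hand. Whether skclust performs such a translation internally is not stated here, so the mapping `SCIPY_METRIC` and the helper `condensed_distances` are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Names as used in this README -> names understood by scipy's pdist.
SCIPY_METRIC = {
    "euclidean": "euclidean",
    "manhattan": "cityblock",
    "cosine": "cosine",
    "correlation": "correlation",
}


def condensed_distances(X, metric="euclidean"):
    """Pairwise distances in SciPy's condensed form for a README metric name."""
    return pdist(np.asarray(X, dtype=float), metric=SCIPY_METRIC[metric])


X = np.array([[0.0, 0.0], [3.0, 4.0]])
print(condensed_distances(X, "euclidean"))  # [5.]
print(condensed_distances(X, "manhattan"))  # [7.]
```

Cosine and correlation distances are defined as one minus the corresponding similarity, so they range from 0 to 2 and ignore vector magnitude, unlike the Euclidean and Manhattan metrics.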

### Dynamic Tree Cutting Parameters

```python
# Fine-tune dynamic tree cutting
clusterer = HierarchicalClustering(
    cut_method='dynamic',
    min_cluster_size=10,        # Minimum samples per cluster
    deep_split=2,               # Sensitivity (0-4, higher = more clusters)
    dynamic_cut_method='hybrid' # 'hybrid' or 'tree'
)
```

## Performance Tips

1. **Use fastcluster**: Install `fastcluster` for significantly faster linkage computation
2. **Distance matrices**: Precompute distance matrices for repeated analysis
3. **Data preprocessing**: Standardize/normalize data before clustering
4. **Memory management**: For large datasets (>1000 samples), consider subsampling

```python
# Example: Preprocessing pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
df_scaled = pd.DataFrame(data_scaled, index=data.index, columns=data.columns)

# Cluster scaled data
clusterer = HierarchicalClustering(method='ward')
labels = clusterer.fit_transform(df_scaled)
```
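
The memory tip above follows from simple arithmetic: pairwise distances grow quadratically with the number of samples. The sketch below estimates the size of the distance matrix alone, assuming 8-byte floats. Real usage will be higher because of intermediate copies and the linkage computation itself, so treat the figures as a lower bound. `distance_matrix_bytes` is a hypothetical helper.

```python
def distance_matrix_bytes(n_samples, itemsize=8):
    """Bytes needed for the condensed and square distance matrices."""
    condensed = n_samples * (n_samples - 1) // 2 * itemsize
    square = n_samples * n_samples * itemsize
    return condensed, square


for n in (1_000, 5_000, 20_000):
    condensed, square = distance_matrix_bytes(n)
    print(f"{n:>6d} samples: condensed {condensed / 1e9:.3f} GB, "
          f"square {square / 1e9:.3f} GB")
```

Doubling the number of samples roughly quadruples both figures, which is why subsampling or dimensionality reduction pays off quickly on large datasets.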

## Troubleshooting

### Common Issues

1. **ImportError for optional dependencies**:
   ```bash
   pip install "skclust[all]"
   ```

2. **Memory issues with large datasets**:
   - Use data subsampling or dimensionality reduction
   - Consider approximate methods for >5000 samples

3. **Dynamic tree cutting not working**:
   - Install `dynamicTreeCut` package
   - Without it, clustering falls back to height-based cutting automatically

4. **Tree export failing**:
   - Install `scikit-bio` package
   - Check that clustering was successful

### Performance Benchmarks

| Dataset Size | Method | Time (seconds) | Memory (GB) |
|-------------|--------|----------------|-------------|
| 100 samples | Ward   | 0.01          | < 0.1       |
| 500 samples | Ward   | 0.1           | 0.2         |
| 1000 samples| Ward   | 0.5           | 0.8         |
| 2000 samples| Ward   | 2.0           | 3.2         |

## API Reference

### HierarchicalClustering Class

#### Parameters
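- `method` (str): Linkage method ('ward', 'complete', 'average', 'single', 'centroid', 'median', 'weighted')
- `metric` (str): Distance metric for raw data ('euclidean', 'manhattan', 'cosine', 'correlation')
- `cut_method` (str): Tree cutting method ('dynamic', 'height', 'maxclust')
- `min_cluster_size` (int): Minimum cluster size for dynamic cutting
- `cut_threshold` (float): Cut height for 'height', or maximum number of clusters for 'maxclust'
- `deep_split` (int): Sensitivity of dynamic cutting (0-4, higher = more clusters)
- `dynamic_cut_method` (str): Dynamic cutting variant ('hybrid' or 'tree')
- `name` (str): Optional name for the clustering instance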

#### Methods
- `fit(X)`: Fit clustering to data
- `transform()`: Return cluster labels
- `fit_transform(X)`: Fit and return labels
- `add_track(name, data, track_type)`: Add a 'continuous' or 'categorical' metadata track; accepts an optional `color` (a single color, or a mapping from category to color)
- `plot_dendrogram(**kwargs)`: Plot dendrogram with optional tracks (e.g. `figsize`, `show_tracks`, `track_height`)
- `eigenprofiles(data)`: Calculate cluster eigenprofiles
- `silhouette_analysis()`: Perform silhouette analysis
- `connectivity()`: Calculate network connectivity
- `to_networkx()`: Convert to NetworkX graph; accepts `weight_threshold`
- `to_newick()`: Return the tree as a Newick string, or write it to a file when given a path (requires scikit-bio)
- `summary()`: Print clustering summary

#### Attributes (after fitting)
- `labels_`: Cluster labels for each sample
- `n_clusters_`: Number of clusters found
- `linkage_matrix_`: Hierarchical linkage matrix
- `distance_matrix_`: Distance matrix used
- `tree_`: Tree object (if available)
- `tracks_`: Dictionary of metadata tracks

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.


## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## Original Implementation

This package is based on the hierarchical clustering implementation originally developed in the [Soothsayer](https://github.com/jolespin/soothsayer) framework:

**Espinoza JL, Dupont CL, O'Rourke A, Beyhan S, Morales P, et al. (2021) Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach. PLOS Computational Biology 17(3): e1008857.** [https://doi.org/10.1371/journal.pcbi.1008857](https://doi.org/10.1371/journal.pcbi.1008857)

The original implementation provided the foundation for the hierarchical clustering algorithms, metadata track visualization, and eigenprofile analysis features in this package.

## Acknowledgments

- Built on top of scipy, scikit-learn, and networkx
- Original implementation developed in the [Soothsayer framework](https://github.com/jolespin/soothsayer)
- Inspired by WGCNA and other biological clustering tools
- Dynamic tree cutting algorithms from the dynamicTreeCut package

## Support

- **Documentation**: [Link to docs]
- **Issues**: [GitHub Issues](https://github.com/jolespin/skclust/issues)
- **Discussions**: [GitHub Discussions](https://github.com/jolespin/skclust/discussions)

## Citation

If you use this package in your research, please cite:

```bibtex
@software{hierarchical_clustering,
  author = {Josh L. Espinoza},
  title = {HierarchicalClustering: A comprehensive hierarchical clustering toolkit},
  url = {https://github.com/jolespin/skclust},
  version = {2025.7.26},
  year = {2025}
}
```

**Original Soothsayer implementation:**
```bibtex
@article{espinoza2021predicting,
  title={Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach},
  author={Espinoza, Josh L and Dupont, Chris L and O'Rourke, Aubrie and Beyhan, Seherzada and Morales, Paula and others},
  journal={PLOS Computational Biology},
  volume={17},
  number={3},
  pages={e1008857},
  year={2021},
  publisher={Public Library of Science San Francisco, CA USA},
  doi={10.1371/journal.pcbi.1008857},
  url={https://doi.org/10.1371/journal.pcbi.1008857}
}
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/jolespin/skclust",
    "name": "skclust",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "clustering, hierarchical-clustering, dendrogram, tree-cutting, machine-learning, data-analysis, bioinformatics, network-analysis, visualization, scikit-learn",
    "author": "Josh L. Espinoza",
    "author_email": "\"Josh L. Espinoza\" <jol.espinoz@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/44/37/dd9a085a7e3e8075ad3281c1f30b341b0dc83f4130970c08611b2e03cecb/skclust-2025.7.26.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A comprehensive clustering toolkit with advanced tree cutting and visualization",
    "version": "2025.7.26",
    "project_urls": {
        "Bug Reports": "https://github.com/jolespin/skclust/issues",
        "Homepage": "https://github.com/jolespin/skclust",
        "Source": "https://github.com/jolespin/skclust"
    },
    "split_keywords": [
        "clustering",
        " hierarchical-clustering",
        " dendrogram",
        " tree-cutting",
        " machine-learning",
        " data-analysis",
        " bioinformatics",
        " network-analysis",
        " visualization",
        " scikit-learn"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4437dd9a085a7e3e8075ad3281c1f30b341b0dc83f4130970c08611b2e03cecb",
                "md5": "5d351aa66c7d51d8202674c97275abee",
                "sha256": "a94ca54f06b9a1d4b83da4fa1f757113a3689ae5aad0c91eff0dfa84739e3f5d"
            },
            "downloads": -1,
            "filename": "skclust-2025.7.26.tar.gz",
            "has_sig": false,
            "md5_digest": "5d351aa66c7d51d8202674c97275abee",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 20348,
            "upload_time": "2025-08-02T02:05:24",
            "upload_time_iso_8601": "2025-08-02T02:05:24.234098Z",
            "url": "https://files.pythonhosted.org/packages/44/37/dd9a085a7e3e8075ad3281c1f30b341b0dc83f4130970c08611b2e03cecb/skclust-2025.7.26.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-02 02:05:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jolespin",
    "github_project": "skclust",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.19.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.7.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.3.0"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    ">=",
                    "0.11.0"
                ]
            ]
        },
        {
            "name": "networkx",
            "specs": [
                [
                    ">=",
                    "2.6.0"
                ]
            ]
        }
    ],
    "lcname": "skclust"
}
        