erica-clustering


- **Name**: erica-clustering
- **Version**: 0.1.3
- **Home page**: https://github.com/shawnshirazi/ERICA_PyPI
- **Summary**: ERICA - Evaluating Replicability via Iterative Clustering Assignments
- **Upload time**: 2025-10-06 05:53:37
- **Maintainer**: None
- **Docs URL**: None
- **Author**: Siamak Sorooshyari, Shawn Shirazi
- **Requires Python**: >=3.8
- **License**: MIT
- **Keywords**: clustering, machine-learning, replicability, data-analysis
- **Requirements**: No requirements were recorded.

# ERICA - Evaluating Replicability via Iterative Clustering Assignments

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/erica-clustering.svg)](https://badge.fury.io/py/erica-clustering)

**ERICA** is a Python library for analyzing clustering replicability using Monte Carlo subsampling. It evaluates how stable cluster assignments remain across repeated subsamples of your data.

## Features

- 🎯 **Multiple Clustering Methods**: Support for K-Means and Agglomerative (Hierarchical) Clustering
- 📊 **Replicability Metrics**: Compute CRI, WCRI, and TWCRI metrics for stability assessment
- 🔄 **Iterative Analysis**: Monte Carlo subsampling for robust evaluation
- 📈 **Interactive Visualization**: Interactive plots built with Plotly
- 🎨 **Optional GUI**: User-friendly Gradio web interface
- 🔬 **Reproducible**: Deterministic mode for scientific reproducibility
- ⚡ **Optimized I/O**: Smart caching for efficient processing

## Quick Start

### Installation

```bash
# Basic installation
pip install erica-clustering

# With plotting support (quotes keep shells such as zsh from expanding the brackets)
pip install "erica-clustering[plots]"

# With GUI support
pip install "erica-clustering[gui]"

# Full installation
pip install "erica-clustering[all]"
```

### Basic Usage

```python
import numpy as np
from erica import ERICA

# Example data: 100 samples × 50 features (replace with your own array)
data = np.random.rand(100, 50)

# Run ERICA analysis
erica = ERICA(
    data=data,
    k_range=[2, 3, 4, 5],
    n_iterations=200,
    method='both'
)

results = erica.run()

# Get results
clam_matrix = erica.get_clam_matrix(k=3)
metrics = erica.get_metrics()

print(f"CRI: {metrics[3]['kmeans']['CRI']:.3f}")
print(f"WCRI: {metrics[3]['kmeans']['WCRI']:.3f}")
print(f"TWCRI: {metrics[3]['kmeans']['TWCRI']:.3f}")

# Visualize results
fig1, fig2 = erica.plot_metrics()
fig1.show()
```

## What is ERICA?

ERICA evaluates clustering stability through:

1. **Iterative Subsampling**: Repeatedly split data into train/test sets
2. **Clustering**: Run clustering algorithms on each subsample
3. **Alignment**: Align cluster identities across iterations
4. **CLAM Matrix Generation**: Track cluster assignments across iterations
5. **Metrics Computation**: Calculate replicability scores
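
The steps above can be sketched with scikit-learn and SciPy. This is a minimal illustration, not ERICA's implementation: the function name `clam_sketch`, the choice to align each run to reference centroids fitted on the full data, the use of the Hungarian algorithm for that alignment, and the layout of the matrix (one row per sample, one column per cluster, entries counting held-out assignments) are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def clam_sketch(data, k, n_iterations=50, train_fraction=0.8, seed=0):
    """Hypothetical helper: count aligned test-set assignments per sample."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    n_train = int(n * train_fraction)

    # Assumption: centroids fitted on the full data define cluster identities.
    reference = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    clam = np.zeros((n, k), dtype=int)

    for i in range(n_iterations):
        # 1. Subsample: random train/test split.
        perm = rng.permutation(n)
        train_idx, test_idx = perm[:n_train], perm[n_train:]

        # 2. Cluster the training subsample.
        model = KMeans(n_clusters=k, n_init=10, random_state=seed + i + 1)
        model.fit(data[train_idx])

        # 3. Align: match this run's centroids to the reference centroids
        #    by minimizing total pairwise distance (Hungarian algorithm).
        cost = np.linalg.norm(
            model.cluster_centers_[:, None, :]
            - reference.cluster_centers_[None, :, :],
            axis=2,
        )
        run_labels, ref_labels = linear_sum_assignment(cost)
        mapping = np.empty(k, dtype=int)
        mapping[run_labels] = ref_labels

        # 4. Record the aligned assignment of each held-out sample.
        aligned = mapping[model.predict(data[test_idx])]
        np.add.at(clam, (test_idx, aligned), 1)

    return clam


# Two well-separated synthetic blobs, 30 samples each.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
clam = clam_sketch(data, k=2, n_iterations=20)
print(clam[:5])
```

Each row of the resulting matrix shows how often a sample landed in each cluster when it was held out; a sample that always lands in the same column is assigned consistently.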

### Replicability Metrics

- **CRI (Clustering Replicability Index)**: Measures how consistently samples are assigned to their primary cluster
- **WCRI (Weighted CRI)**: CRI weighted by cluster sizes
- **TWCRI (Total Weighted CRI)**: Sum of weighted CRI values for overall assessment

**Higher values indicate better replicability.**
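
This README does not give the exact formulas, so the following is only one plausible reading of the three descriptions above; see the Methodology document for the authoritative definitions. It assumes a CLAM matrix of counts (rows are samples, columns are clusters), takes each sample's primary cluster to be its most frequent one, and weights each cluster by its share of samples. The function name `replicability_sketch` is hypothetical.

```python
import numpy as np


def replicability_sketch(clam):
    """Hypothetical helper: per-cluster CRI, per-cluster WCRI, and TWCRI."""
    clam = np.asarray(clam, dtype=float)
    n_samples, k = clam.shape

    totals = clam.sum(axis=1)
    seen = totals > 0                   # skip samples that were never assigned
    primary = clam.argmax(axis=1)       # most frequent cluster per sample

    # Fraction of a sample's assignments that went to its primary cluster.
    consistency = np.zeros(n_samples)
    consistency[seen] = clam[seen].max(axis=1) / totals[seen]

    cri = np.zeros(k)
    weights = np.zeros(k)
    for j in range(k):
        members = seen & (primary == j)
        if members.any():
            cri[j] = consistency[members].mean()
            weights[j] = members.sum() / seen.sum()

    wcri = cri * weights                # assumed: weight = cluster's sample share
    return cri, wcri, wcri.sum()


# Hand-made counts for four samples and two clusters (not ERICA output).
example = [[10, 0], [8, 2], [1, 9], [0, 10]]
cri, wcri, twcri = replicability_sketch(example)
print(cri, wcri, twcri)
```

Under these assumptions a perfectly replicable clustering gives a TWCRI of 1, and any sample that switches clusters between iterations pulls the value down.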

## Advanced Usage

### Using Individual Components

```python
from erica.clustering import kmeans_clustering, iterative_clustering_subsampling
from erica.metrics import compute_metrics_for_clam
from erica.data import prepare_samples_array

# Prepare data (your_data is a placeholder for your own samples × features data)
samples = prepare_samples_array(your_data)

# Perform subsampling
train_size = int(len(samples) * 0.8)
_, indices_folder = iterative_clustering_subsampling(
    samples, len(samples), 200, train_size, './output'
)

# Run clustering
result = kmeans_clustering(
    samples, k=3, n_iterations=200,
    indices_folder=indices_folder,
    output_dir='./output'
)

# Compute metrics
metrics = compute_metrics_for_clam(result['clam_matrix'], k=3)
```

### Custom Plotting

```python
from erica.plotting import plot_metrics, plot_clam_heatmap

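# Plot metrics (k_values, cri_values, wcri_values, and twcri_values are
# placeholders for your own values)
fig = plot_metrics(k_values, cri_values, wcri_values, twcri_values)
fig.show()

# Visualize a CLAM matrix (clam_matrix is a placeholder, for example the
# value returned by erica.get_clam_matrix(k=3))
fig = plot_clam_heatmap(clam_matrix, k=3)
fig.show()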
```

## Documentation

- 📖 [Getting Started Guide](docs/GETTING_STARTED.md)
- 📚 [API Reference](docs/API_REFERENCE.md)
- 🧬 [Methodology](docs/METHODOLOGY.md)
- 📦 [PyPI Publishing Guide](PYPI_GUIDE.md)

## Project Structure

```
erica/
├── __init__.py          # Main package interface
├── core.py              # ERICA main class
├── clustering.py        # Clustering algorithms
├── metrics.py           # Replicability metrics
├── data.py              # Data loading and preparation
├── plotting.py          # Visualization functions
└── utils.py             # Utility functions
```

## Requirements

**Core Dependencies:**
- Python >= 3.8
- NumPy >= 1.21.0
- Pandas >= 1.3.0
- scikit-learn >= 1.0.0
- PyYAML >= 6.0

**Optional Dependencies:**
- Plotly >= 5.0.0 (for plotting)
- Matplotlib >= 3.5.0 (for additional plots)
- Gradio >= 4.0.0 (for GUI)

## Use Cases

### Bioinformatics
- Gene expression clustering analysis
- Single-cell RNA-seq clustering validation
- Patient stratification assessment

### General Machine Learning
- Evaluating clustering stability before downstream analysis
- Comparing different clustering methods objectively
- Selecting optimal k with stability criterion
- Identifying outliers or ambiguous samples
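
As an illustration of the last use case, ambiguous samples could be flagged directly from a CLAM matrix. The sketch below assumes the matrix holds per-sample assignment counts (rows are samples, columns are clusters); the function name `ambiguous_samples` and the 0.8 threshold are hypothetical choices, not part of ERICA.

```python
import numpy as np


def ambiguous_samples(clam, threshold=0.8):
    """Hypothetical helper: indices of samples whose most frequent cluster
    received less than `threshold` of their assignments."""
    clam = np.asarray(clam, dtype=float)
    totals = clam.sum(axis=1)
    seen = totals > 0                   # skip samples that were never assigned
    share = np.zeros(len(clam))
    share[seen] = clam[seen].max(axis=1) / totals[seen]
    return np.flatnonzero(seen & (share < threshold))


# Hand-made counts for four samples and two clusters (not ERICA output).
counts = [[10, 0], [6, 4], [1, 9], [5, 5]]
flagged = ambiguous_samples(counts)
print(flagged)
```

Samples 1 and 3 are flagged here because their assignments are split between the two clusters.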

## Examples

### Example 1: Gene Expression Analysis

```python
from erica import ERICA, load_data
from erica.metrics import find_optimal_k

# Load gene expression data
data = load_data('gene_expression.csv')

# Run ERICA
erica = ERICA(data=data, k_range=[2, 3, 4, 5, 6], n_iterations=200)
results = erica.run()

# Find the k with the best TWCRI
optimal_k, _ = find_optimal_k(erica.get_metrics(), metric_name='TWCRI')
print(f"Recommended number of clusters: {optimal_k}")
```

### Example 2: Method Comparison

```python
from erica import ERICA

# Test K-Means (`data` is your samples × features array, e.g. loaded as in Example 1)
erica_km = ERICA(data=data, k_range=[2, 3, 4], method='kmeans')
results_km = erica_km.run()

# Test Agglomerative
erica_agg = ERICA(data=data, k_range=[2, 3, 4], method='agglomerative')
results_agg = erica_agg.run()

# Compare metrics
metrics_km = erica_km.get_metrics()
metrics_agg = erica_agg.get_metrics()
```
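
To compare the two result sets side by side, the metrics can be flattened into a table. The sketch below assumes the nested layout used in Basic Usage (`metrics[k][method][metric_name]`) and that the agglomerative results are stored under the key `'agglomerative'`; the helper `metrics_table` is hypothetical, and the numbers are placeholders rather than ERICA output.

```python
import pandas as pd


def metrics_table(metrics, method):
    """Hypothetical helper: flatten {k: {method: {metric: value}}} by k."""
    rows = {
        k: by_method[method]
        for k, by_method in metrics.items()
        if method in by_method
    }
    return pd.DataFrame.from_dict(rows, orient="index").rename_axis("k").sort_index()


# Placeholder values for illustration only; they are not ERICA output.
metrics_km = {
    2: {"kmeans": {"CRI": 0.90, "WCRI": 0.45, "TWCRI": 0.90}},
    3: {"kmeans": {"CRI": 0.80, "WCRI": 0.27, "TWCRI": 0.80}},
}
metrics_agg = {
    2: {"agglomerative": {"CRI": 0.85, "WCRI": 0.42, "TWCRI": 0.85}},
    3: {"agglomerative": {"CRI": 0.75, "WCRI": 0.25, "TWCRI": 0.75}},
}

comparison = metrics_table(metrics_km, "kmeans").join(
    metrics_table(metrics_agg, "agglomerative"),
    lsuffix="_kmeans",
    rsuffix="_agglomerative",
)
print(comparison)
```

With real results, replace the placeholder dictionaries with the values returned by `erica_km.get_metrics()` and `erica_agg.get_metrics()`.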

## Performance Tips

### For Large Datasets (n > 10,000)

1. **Reduce iterations**: Use 100 to 200 iterations (for example, `n_iterations=100`) instead of 500
2. **Use K-Means**: Faster than Agglomerative clustering
3. **Optimize I/O**: Enabled by default, keeps memory usage low
4. **Feature selection**: Reduce dimensionality if the number of features d is large

### For Small Datasets (n < 100)

1. **Increase iterations**: Use 300 to 500 iterations (for example, `n_iterations=300`) for more stable estimates
2. **Adjust train/test split**: Use `train_percent=0.7` for larger test sets
3. **Test smaller k range**: Avoid values of k close to the number of samples n

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use ERICA in your research, please cite:

```bibtex
@software{erica2025,
  title = {ERICA: Evaluating Replicability via Iterative Clustering Assignments},
  author = {Sorooshyari, Siamak and Shirazi, Shawn},
  year = {2025},
  url = {https://github.com/shawnshirazi/ERICA_PyPI}
}
```

## Acknowledgments

- Original Monte Carlo Subsampling for Clustering Replicability (MCSS) methodology
- scikit-learn for clustering algorithms
- Plotly for interactive visualizations
- Gradio for web interface capabilities

## Support

- 📧 Email: your.email@example.com
- 🐛 Issues: [GitHub Issues](https://github.com/shawnshirazi/ERICA_PyPI/issues)
- 💬 Discussions: [GitHub Discussions](https://github.com/yourusername/erica-clustering/discussions)

---

**Made with ❤️ for the scientific community**

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/shawnshirazi/ERICA_PyPI",
    "name": "erica-clustering",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "clustering, machine-learning, replicability, data-analysis",
    "author": "Siamak Sorooshyari, Shawn Shirazi",
    "author_email": "Siamak Sorooshyari <siamak.sorooshyari@example.com>, Shawn Shirazi <shawn.shirazi@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/03/8a/5aec2ed46fdb3fb2148154d5343bddbec2b5b8250ffe5680c2e1fbd4cc6b/erica_clustering-0.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "ERICA - Evaluating Replicability via Iterative Clustering Assignments",
    "version": "0.1.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/shawnshirazi/ERICA_PyPI/issues",
        "Documentation": "https://github.com/shawnshirazi/ERICA_PyPI/blob/main/docs/",
        "Homepage": "https://github.com/shawnshirazi/ERICA_PyPI",
        "Repository": "https://github.com/shawnshirazi/ERICA_PyPI"
    },
    "split_keywords": [
        "clustering",
        " machine-learning",
        " replicability",
        " data-analysis"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1085be0a5a86ab8e70cad63afce3bf2b2a368ea89039c020a9585b0523130778",
                "md5": "d4f27bc3d55e69a9d1bd08dddc78d68d",
                "sha256": "5fe582a2ea00bf6a8ab23df0989e2eb31cc6150dc6b85a6a2cd5246549caf2f3"
            },
            "downloads": -1,
            "filename": "erica_clustering-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d4f27bc3d55e69a9d1bd08dddc78d68d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 27084,
            "upload_time": "2025-10-06T05:53:35",
            "upload_time_iso_8601": "2025-10-06T05:53:35.692124Z",
            "url": "https://files.pythonhosted.org/packages/10/85/be0a5a86ab8e70cad63afce3bf2b2a368ea89039c020a9585b0523130778/erica_clustering-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "038a5aec2ed46fdb3fb2148154d5343bddbec2b5b8250ffe5680c2e1fbd4cc6b",
                "md5": "4c889956030845735e1227ac427fcc47",
                "sha256": "ae5e2974c7b58d15ca1d142438c84180477c3da0020133dbca0bd7c971a8ff08"
            },
            "downloads": -1,
            "filename": "erica_clustering-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "4c889956030845735e1227ac427fcc47",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 80481,
            "upload_time": "2025-10-06T05:53:37",
            "upload_time_iso_8601": "2025-10-06T05:53:37.079796Z",
            "url": "https://files.pythonhosted.org/packages/03/8a/5aec2ed46fdb3fb2148154d5343bddbec2b5b8250ffe5680c2e1fbd4cc6b/erica_clustering-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-06 05:53:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "shawnshirazi",
    "github_project": "ERICA_PyPI",
    "github_not_found": true,
    "lcname": "erica-clustering"
}
        