<div align="center">
<img src="website/public/images/logo.png" alt="HoneyBee Logo" width="200">
# HoneyBee
**A Scalable Modular Framework for Multimodal AI in Oncology**
[Documentation](https://lab-rasool.github.io/HoneyBee/) | [Paper](https://arxiv.org/abs/2405.07460) | [Examples](examples/) | [Demo](app.py) | [Google Colab](https://colab.research.google.com/)
</div>
## 🚀 Overview
HoneyBee is a multimodal AI framework for oncology research and clinical applications. It integrates and processes four kinds of medical data (clinical text, radiology images, pathology slides, and molecular data) through a unified, modular architecture. It is built to scale and to be extended, so researchers can develop AI models for cancer diagnosis, prognosis, and treatment planning.
> [!WARNING]
> **Alpha Release**: This framework is currently in alpha. APIs may change, and some features are still under development.
## ✨ Key Features
### 🏗️ Modular Architecture
- **3-Layer Design**: Clean separation between data loaders, embedding models, and processors
- **Unified API**: Consistent interface across all modalities
- **Extensible**: Easy to add new models and data sources
- **Deployment-Oriented**: Designed for both research and clinical deployment (note that the framework is currently in alpha)
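The three-layer separation described above can be sketched in a few lines. This is an illustration only: `Loader`, `Embedder`, `Processor`, `InMemoryLoader`, and `CharCountEmbedder` are hypothetical names invented for this example and are not HoneyBee's actual classes.

```python
from typing import List, Protocol, Sequence


class Loader(Protocol):
    """Layer 1: turns a source identifier into raw content."""
    def load(self, source: str) -> str: ...


class Embedder(Protocol):
    """Layer 2: maps raw content to fixed-length vectors."""
    def embed(self, items: Sequence[str]) -> List[List[float]]: ...


class InMemoryLoader:
    """Hypothetical loader that treats the source string as the document."""
    def load(self, source: str) -> str:
        return source.strip()


class CharCountEmbedder:
    """Toy embedder standing in for a real model: [length, vowel count]."""
    def embed(self, items: Sequence[str]) -> List[List[float]]:
        return [
            [float(len(t)), float(sum(c in "aeiou" for c in t.lower()))]
            for t in items
        ]


class Processor:
    """Layer 3: wires a loader and an embedder into one pipeline."""
    def __init__(self, loader: Loader, embedder: Embedder) -> None:
        self.loader = loader
        self.embedder = embedder

    def run(self, sources: Sequence[str]) -> List[List[float]]:
        docs = [self.loader.load(s) for s in sources]
        return self.embedder.embed(docs)


pipeline = Processor(InMemoryLoader(), CharCountEmbedder())
vectors = pipeline.run([" tumor "])  # [[5.0, 2.0]]
```

Because the processor depends only on the two protocols, swapping in a different loader or embedding model does not require changing the pipeline code.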
### 📊 Comprehensive Data Support
#### Medical Imaging
- **Pathology**: Whole Slide Images (WSI) - SVS, TIFF formats with tissue detection
- **Radiology**: DICOM, NIfTI processing with 3D support
- **Preprocessing**: Advanced augmentation and normalization pipelines
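The README does not say how tissue detection is implemented. One common approach is Otsu thresholding on a grayscale thumbnail, sketched here in plain Python under that assumption. The function names are hypothetical, and the input is assumed to be a flat list of 8-bit grayscale values where tissue is darker than the slide background.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of gray values at or below the candidate threshold
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def tissue_mask(pixels):
    """Mark pixels at or below the Otsu threshold as tissue (True)."""
    t = otsu_threshold(pixels)
    return [p <= t for p in pixels]


# Synthetic example: 50 dark "tissue" pixels and 50 bright background pixels.
mask = tissue_mask([10] * 50 + [200] * 50)
```

In practice the thumbnail would come from a slide reader and the mask would select which tiles to embed.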
#### Clinical Text
- **Document Processing**: PDF support with OCR for scanned documents
- **NLP Pipeline**: Cancer entity extraction, temporal parsing, medical ontology integration
- **Database Integration**: Native [MINDS](https://github.com/lab-rasool/MINDS) format support
- **Long Document Handling**: Multiple tokenization strategies for clinical notes
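Clinical notes often exceed a language model's context window. One of the simplest strategies for long documents is a sliding window with overlap, sketched below. `chunk_tokens` and its parameters are assumptions for illustration; the tokenization strategies HoneyBee actually ships are not specified here.

```python
def chunk_tokens(tokens, max_len=512, overlap=128):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    window boundary still appears with context in the next window.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")

    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return chunks


note = "patient presents with stage II invasive ductal carcinoma".split()
windows = chunk_tokens(note, max_len=4, overlap=2)
```

Each window can then be embedded separately and the resulting vectors pooled into a single document embedding.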
#### Molecular Data
- **Genomics**: Support for expression data and mutation profiles
- **Integration**: Seamless combination with imaging and clinical data
### 🧠 State-of-the-Art Embedding Models
#### Clinical Text Embeddings
- **GatorTron**: Domain-specific clinical language model
- **BioBERT**: Biomedical text understanding
- **PubMedBERT**: Scientific literature embeddings
- **Clinical-T5**: Text-to-text clinical transformers
#### Medical Image Embeddings
- **REMEDIS**: Self-supervised medical image representations
- **RadImageNet**: Pre-trained radiological feature extractors
- **UNI**: General-purpose foundation model for histopathology images
- **Custom Models**: Easy integration of proprietary models
### 🛠️ Advanced Capabilities
#### Multimodal Integration
- **Cross-Modal Learning**: Unified representations across modalities
- **Attention Mechanisms**: Interpretable fusion strategies
- **Patient-Level Aggregation**: Comprehensive patient profiles
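As a rough illustration of patient-level aggregation, the sketch below mean-pools the embeddings of each modality and concatenates the results into one vector per patient. Mean pooling and concatenation are assumptions chosen for simplicity; `mean_pool` and `patient_profile` are hypothetical helpers, not HoneyBee functions.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def patient_profile(embeddings_by_modality):
    """Concatenate per-modality mean embeddings in sorted modality order."""
    profile = []
    for modality in sorted(embeddings_by_modality):
        profile.extend(mean_pool(embeddings_by_modality[modality]))
    return profile


profile = patient_profile({
    "text": [[1.0, 1.0]],                # one clinical note
    "image": [[0.0, 2.0], [2.0, 4.0]],   # two image tiles
})
# profile == [1.0, 3.0, 1.0, 1.0]
```

Sorting the modality names keeps the layout of the concatenated vector stable across patients.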
#### Analysis Tools
- **Survival Analysis**: Cox PH, Random Survival Forest, DeepSurv
- **Classification**: Multi-class cancer type prediction
- **Retrieval**: Similar patient identification
- **Visualization**: Interactive t-SNE dashboards
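Similar-patient retrieval over embeddings is commonly done by ranking a cohort by cosine similarity to a query vector. The sketch below shows that idea with the standard library only. The function names and the toy cohort are made up for illustration and do not reflect HoneyBee's retrieval API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, cohort, k=3):
    """Return the k (patient_id, score) pairs most similar to the query."""
    scored = [(pid, cosine(query, vec)) for pid, vec in cohort.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


cohort = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
matches = top_k_similar([1.0, 0.0], cohort, k=2)
# Patient "A" ranks first, followed by "C".
```

For large cohorts, a vectorized or approximate nearest-neighbor index would replace the linear scan.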
#### Clinical Applications
- **Risk Stratification**: Patient outcome prediction
- **Treatment Planning**: Personalized therapy recommendations
- **Biomarker Discovery**: Multi-omic pattern identification
## 🚀 Quick Start
### Prerequisites
- Python 3.8+
- PyTorch 2.0+
- CUDA 11.7+ (optional, for GPU acceleration)
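A short script can confirm these prerequisites before installation. This is a minimal sketch: `parse_version`, `meets_minimum`, and `check_prerequisites` are helper names introduced here, and the script only checks the Python and PyTorch versions and whether a CUDA device is visible, not the CUDA toolkit version.

```python
import re
import sys


def parse_version(text):
    """Extract (major, minor) from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(text, minimum):
    """True if the version string is at least the (major, minor) minimum."""
    return parse_version(text) >= tuple(minimum)


def check_prerequisites():
    """Report which of the listed prerequisites are satisfied."""
    report = {"python": sys.version_info[:2] >= (3, 8)}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
    else:
        report["torch"] = meets_minimum(torch.__version__, (2, 0))
        report["cuda"] = torch.cuda.is_available()  # optional, GPU only
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```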
### System Dependencies
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y openslide-tools tesseract-ocr
# macOS
brew install openslide tesseract
# Windows
# Install from official websites:
# - OpenSlide: https://openslide.org/download/
# - Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
```
### Installation
```bash
# Clone the repository
git clone https://github.com/lab-rasool/HoneyBee.git
cd HoneyBee
# Install dependencies
pip install -r requirements.txt
# Download required NLTK data
python -c "import nltk; nltk.download('punkt')"
# Install HoneyBee in development mode
pip install -e .
```
### Environment Setup
Create a `.env` file in the project root:
```bash
# MINDS database credentials (if using MINDS format)
HOST=your_server
PORT=5433
DB_USER=postgres
PASSWORD=your_password
DATABASE=minds
# HuggingFace API (for some models)
HF_API_KEY=your_huggingface_api_key
```
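How HoneyBee reads this file internally is not described here. As an illustration, the sketch below parses the same `KEY=value` format with the standard library and collects the database settings into a dictionary. `parse_env` and `db_settings` are hypothetical helpers; many projects use the `python-dotenv` package for this instead. The parser is deliberately minimal and does not handle inline comments or multi-line values.

```python
def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def db_settings(env):
    """Collect the MINDS database fields; raises KeyError if one is missing."""
    return {
        "host": env["HOST"],
        "port": int(env["PORT"]),
        "user": env["DB_USER"],
        "password": env["PASSWORD"],
        "database": env["DATABASE"],
    }


sample = """
# MINDS database credentials (if using MINDS format)
HOST=your_server
PORT=5433
DB_USER=postgres
PASSWORD=your_password
DATABASE=minds
"""
settings = db_settings(parse_env(sample))
```

To read a real file, pass the contents of `.env` (for example `open(".env").read()`) to `parse_env`. Keep the `.env` file out of version control, since it holds credentials.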
## 🔬 Research Applications
HoneyBee has been successfully applied to:
- **Cancer Subtype Classification**: Automated identification of cancer subtypes from multimodal data
- **Survival Prediction**: Risk stratification and outcome prediction for treatment planning
- **Similar Patient Retrieval**: Finding patients with similar clinical profiles for precision medicine
- **Biomarker Discovery**: Identifying multimodal patterns associated with treatment response
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
### Development Setup
```bash
# Fork and clone your fork
git clone https://github.com/YOUR_USERNAME/HoneyBee.git
cd HoneyBee
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -r requirements.txt
pip install -e .
```
## 🐛 Known Issues & Limitations
- **Alpha Status**: Some features are still under development
- **Memory Requirements**: WSI processing requires significant RAM (16GB+ recommended)
- **GPU Recommended**: While CPU fallback exists, GPU acceleration significantly improves performance
- **Limited Test Coverage**: Comprehensive test suite is planned for future releases
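The GPU recommendation above usually translates into a device-selection step with a CPU fallback. The sketch below shows one way to write it. `pick_device` is a hypothetical helper rather than part of HoneyBee, and it also falls back to the CPU when PyTorch is not installed.

```python
def pick_device(prefer_gpu=True):
    """Return 'cuda' when a CUDA device is usable, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch missing: nothing to accelerate
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Running on: {device}")
```

The returned string can be passed to PyTorch calls that accept a device, such as `model.to(device)`.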
## 📜 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 📝 Citation
If you use HoneyBee in your research, please cite our paper:
```bibtex
@article{tripathi2024honeybee,
title={HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models},
author={Aakash Tripathi and Asim Waqas and Yasin Yilmaz and Ghulam Rasool},
journal={arXiv preprint arXiv:2405.07460},
year={2024},
eprint={2405.07460},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
---
<div align="center">
Made with ❤️ by the <a href="https://github.com/lab-rasool">Lab Rasool</a> team
</div>
Raw data
{
"_id": null,
"home_page": null,
"name": "honeybee-ml",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Aakash Tripathi <aakash.tripathi@moffitt.org>",
"keywords": "multimodal AI, oncology, cancer research, medical imaging, clinical NLP, machine learning, pathology, radiology, biomedical, healthcare",
"author": null,
"author_email": "Aakash Tripathi <aakash.tripathi@moffitt.org>, Lab Rasool <ghulam.rasool@moffitt.org>",
"download_url": "https://files.pythonhosted.org/packages/50/80/d15659ee2b1f83fc67c2b610afcda7c99e8456fbb77f1bb9b1c685830fdf/honeybee_ml-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A Scalable Modular Framework for Multimodal AI in Oncology",
"version": "0.1.0",
"project_urls": {
"Bug Tracker": "https://github.com/lab-rasool/HoneyBee/issues",
"Documentation": "https://lab-rasool.github.io/HoneyBee/",
"Homepage": "https://github.com/lab-rasool/HoneyBee",
"Paper": "https://arxiv.org/abs/2405.07460",
"Repository": "https://github.com/lab-rasool/HoneyBee"
},
"split_keywords": [
"multimodal ai",
" oncology",
" cancer research",
" medical imaging",
" clinical nlp",
" machine learning",
" pathology",
" radiology",
" biomedical",
" healthcare"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3269992dffcf350049039cabb3cf2efaa8ca7a77c229a71314b66ee483a9e99b",
"md5": "fef9c27d74f737c17b8c328bcd4cf15d",
"sha256": "d1c12de10edd3987aa9d56f5a616b992d8717d982a00f47704e34bbba9d0eb05"
},
"downloads": -1,
"filename": "honeybee_ml-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fef9c27d74f737c17b8c328bcd4cf15d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 97486,
"upload_time": "2025-10-14T02:14:12",
"upload_time_iso_8601": "2025-10-14T02:14:12.260755Z",
"url": "https://files.pythonhosted.org/packages/32/69/992dffcf350049039cabb3cf2efaa8ca7a77c229a71314b66ee483a9e99b/honeybee_ml-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "5080d15659ee2b1f83fc67c2b610afcda7c99e8456fbb77f1bb9b1c685830fdf",
"md5": "b882a234bf0764b0330b1cfeb8bcf567",
"sha256": "bbfed4c984dd2210967b74ee24998f2393fc6f0db6dcf3acd671b6707efac709"
},
"downloads": -1,
"filename": "honeybee_ml-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "b882a234bf0764b0330b1cfeb8bcf567",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 89202,
"upload_time": "2025-10-14T02:14:14",
"upload_time_iso_8601": "2025-10-14T02:14:14.837254Z",
"url": "https://files.pythonhosted.org/packages/50/80/d15659ee2b1f83fc67c2b610afcda7c99e8456fbb77f1bb9b1c685830fdf/honeybee_ml-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-14 02:14:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "lab-rasool",
"github_project": "HoneyBee",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "ipykernel",
"specs": []
},
{
"name": "ipywidgets",
"specs": []
},
{
"name": "numpy",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "llama_index",
"specs": []
},
{
"name": "pymongo",
"specs": []
},
{
"name": "transformers",
"specs": []
},
{
"name": "torch",
"specs": []
},
{
"name": "torchvision",
"specs": []
},
{
"name": "torchaudio",
"specs": []
},
{
"name": "accelerate",
"specs": []
},
{
"name": "bitsandbytes",
"specs": []
},
{
"name": "pytesseract",
"specs": []
},
{
"name": "pdf2image",
"specs": []
},
{
"name": "PyPDF2",
"specs": []
},
{
"name": "pyarrow",
"specs": []
},
{
"name": "fastparquet",
"specs": []
},
{
"name": "pydicom",
"specs": []
},
{
"name": "opencv-python",
"specs": []
},
{
"name": "matplotlib",
"specs": []
},
{
"name": "langchain",
"specs": []
},
{
"name": "scikit-image",
"specs": []
},
{
"name": "imageio",
"specs": []
},
{
"name": "albumentations",
"specs": []
},
{
"name": "peft",
"specs": []
},
{
"name": "cucim",
"specs": []
},
{
"name": "openslide-python",
"specs": []
},
{
"name": "colour-science",
"specs": []
},
{
"name": "scipy",
"specs": []
},
{
"name": "pytesseract",
"specs": []
},
{
"name": "onnxruntime",
"specs": []
},
{
"name": "SimpleITK",
"specs": []
},
{
"name": "nibabel",
"specs": []
},
{
"name": "timm",
"specs": []
}
],
"lcname": "honeybee-ml"
}