<p align="center">
<a href="https://github.com/astral-sh/ruff"><img alt="Linter: Ruff" src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json"></a>
<a href="https://github.com/tahoebio/tahoe-x1/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green.svg">
</a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>
<br />
# Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters
[📄 Paper](http://www.tahoebio.ai/news/tahoe-x1) | [🤗 HuggingFace](https://huggingface.co/tahoebio/Tahoe-x1) | [🚀 Getting Started](#installation) | [🧑‍🏫 Tutorials](tutorials/)
**Tahoe-x1 (Tx1)** is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics.
Tx1 is pretrained on large-scale single-cell transcriptomic datasets including the _Tahoe-100M_ perturbation compendium, and fine-tuned for cancer-relevant tasks.
Through architectural optimizations and efficient training strategies, Tx1 achieves 3–30× higher compute efficiency than prior implementations while delivering state-of-the-art performance across disease-relevant benchmarks.
<picture>
<source media="(prefers-color-scheme: dark)" srcset="./assets/abstract_logo_dark_mode.png">
<source media="(prefers-color-scheme: light)" srcset="./assets/abstract_logo_light_mode.png">
<img src="./assets/abstract_logo_light_mode.png" alt="Abstract Logo">
</picture>
## Table of Contents
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [System Requirements & Training Capabilities](#system-requirements--training-capabilities)
- [Datasets](#datasets)
- [Pre-trained Models](#pre-trained-models)
- [Training and Fine-tuning](#training-and-fine-tuning)
- [Generating Cell and Gene Embeddings](#generating-cell-and-gene-embeddings)
- [Tutorials and Benchmarks](#tutorials-and-benchmarks)
- [Troubleshooting](#troubleshooting)
- [Acknowledgements](#acknowledgements)
## Repository Structure
This repository follows a similar structure to [llm-foundry](https://github.com/mosaicml/llm-foundry/tree/main) and imports several utility functions from it.
```
tahoe-x1/
├── tahoex/                          # Core Tahoe-x1 library
│   ├── model/
│   │   ├── blocks/                  # Building block modules used across models
│   │   └── model/                   # Full architecture subclassed from ComposerModel
│   ├── tasks/                       # Helper functions for downstream tasks
│   ├── tokenizer/                   # Vocabulary building and tokenization functions
│   ├── data/                        # Data loaders and collators
│   └── utils/                       # Utility functions
├── scripts/
│   ├── train.py                     # Training script
│   ├── prepare_for_inference.py     # Prepares model for inference
│   ├── depmap/                      # DepMap benchmark scripts
│   ├── msigdb/                      # MSigDB pathway benchmark scripts
│   ├── state_transition/            # State transition prediction scripts
│   ├── data_prep/                   # Dataset preparation scripts
│   └── inference/                   # Inference utilities
├── tutorials/                       # Jupyter notebook tutorials
│   ├── clustering_tutorial.ipynb    # Cell clustering and UMAP visualization
│   └── training_tutorial.ipynb      # Training walkthrough
└── configs/
    ├── runai/                       # RunAI configuration files
    ├── mcli/                        # MosaicML platform configuration files
    ├── gcloud/                      # Google Cloud configuration files
    └── test_run.yaml                # Sample config file
```
## Installation
Docker installation provides better reproducibility and avoids dependency conflicts.
### Docker Installation (Recommended)
```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1
# Pull the latest Docker image with all the dependencies pre-installed
docker pull ghcr.io/tahoebio/tahoe-x1:latest
# Start an interactive container with GPU support
# Note that nvidia-container-toolkit is required for this step
docker run -it --rm \
  --gpus all \
  -v "$(pwd)":/workspace \
  -w /workspace \
  ghcr.io/tahoebio/tahoe-x1:latest \
  /bin/bash
# Inside the container, install the Tahoe-x1 package (dependencies are pre-installed)
pip install -e . --no-deps
```
The Docker image ships with all required dependencies, including PyTorch, the CUDA runtime, and flash-attention, for optimal performance.
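To confirm the container can see your GPU and that flash-attention is importable, a quick check helps (a minimal sketch; version strings will vary with the image):

```python
# Minimal environment check inside the container (illustrative; adjust as needed).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

try:
    import flash_attn  # pre-installed in the image
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found -- check that you are running inside the provided image")
```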
### Native Installation (Alternative)
For direct installation without Docker, we recommend using `uv` for dependency management:
```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate virtual environment
uv venv
source .venv/bin/activate
# Install the package with dependencies
uv pip install -e . --no-build-isolation-package flash-attn
```
**Note**: Native installation requires compatible CUDA drivers and may encounter dependency conflicts. Docker installation is recommended for the best experience.
## System Requirements & Training Capabilities
Tahoe-x1 is built natively on [Composer](https://github.com/mosaicml/composer) and [llm-foundry](https://github.com/mosaicml/llm-foundry), inheriting their full suite of large-scale training capabilities:
### Hardware Requirements
- **GPU**: NVIDIA Ampere (A100) or newer for FlashAttention support
- **CUDA**: Version 12.1+
- **Python**: 3.10+
### Advanced Training Features
The codebase leverages Composer's state-of-the-art training stack, configurable via YAML:
- **Automatic micro-batching** for optimal memory utilization
- **Mixed precision training** with BF16/FP16, plus FP8 support on Hopper (H100) and newer GPUs
- **Multi-GPU and multi-node** distributed training with FSDP (Fully Sharded Data Parallel)
- **Gradient accumulation and checkpointing** for training larger models on limited hardware
- **Advanced optimizers and schedulers** from the LLM training ecosystem
- **Streaming datasets** for efficient data loading at scale
This infrastructure supports training models from 70M to 3B parameters and can scale to larger architectures.
### Docker Images
We provide pre-built Docker images for ease of use:
| Image Name | Base Image | Description |
|------------|------------|-------------|
| [`ghcr.io/tahoebio/tahoe-x1:1.0.0`](https://github.com/tahoebio/tahoe-x1/pkgs/container/tahoe-x1) | `mosaicml/llm-foundry:2.2.1_cu121_flash2-813d596` | Current release image for Tahoe-x1 |
## Datasets
Tx1 was pretrained on 266 million single-cell profiles from three major sources. The following datasets were used for training and evaluation:
| Dataset | Description | Usage | Location |
|---------|-------------|-------|----------|
| **CellxGene 2025-01** | ~61M cells from Jan 2025 CellxGene release | Tx1-3B stage 1 Pre-training | `s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS/` |
| **scBaseCamp 2025-02** | ~112M cells from Feb 2025 scBaseCamp release | Tx1-3B stage 1 Pre-training | `s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2/` |
| **Tahoe 100M** | ~96M cells from Tahoe-100M | Tx1-3B stage 1 Pre-training | `s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2/` |
| **filtered CellxGene 2025-01** | ~43M filtered cells from Jan 2025 CellxGene release | Tx1-3B stage 2 Pre-training | `s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS_filtered/` |
| **filtered scBaseCamp 2025-02** | ~76M filtered cells from Feb 2025 scBaseCamp release | Tx1-3B stage 2 Pre-training | `s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2_filtered/` |
| **filtered Tahoe 100M** | ~34M filtered cells from Tahoe-100M | Tx1-3B stage 2 Pre-training | `s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2_filtered/` |
| **DepMap** | Cancer cell line dependency data | DepMap Benchmark | `s3://tahoe-hackathon-data/MFM/benchmarks/depmap/` |
| **MSigDB** | Pathway signature data | MSigDB Benchmark | `s3://tahoe-hackathon-data/MFM/benchmarks/msigdb/` |
Filtered versions of the pre-training datasets above exclude cells with very few expressed genes and are used for stage 2 pre-training of Tx1-3B.
Public access to datasets: `s3://tahoe-hackathon-data/MFM/benchmarks/`
If you require access to datasets not available in the public bucket, please open a GitHub issue or contact the team.
For more information on dataset preparation, see [scripts/data_prep/README.md](scripts/data_prep/README.md).
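The public bucket can be browsed without AWS credentials by using unsigned requests. As a minimal sketch with `boto3` (bucket and prefix taken from the table above):

```python
# List the public benchmark data with unsigned (anonymous) S3 requests.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="tahoe-hackathon-data",
    Prefix="MFM/benchmarks/",
    MaxKeys=20,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```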
## Pre-trained Models
We provide pre-trained Tahoe-x1 models of various sizes:
| Model Name | Parameters | Context Length | Checkpoint Path | WandB ID | Config File |
|------------|------------|----------------|-----------------|----------|-------------|
| **Tx1-3B** | 3B | 2048 | `s3://tahoe-hackathon-data/MFM/ckpts/3b/` | [mygjkq5c](https://wandb.ai/vevotx/tahoe-x1/runs/mygjkq5c) | `./configs/mcli/tahoex-3b-v2-cont-train.yaml` |
| **Tx1-1.3B** | 1.3B | 2048 | `s3://tahoe-hackathon-data/MFM/ckpts/1b/` | [26iormxc](https://wandb.ai/vevotx/tahoe-x1/runs/26iormxc) | `./configs/gcloud/tahoex-1_3b-merged.yaml` |
| **Tx1-70M** | 70M | 1024 | `s3://tahoe-hackathon-data/MFM/ckpts/70m/` | [ftb65le8](https://wandb.ai/vevotx/tahoe-x1/runs/ftb65le8) | `./configs/gcloud/tahoex-70m-merged.yaml` |
Model weights are also available as safetensors files on our [🤗 Hugging Face model card](https://huggingface.co/tahoebio/Tahoe-x1).
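Weights can also be fetched programmatically with `huggingface_hub`. The sketch below assumes one sub-folder per model size in the Hub repository, so adjust the pattern to the actual layout:

```python
# Download Tahoe-x1 weights from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="tahoebio/Tahoe-x1",
    allow_patterns=["70m/*"],  # assumed per-size sub-folder; adjust to the repo layout
)
print("Files downloaded to:", local_dir)
```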
## Training and Fine-tuning
### Training from Scratch
A sample test configuration is available at `configs/test_run.yaml` for quick experimentation.
Use the main training script with a YAML configuration file:
```bash
composer scripts/train.py -f configs/test_run.yaml
```
Or with command-line arguments:
```bash
composer scripts/train.py \
  --model_name tahoex \
  --data_path /path/to/data \
  --max_seq_len 2048 \
  --batch_size 32
```
Note that the current codebase only supports `attn_impl: flash` and `use_attn_mask: False`. The Triton backend and custom attention masks (used for training Tx1-1B and Tx1-70M) are no longer supported. If you have questions about using custom attention masks with the Triton backend, please contact us.
### Fine-tuning
To fine-tune a pre-trained model on your own data:
1. Download a pre-trained checkpoint from S3
2. Modify the training configuration to load from checkpoint
3. Prepare your dataset in the MDS format (see [scripts/data_prep/README.md](scripts/data_prep/README.md))
4. Launch training with the `--load_path` argument
```bash
python scripts/train.py \
  -f configs/finetune_config.yaml \
  --load_path s3://path/to/checkpoint
```
### Launching runs on different platforms
For launching runs on specific platforms such as MosaicML, Google Cloud, or RunAI, refer to the corresponding configuration folders under `configs/` and their respective README files.
### Preparing Models for Inference
Package a trained model with its vocabulary and metadata:
```bash
python scripts/prepare_for_inference.py \
  --model_path /path/to/checkpoint \
  --vocab_path /path/to/vocab.json \
  --output_path /path/to/inference_model
```
## Generating Cell and Gene Embeddings
### Quick Start with Inference Script
Extract cell embeddings from an AnnData object:
```python
from omegaconf import OmegaConf as om
from scripts.inference.predict_embeddings import predict_embeddings
cfg = {
    "model_name": "Tx1-70m",
    "paths": {
        "hf_repo_id": "tahoebio/Tahoe-x1",
        "hf_model_size": "70m",
        "adata_input": "/path/to/your_data.h5ad",
    },
    "data": {
        "cell_type_key": "cell_type",
        "gene_id_key": "ensembl_id",
    },
    "predict": {
        "seq_len_dataset": 2048,
        "return_gene_embeddings": False,
    },
}
cfg = om.create(cfg)
adata = predict_embeddings(cfg)
# Access embeddings
cell_embeddings = adata.obsm["Tx1-70m"]
```
### Extracting Gene Embeddings
Set `return_gene_embeddings: True` in the configuration to extract gene-level representations.
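As a minimal sketch, continuing from the cell-embedding example above (where the gene-level matrices end up on the returned AnnData is not documented here, so inspect the result):

```python
# Continuing from the cell-embedding example above: request gene-level outputs as well.
cfg.predict.return_gene_embeddings = True

adata = predict_embeddings(cfg)

# The storage location of the gene-level matrices on the returned AnnData is not
# documented here, so inspect the object to find them (e.g. in adata.varm).
print(adata)
```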
## Tutorials and Benchmarks
### Tutorials
- **[Clustering Tutorial](tutorials/clustering_tutorial.ipynb)**: Cell clustering and UMAP visualization
- **[Training Tutorial](tutorials/training_tutorial.ipynb)**: Step-by-step guide to training Tahoe-x1 models
### Benchmarks
Tx1 achieves state-of-the-art performance across disease-relevant benchmarks. See our [preprint](http://www.tahoebio.ai/news/tahoe-x1) for detailed results.
| Benchmark | Task | Code Location |
|-----------|------|---------------|
| **DepMap Essentiality** | Predict broad and context-specific gene dependencies | [`scripts/depmap/`](scripts/depmap/) |
| **MSigDB Hallmarks** | Recover 50 hallmark pathway memberships from gene embeddings | [`scripts/msigdb/`](scripts/msigdb/) |
| **Cell-Type Classification** | Classify cell types across 5 tissues (Tabula Sapiens 2.0) | [`cz-benchmarks`](https://github.com/chanzuckerberg/cz-benchmarks) |
| **Perturbation Prediction** | Predict transcriptional responses in held-out contexts | [`scripts/state_transition/`](scripts/state_transition/) |
### Additional Resources
- **Data Preparation**: [scripts/data_prep/README.md](scripts/data_prep/README.md)
- **Platform Usage**: [configs/mcli/README.md](configs/mcli/README.md) and [configs/gcloud/README.md](configs/gcloud/README.md)
## Troubleshooting
### Common Issues and Solutions
- **PyTorch/CUDA mismatch**: Ensure PyTorch is installed with the correct CUDA version for your system
- **Docker permission denied**: Run Docker commands with `sudo` or add your user to the docker group
- **OOM (Out of Memory)**: Ensure half-precision and flash-attention are enabled, and set the microbatch size to `auto`
- **S3 access denied**: For public buckets, the code will automatically retry with unsigned requests
For additional help, please open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1) with:
- Your system configuration (OS, GPU, PyTorch version)
- Complete error message and stack trace
- Steps to reproduce the issue
## Acknowledgements
We thank the developers of the following open-source projects:
- **[scGPT](https://github.com/bowang-lab/scGPT/tree/main)**: For inspiring the Tahoe-x1 architecture
- **[llm-foundry](https://github.com/mosaicml/llm-foundry)**: Efficient training infrastructure for large language models
- **[streaming](https://github.com/mosaicml/streaming)**: Fast, efficient dataset streaming
- **[CZ CELLxGENE](https://cellxgene.cziscience.com/)**: Chan Zuckerberg Initiative's single-cell atlas
- **[Arc scBaseCount](https://github.com/ArcInstitute/arc-virtual-cell-atlas/)**: Arc Institute's virtual cell atlas
- **[Arc Institute STATE](https://github.com/arcinstitute/state)**: State Transition model for perturbation prediction
---
For questions, issues, or collaboration inquiries, please open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1) or write to us at [admin@tahoebio.ai](mailto:admin@tahoebio.ai).