tahoex

Name: tahoex
Version: 1.0.4
Home page: https://github.com/tahoebio/tahoe-x1/
Summary: Tahoe-x1: Perturbation-trained single-cell foundation models with up to 3 billion parameters
Upload time: 2025-10-23 05:15:58
Author: Tahoe Therapeutics
Requires Python: >=3.10
License: Apache-2.0
Keywords: single-cell, foundation-model, genomics, transcriptomics, perturbation, machine-learning, deep-learning


<p align="center">
<a href="https://github.com/astral-sh/ruff"><img alt="Linter: Ruff" src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json"></a>
<a href="https://github.com/tahoebio/tahoe-x1/blob/main/LICENSE">
        <img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green.svg">
    </a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>
<br />

# Tahoe-x1: Scaling Perturbation-Trained Single-Cell Foundation Models to 3 Billion Parameters

[📄 Paper](http://www.tahoebio.ai/news/tahoe-x1) | [🤗 HuggingFace](https://huggingface.co/tahoebio/Tahoe-x1) | [🚀 Getting Started](#installation) | [🧑‍🏫 Tutorials](tutorials/)

**Tahoe-x1 (Tx1)** is a family of perturbation-trained single-cell foundation models with up to 3 billion parameters, developed by Tahoe Therapeutics.
Tx1 is pretrained on large-scale single-cell transcriptomic datasets including the _Tahoe-100M_ perturbation compendium, and fine-tuned for cancer-relevant tasks.
Through architectural optimizations and efficient training strategies, Tx1 achieves 3–30× higher compute efficiency than prior implementations while delivering state-of-the-art performance across disease-relevant benchmarks.

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="./assets/abstract_logo_dark_mode.png">
  <source media="(prefers-color-scheme: light)" srcset="./assets/abstract_logo_light_mode.png">
  <img src="./assets/abstract_logo_light_mode.png" alt="Abstract Logo">
</picture>

## Table of Contents
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [System Requirements & Training Capabilities](#system-requirements--training-capabilities)
- [Datasets](#datasets)
- [Pre-trained Models](#pre-trained-models)
- [Training and Fine-tuning](#training-and-fine-tuning)
- [Generating Cell and Gene Embeddings](#generating-cell-and-gene-embeddings)
- [Tutorials and Benchmarks](#tutorials-and-benchmarks)
- [Troubleshooting](#troubleshooting)
- [Acknowledgements](#acknowledgements)

## Repository Structure

This repository follows a similar structure to [llm-foundry](https://github.com/mosaicml/llm-foundry/tree/main) and imports several utility functions from it.

```
tahoe-x1/
├── tahoex/                    # Core Tahoe-x1 library
│   ├── model/
│   │   ├── blocks/           # Building block modules used across models
│   │   └── model/            # Full architecture subclassed from ComposerModel
│   ├── tasks/                # Helper functions for downstream tasks
│   ├── tokenizer/            # Vocabulary building and tokenization functions
│   ├── data/                 # Data loaders and collators
│   └── utils/                # Utility functions
├── scripts/
│   ├── train.py              # Training script
│   ├── prepare_for_inference.py  # Prepares a model for inference
│   ├── depmap/               # DepMap benchmark scripts
│   ├── msigdb/               # MSigDB pathway benchmark scripts
│   ├── state_transition/     # State transition prediction scripts
│   ├── data_prep/            # Dataset preparation scripts
│   └── inference/            # Inference utilities
├── tutorials/                # Jupyter notebook tutorials
│   ├── clustering_tutorial.ipynb  # Cell clustering and UMAP visualization
│   └── training_tutorial.ipynb    # Training walkthrough
└── configs/
    ├── runai/                # RunAI configuration files
    ├── mcli/                 # MosaicML platform configuration files
    ├── gcloud/               # Google Cloud configuration files
    └── test_run.yaml         # Sample config file
```

## Installation

Docker installation provides better reproducibility and avoids dependency conflicts.

### Docker Installation (Recommended)

```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1

# Pull the latest Docker image with all the dependencies pre-installed
docker pull ghcr.io/tahoebio/tahoe-x1:latest

# Start an interactive container with GPU support
# Note that nvidia-container-toolkit is required for this step
docker run -it --rm \
  --gpus all \
  -v "$(pwd)":/workspace \
  -w /workspace \
  ghcr.io/tahoebio/tahoe-x1:latest \
  /bin/bash

# Inside the container, install the Tahoe-x1 package (dependencies are pre-installed)
pip install -e . --no-deps
```

The Docker image ships with all required dependencies pre-installed, including PyTorch, the CUDA runtime, and flash-attention, for optimal performance out of the box.

### Native Installation (Alternative)

For direct installation without Docker, we recommend using `uv` for dependency management:

```bash
# Clone the repository
git clone https://github.com/tahoebio/tahoe-x1.git
cd tahoe-x1

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate

# Install the package with dependencies
uv pip install -e . --no-build-isolation-package flash-attn
```

**Note**: Native installation requires compatible CUDA drivers and may encounter dependency conflicts. Docker installation is recommended for the best experience.
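Either way, a quick sanity check is worth running before launching anything large. The sketch below assumes a CUDA-capable GPU is visible to the container or environment and that the optional `flash-attn` build succeeded:

```python
# Post-install sanity check: confirm PyTorch sees the GPU and that the
# optional flash-attn extension imports cleanly.
import torch

print("torch", torch.__version__, "| CUDA", torch.version.cuda)
assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"

try:
    import flash_attn  # required for attn_impl: flash
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; flash attention kernels will be unavailable")

import tahoex  # the package installed above
print("tahoex imported OK")
```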





## System Requirements & Training Capabilities

Tahoe-x1 is built natively on [Composer](https://github.com/mosaicml/composer) and [llm-foundry](https://github.com/mosaicml/llm-foundry), inheriting their full suite of large-scale training capabilities:

### Hardware Requirements
- **GPU**: NVIDIA Ampere (A100) or newer for FlashAttention support
- **CUDA**: Version 12.1+
- **Python**: 3.10+

### Advanced Training Features
The codebase leverages Composer's state-of-the-art training stack, configurable via YAML:
- **Automatic micro-batching** for optimal memory utilization
- **Mixed precision training** with BF16/FP16, plus FP8 support on Hopper (H100) and newer GPUs
- **Multi-GPU and multi-node** distributed training with FSDP (Fully Sharded Data Parallel)
- **Gradient accumulation and checkpointing** for training larger models on limited hardware
- **Advanced optimizers and schedulers** from the LLM training ecosystem
- **Streaming datasets** for efficient data loading at scale

This infrastructure supports training models from 70M to 3B parameters and can scale to larger architectures.

### Docker Images

We provide pre-built Docker images for ease of use:

| Image Name | Base Image | Description |
|------------|------------|-------------|
| [`ghcr.io/tahoebio/tahoe-x1:1.0.0`](https://github.com/tahoebio/tahoe-x1/pkgs/container/tahoe-x1) | `mosaicml/llm-foundry:2.2.1_cu121_flash2-813d596` | Current release image for Tahoe-x1 |

## Datasets

Tx1 was pretrained on 266 million single-cell profiles from three major sources. The following datasets were used for training and evaluation:

| Dataset | Description | Usage | Location |
|---------|-------------|-------|----------|
| **CellxGene 2025-01** | ~61M cells from Jan 2025 CellxGene release | Tx1-3B stage 1 pre-training | `s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS/` |
| **scBaseCamp 2025-02** | ~112M cells from Feb 2025 scBaseCamp release | Tx1-3B stage 1 pre-training | `s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2/` |
| **Tahoe 100M** | ~96M cells from Tahoe-100M | Tx1-3B stage 1 pre-training | `s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2/` |
| **Filtered CellxGene 2025-01** | ~43M filtered cells from Jan 2025 CellxGene release | Tx1-3B stage 2 pre-training | `s3://tahoe-hackathon-data/MFM/cellxgene_2025_01_21_merged_MDS_filtered/` |
| **Filtered scBaseCamp 2025-02** | ~76M filtered cells from Feb 2025 scBaseCamp release | Tx1-3B stage 2 pre-training | `s3://tahoe-hackathon-data/MFM/scbasecamp_2025_02_25_MDS_v2_filtered/` |
| **Filtered Tahoe 100M** | ~34M filtered cells from Tahoe-100M | Tx1-3B stage 2 pre-training | `s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2_filtered/` |
| **DepMap** | Cancer cell line dependency data | DepMap benchmark | `s3://tahoe-hackathon-data/MFM/benchmarks/depmap/` |
| **MSigDB** | Pathway signature data | MSigDB benchmark | `s3://tahoe-hackathon-data/MFM/benchmarks/msigdb/` |

Filtered versions of the pre-training datasets above exclude cells with very few expressed genes and are used for stage 2 pre-training of Tx1-3B.

The benchmark datasets are publicly readable (no AWS credentials required) at `s3://tahoe-hackathon-data/MFM/benchmarks/`.

If you require access to datasets not available in the public bucket, please open a GitHub issue or contact the team.

For more information on dataset preparation, see [scripts/data_prep/README.md](scripts/data_prep/README.md).
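Because the shards are written in MDS format, they can also be inspected directly with the `streaming` library. The sketch below is illustrative only: the remote path is taken from the table above, access outside the public `benchmarks/` prefix may require AWS credentials, and the per-sample field names depend on how the shards were written, so print the keys rather than assuming a schema.

```python
# Illustrative sketch: stream a few records from an MDS dataset.
# Field names are not guaranteed; inspect sample.keys() to discover the schema.
from streaming import StreamingDataset

ds = StreamingDataset(
    remote="s3://tahoe-hackathon-data/MFM/tahoe_100m_MDS_v2/",  # from the table above
    local="/tmp/tahoe_100m_mds",  # local shard cache directory
    shuffle=False,
)
print("num samples:", len(ds))
for i, sample in zip(range(3), ds):
    print(i, sorted(sample.keys()))
```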



## Pre-trained Models

We provide pre-trained Tahoe-x1 models of various sizes:

| Model Name | Parameters | Context Length | Checkpoint Path | WandB ID | Config File |
|------------|------------|----------------|-----------------|----------|-------------|
| **Tx1-3B** | 3B | 2048  | `s3://tahoe-hackathon-data/MFM/ckpts/3b/` | [mygjkq5c](https://wandb.ai/vevotx/tahoe-x1/runs/mygjkq5c) | `./configs/mcli/tahoex-3b-v2-cont-train.yaml` |
| **Tx1-1.3B** | 1.3B | 2048 | `s3://tahoe-hackathon-data/MFM/ckpts/1b/` | [26iormxc](https://wandb.ai/vevotx/tahoe-x1/runs/26iormxc) | `./configs/gcloud/tahoex-1_3b-merged.yaml` |
| **Tx1-70M** | 70M | 1024 | `s3://tahoe-hackathon-data/MFM/ckpts/70m/` | [ftb65le8](https://wandb.ai/vevotx/tahoe-x1/runs/ftb65le8) | `./configs/gcloud/tahoex-70m-merged.yaml` |

Model weights are also available as safetensors files on our [🤗 Hugging Face model card](https://huggingface.co/tahoebio/Tahoe-x1).

## Training and Fine-tuning

### Training from Scratch

A sample test configuration is available at `configs/test_run.yaml` for quick experimentation.

Use the main training script with a YAML configuration file: 

```bash
composer scripts/train.py -f configs/test_run.yaml
```

Or with command-line arguments:

```bash
composer scripts/train.py \
  --model_name tahoex \
  --data_path /path/to/data \
  --max_seq_len 2048 \
  --batch_size 32
```

Note that the current codebase only supports `attn_impl: flash` and `use_attn_mask: False`. The Triton backend and custom attention masks (used for training Tx1-1.3B and Tx1-70M) are no longer supported. If you have questions about using custom attention masks with the Triton backend, please contact us.
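As a reference point, the relevant keys look roughly like this (a minimal OmegaConf sketch; the exact nesting is an assumption, so treat `configs/test_run.yaml` as the source of truth):

```python
# Hedged sketch of the attention settings the current codebase supports.
# Key nesting is illustrative; configs/test_run.yaml is authoritative.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "model": {
        "attn_impl": "flash",    # the only supported attention backend
        "use_attn_mask": False,  # custom attention masks are unsupported
    },
    "max_seq_len": 2048,
})
print(OmegaConf.to_yaml(cfg))
```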

### Fine-tuning

To fine-tune a pre-trained model on your own data:

1. Download a pre-trained checkpoint from S3
2. Modify the training configuration to load from checkpoint
3. Prepare your dataset in the MDS format (see [scripts/data_prep/README.md](scripts/data_prep/README.md))
4. Launch training with the `--load_path` argument

```bash
python scripts/train.py \
  -f configs/finetune_config.yaml \
  --load_path s3://path/to/checkpoint
```

### Launching runs on different platforms
For launching runs on specific platforms such as MosaicML, Google Cloud, or RunAI, refer to the corresponding configuration folders under `configs/` and their respective README files.

### Preparing Models for Inference

Package a trained model with its vocabulary and metadata:

```bash
python scripts/prepare_for_inference.py \
  --model_path /path/to/checkpoint \
  --vocab_path /path/to/vocab.json \
  --output_path /path/to/inference_model
```


## Generating Cell and Gene Embeddings

### Quick Start with Inference Script

Extract cell embeddings from an AnnData object:

```python
from omegaconf import OmegaConf as om
from scripts.inference.predict_embeddings import predict_embeddings

cfg = {
    "model_name": "Tx1-70m",
    "paths": {
        "hf_repo_id": "tahoebio/Tahoe-x1",
        "hf_model_size": "70m",
        "adata_input": "/path/to/your_data.h5ad",
    },
    "data": {
        "cell_type_key": "cell_type",
        "gene_id_key": "ensembl_id"
    },
    "predict": {
        "seq_len_dataset": 2048,
        "return_gene_embeddings": False,
    }
}

cfg = om.create(cfg)
adata = predict_embeddings(cfg)

# Access embeddings
cell_embeddings = adata.obsm["Tx1-70m"]
```
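The returned matrix plugs directly into a standard scanpy workflow, for example (assuming `scanpy` with the leiden extras is installed; the clustering tutorial covers this end to end):

```python
# Downstream sketch: neighbors/UMAP/Leiden on the Tx1 cell embeddings.
# scanpy is not a tahoex dependency; install it separately.
import scanpy as sc

sc.pp.neighbors(adata, use_rep="Tx1-70m")  # kNN graph on model embeddings
sc.tl.umap(adata)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=["leiden", "cell_type"])
```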

### Extracting Gene Embeddings

Set `return_gene_embeddings: True` in the configuration to extract gene-level representations.
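Continuing the quick-start example above, the toggle can be flipped on the existing config. Where the gene-level matrix is attached to the returned AnnData depends on the `predict_embeddings` implementation, so inspect the object rather than assuming a slot:

```python
# Reuse the cfg from the quick-start example, now also requesting
# gene-level embeddings.
cfg.predict.return_gene_embeddings = True
adata = predict_embeddings(cfg)
print(adata)  # inspect .varm / .uns to locate the gene embeddings
```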


## Tutorials and Benchmarks

### Tutorials
- **[Clustering Tutorial](tutorials/clustering_tutorial.ipynb)**: Cell clustering and UMAP visualization
- **[Training Tutorial](tutorials/training_tutorial.ipynb)**: Step-by-step guide to training Tahoe-x1 models


### Benchmarks

Tx1 achieves state-of-the-art performance across disease-relevant benchmarks. See our [preprint](http://www.tahoebio.ai/news/tahoe-x1) for detailed results.

| Benchmark | Task | Code Location |
|-----------|------|---------------|
| **DepMap Essentiality** | Predict broad and context-specific gene dependencies | [`scripts/depmap/`](scripts/depmap/) |
| **MSigDB Hallmarks** | Recover 50 hallmark pathway memberships from gene embeddings | [`scripts/msigdb/`](scripts/msigdb/) |
| **Cell-Type Classification** | Classify cell types across 5 tissues (Tabula Sapiens 2.0) | [`cz-benchmarks`](https://github.com/chanzuckerberg/cz-benchmarks) |
| **Perturbation Prediction** | Predict transcriptional responses in held-out contexts | [`scripts/state_transition/`](scripts/state_transition/) |


### Additional Resources
- **Data Preparation**: [scripts/data_prep/README.md](scripts/data_prep/README.md)
- **Platform Usage**: [configs/mcli/README.md](configs/mcli/README.md) and [configs/gcloud/README.md](configs/gcloud/README.md)

## Troubleshooting

### Common Issues and Solutions

- **PyTorch/CUDA mismatch**: Ensure PyTorch is installed with the correct CUDA version for your system
- **Docker permission denied**: Run Docker commands with `sudo` or add your user to the docker group
- **OOM (Out of Memory)**: Ensure half-precision and flash-attention are enabled, and set the microbatch size to `auto`
- **S3 access denied**: For public buckets, the code will automatically retry with unsigned requests (see the sketch below)
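If a client-side tool does not perform that unsigned retry for you, anonymous access can be forced explicitly. A boto3 sketch (works only on the public prefixes listed in the Datasets section):

```python
# List objects in the public benchmarks prefix without AWS credentials.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="tahoe-hackathon-data",
    Prefix="MFM/benchmarks/",
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```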


For additional help, please open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1) with:
- Your system configuration (OS, GPU, PyTorch version)
- Complete error message and stack trace
- Steps to reproduce the issue

## Acknowledgements
We thank the developers of the following open-source projects:
- **[scGPT](https://github.com/bowang-lab/scGPT/tree/main)**: Inspiration for the Tahoe-x1 architecture
- **[llm-foundry](https://github.com/mosaicml/llm-foundry)**: Efficient training infrastructure for large language models
- **[streaming](https://github.com/mosaicml/streaming)**: Fast, efficient dataset streaming
- **[CZ CELLxGENE](https://cellxgene.cziscience.com/)**: Chan Zuckerberg Initiative's single-cell atlas
- **[Arc scBaseCount](https://github.com/ArcInstitute/arc-virtual-cell-atlas/)**: Arc Institute's virtual cell atlas
- **[Arc Institute STATE](https://github.com/arcinstitute/state)**: State Transition model for perturbation prediction

---

For questions, issues, or collaboration inquiries, please open an issue on [GitHub](https://github.com/tahoebio/tahoe-x1) or write to us at [admin@tahoebio.ai](mailto:admin@tahoebio.ai).

            
