epiagent


| Field | Value |
| --- | --- |
| Name | epiagent |
| Version | 0.0.1 |
| home_page | https://github.com/xy-chen16/EpiAgent |
| Summary | Foundation model for single-cell epigenomic data. |
| upload_time | 2024-12-27 16:09:59 |
| maintainer | None |
| docs_url | None |
| author | Xiaoyang Chen |
| requires_python | >=3.8 |
| license | None |
| requirements | No requirements were recorded. |

# EpiAgent

Large-scale foundation models have recently opened new avenues toward artificial general intelligence. This paradigm has also shown considerable promise for the analysis of single-cell sequencing data, although efforts to date have centered on the transcriptome. Compared with gene expression, chromatin accessibility provides more decisive insights into cell states, shaping the chromatin regulatory landscapes that control transcription in distinct cell types. Modeling such data remains challenging, however, owing to the large number of features, high sparsity, and the quasi-binary nature of the signal. Here, we introduce EpiAgent, the first foundation model for single-cell epigenomic data, pretrained on the large-scale Human-scATAC-Corpus comprising approximately 5 million cells and 35 billion tokens. EpiAgent encodes the chromatin accessibility patterns of cells as concise “cell sentences” and employs bidirectional attention to capture the cellular heterogeneity behind regulatory networks. In comprehensive benchmarks, we demonstrate that EpiAgent excels in typical downstream tasks, including unsupervised feature extraction, supervised cell annotation, and data imputation. By incorporating external embeddings, EpiAgent supports the prediction of cellular responses to both out-of-sample stimulation and unseen genetic perturbations, as well as reference data integration and query data mapping. By simulating the knockout of key cis-regulatory elements, EpiAgent enables in-silico treatment for cancer analysis. We further extend the zero-shot capabilities of EpiAgent, allowing direct cell type annotation on newly sequenced datasets without additional training.

<p align="center">
  <img src="https://github.com/xy-chen16/EpiAgent/blob/main/inst/model.png" width="700" height="385" alt="image">
</p>

---

## Updates / News

- **2024.12.21**: Our preprint was posted on bioRxiv. Read it [here](https://www.biorxiv.org/content/10.1101/2024.12.19.629312v1).
- **2024.12.27**: Source code and Python package released on PyPI under the name `epiagent` (v0.0.1). Install it via `pip install epiagent`.
- **2024.12.28**: Updated GitHub repository with pretrained EpiAgent model and two supervised models for cell type annotation: EpiAgent-B and EpiAgent-NT. Models and example datasets can be downloaded from [Google Drive](https://drive.google.com/drive/folders/1WlNykSCNtZGsUp2oG0dw3cDdVKYDR-iX?usp=sharing). Additionally, we added usage demos for zero-shot applications ([link](https://github.com/xy-chen16/EpiAgent/demo/)).

---

## Installation

### Environment Setup

EpiAgent is built on the **PyTorch 2.0** framework with **FlashAttention v2**. We recommend using **CUDA 11.7** for optimal performance.

#### Step 1: Set up a Python environment

We recommend creating a virtual Python environment with [Anaconda](https://docs.anaconda.com/free/anaconda/install/linux/):

```bash
$ conda create -n EpiAgent python=3.11
$ conda activate EpiAgent
```

#### Step 2: Install PyTorch

Install PyTorch based on your system configuration. Refer to [PyTorch installation instructions](https://pytorch.org/get-started/previous-versions/) for the exact command. For example:

```bash
$ pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 # torch 2.0.1 + cuda 11.7
```

#### Step 3: Install FlashAttention (if not already installed)

Install `flash-attn` by following the instructions below (adapted from the [FlashAttention GitHub repository](https://github.com/Dao-AILab/flash-attention/tree/v2.7.2)):

1. FlashAttention uses `ninja` to compile its C++/CUDA components efficiently. Check whether `ninja` is already installed and working correctly:

```bash
$ ninja --version
$ echo $?
```

If the above commands return a nonzero exit code or you encounter errors, reinstall `ninja` to ensure it works properly:

```bash
$ pip uninstall -y ninja && pip install ninja
```

2. Install FlashAttention:

After ensuring ninja is installed, proceed with the `FlashAttention` installation. Use the following command to install a compatible version:

```bash
$ pip install flash-attn==2.5.8 --no-build-isolation
```
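
The `ninja` check in step 1 can also be scripted, which is handy in setup scripts or CI. The sketch below is not part of EpiAgent or FlashAttention; the helper name `tool_works` is hypothetical, and it only assumes that `ninja --version` exits with code 0 when the tool is healthy, as the shell check above does.

```python
import shutil
import subprocess


def tool_works(executable: str = "ninja") -> bool:
    """Return True if `<executable> --version` runs and exits with code 0."""
    path = shutil.which(executable)
    if path is None:
        return False  # not found on PATH
    try:
        result = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # present but not runnable, or hung
    return result.returncode == 0


if __name__ == "__main__":
    if tool_works("ninja"):
        print("ninja looks usable")
    else:
        print("ninja is missing or broken; reinstall it before building flash-attn")
```

A `False` result corresponds to the nonzero exit code case described above, where reinstalling `ninja` is the suggested fix.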

#### Step 4: Install EpiAgent and dependencies

To install EpiAgent, run:

```bash
$ pip install epiagent
```
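
After the four steps, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are simply those used in the `pip` commands above. It reads package metadata without importing the packages, so it also works on machines without a GPU.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Distribution names as they appear in the pip commands above.
    for name, found in installed_versions(["torch", "flash-attn", "epiagent"]).items():
        print(f"{name}: {found or 'not installed'}")
```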

## Data Preprocessing

EpiAgent uses a unified set of **candidate cis-regulatory elements (cCREs)** as features. We recommend starting from fragment files when preparing input data for EpiAgent. The preprocessing steps are:

1. **Reference Genome Conversion (Optional):**
   - Our cCRE coordinates are based on hg38. If your fragment files use hg19, use `liftOver` to convert them to hg38.

2. **Fragment Overlap Calculation:**
   - Use `bedtools` to calculate overlaps between fragments and cCREs.

3. **Cell-by-cCRE Matrix Construction:**
   - Use `epiagent.preprocessing.construct_cell_by_ccre_matrix` to create the cell-by-cCRE matrix and add metadata.

4. **TF-IDF and Tokenization:**
   - Perform global TF-IDF to assign importance to accessible cCREs, followed by tokenization to generate cell sentences.
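
To make steps 2 and 3 concrete, here is a toy, pure-Python illustration of what counting fragment/cCRE overlaps per cell means. It is not the EpiAgent implementation and does not call `bedtools` or `epiagent.preprocessing.construct_cell_by_ccre_matrix`; the function name `count_overlaps`, the tuple layouts, and the sparse dictionary output are assumptions made for the example. Intervals are treated as BED-style half-open ranges, and the nested loop is only suitable for tiny inputs.

```python
from collections import defaultdict


def count_overlaps(fragments, ccres):
    """Count, per cell barcode, how many fragments overlap each cCRE.

    fragments: iterable of (chrom, start, end, barcode), half-open intervals.
    ccres: list of (chrom, start, end); the list position is the cCRE index.
    Returns {barcode: {ccre_index: count}}, a sparse cell-by-cCRE matrix.
    """
    # Group cCREs by chromosome so each fragment is only compared
    # against elements on its own chromosome.
    by_chrom = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(ccres):
        by_chrom[chrom].append((start, end, idx))

    matrix = defaultdict(lambda: defaultdict(int))
    for chrom, f_start, f_end, barcode in fragments:
        for c_start, c_end, idx in by_chrom.get(chrom, ()):
            # Two half-open intervals overlap when each starts before the other ends.
            if f_start < c_end and c_start < f_end:
                matrix[barcode][idx] += 1
    return {cell: dict(row) for cell, row in matrix.items()}


ccres = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
fragments = [
    ("chr1", 150, 250, "cellA"),  # overlaps cCRE 0
    ("chr1", 190, 510, "cellA"),  # overlaps cCRE 0 and cCRE 1
    ("chr2", 200, 300, "cellB"),  # starts exactly where cCRE 2 ends: no overlap
    ("chr2", 150, 160, "cellB"),  # overlaps cCRE 2
]
print(count_overlaps(fragments, ccres))
```

On real data, `bedtools` performs the overlap step far more efficiently, and the package function builds the actual matrix object.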

For a detailed example, refer to the demo notebook: [Data Preprocessing.ipynb](https://github.com/xy-chen16/EpiAgent/demo/Data%20Preprocessing.ipynb).
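
As a complement to the notebook, the idea behind step 4 can be sketched on toy data. The code below uses a generic TF-IDF formula (within-cell frequency times a log inverse document frequency) and ranks each cell's accessible cCREs by that score to form a "cell sentence" of cCRE indices. Both the exact weighting and the ranking rule are assumptions for illustration; EpiAgent's global TF-IDF and tokenizer may differ, for example by using corpus-wide statistics rather than statistics from the input data. The names `tfidf` and `tokenize` are hypothetical.

```python
import math


def tfidf(matrix):
    """Generic TF-IDF over a sparse {cell: {ccre_index: count}} matrix."""
    n_cells = len(matrix)
    # Document frequency: number of cells in which each cCRE is accessible.
    doc_freq = {}
    for row in matrix.values():
        for idx, count in row.items():
            if count > 0:
                doc_freq[idx] = doc_freq.get(idx, 0) + 1

    scores = {}
    for cell, row in matrix.items():
        total = sum(row.values())
        scores[cell] = {
            idx: (count / total) * math.log(1 + n_cells / doc_freq[idx])
            for idx, count in row.items()
            if count > 0
        }
    return scores


def tokenize(cell_scores, max_length=None):
    """Order accessible cCRE indices by descending score (ties: lower index first)."""
    ranked = sorted(cell_scores, key=lambda idx: (-cell_scores[idx], idx))
    return ranked if max_length is None else ranked[:max_length]


matrix = {
    "cellA": {0: 2, 1: 1, 2: 1},
    "cellB": {0: 1, 3: 1},
}
scores = tfidf(matrix)
sentences = {cell: tokenize(s) for cell, s in scores.items()}
print(sentences)
```

In this toy input, cCRE 0 is accessible in both cells, so it is down-weighted relative to cCRE 3, which is specific to `cellB`; that is the effect TF-IDF is meant to have before tokenization.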

---

## Downstream Analysis

### Feature Extraction
- Pretrained EpiAgent model parameters and example files are available [here](https://drive.google.com/drive/folders/1WlNykSCNtZGsUp2oG0dw3cDdVKYDR-iX?usp=sharing).
- A demo for zero-shot feature extraction is available in [Zero-shot Feature Extraction using EpiAgent.ipynb](https://github.com/xy-chen16/EpiAgent/demo/Zero-shot%20Feature%20Extraction%20using%20EpiAgent.ipynb).

### Direct Cell Type Annotation

Two supervised models, **EpiAgent-B** and **EpiAgent-NT**, are designed for direct cell type annotation. These models and their example datasets can be downloaded [here](https://drive.google.com/drive/folders/1WlNykSCNtZGsUp2oG0dw3cDdVKYDR-iX?usp=sharing). For specific demos:

- Annotating brain cell datasets with **EpiAgent-B**: [Zero-shot annotation using EpiAgent-B.ipynb](https://github.com/xy-chen16/EpiAgent/demo/Zero-shot%20annotation%20using%20EpiAgent-B.ipynb)
- Annotating other tissue datasets with **EpiAgent-NT**: [Zero-shot annotation using EpiAgent-NT.ipynb](https://github.com/xy-chen16/EpiAgent/demo/Zero-shot%20annotation%20using%20EpiAgent-NT.ipynb)

### Other Tasks
- **Data Imputation**
- **Prediction of Cellular Responses to Stimulations and Genetic Perturbations**
- **Reference Data Integration and Query Data Mapping**
- **In-silico Treatment Simulations**

Fine-tuning code and additional demos for these tasks will be added soon.

---

## Citation

If you use EpiAgent in your research, please cite our paper:

Chen X, Li K, Cui X, Wang Z, Jiang Q, Lin J, Li Z, Gao Z, Jiang R. EpiAgent: Foundation model for single-cell epigenomic data. bioRxiv. 2024. doi:10.1101/2024.12.19.629312

---

## Contact

For questions about the paper or code, please email: xychen20@mails.tsinghua.edu.cn


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/xy-chen16/EpiAgent",
    "name": "epiagent",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Xiaoyang Chen",
    "author_email": "xychen20@mails.tsinghua.edu.cn",
    "download_url": "https://files.pythonhosted.org/packages/e4/60/ffb96e80293c948c2f6b10889afac267d00ee4b332a855e7715c8d778b3d/epiagent-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Foundation model for single-cell epigenomic data.",
    "version": "0.0.1",
    "project_urls": {
        "Homepage": "https://github.com/xy-chen16/EpiAgent"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "576d39e25b6f60ce9b3efa9409edd3f4b5fe70b69fc310f49a913e9b2bae4d1b",
                "md5": "c82b35d7f17d82bdcf81cb507adee988",
                "sha256": "5084ed41f1774befd00fc7f5fcd094c2ec3b15506dbf7bd97ec28943043fa3ab"
            },
            "downloads": -1,
            "filename": "epiagent-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c82b35d7f17d82bdcf81cb507adee988",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 15565,
            "upload_time": "2024-12-27T16:09:53",
            "upload_time_iso_8601": "2024-12-27T16:09:53.380153Z",
            "url": "https://files.pythonhosted.org/packages/57/6d/39e25b6f60ce9b3efa9409edd3f4b5fe70b69fc310f49a913e9b2bae4d1b/epiagent-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e460ffb96e80293c948c2f6b10889afac267d00ee4b332a855e7715c8d778b3d",
                "md5": "b288850fe4fa63aba50e571b904243be",
                "sha256": "405250af6f3f35f122f95bd03d5d3b2bf4b4a1f8897bdfeadff5b563b4a03ebb"
            },
            "downloads": -1,
            "filename": "epiagent-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "b288850fe4fa63aba50e571b904243be",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 2198441,
            "upload_time": "2024-12-27T16:09:59",
            "upload_time_iso_8601": "2024-12-27T16:09:59.107325Z",
            "url": "https://files.pythonhosted.org/packages/e4/60/ffb96e80293c948c2f6b10889afac267d00ee4b332a855e7715c8d778b3d/epiagent-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-27 16:09:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "xy-chen16",
    "github_project": "EpiAgent",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "epiagent"
}
        