mrna-bench


| Field | Value |
|---|---|
| Name | mrna-bench |
| Version | 1.0.1 |
| Summary | Benchmarking suite for mRNA property prediction. |
| Upload time | 2025-01-23 23:15:44 |
| Requires Python | >=3.10 |
| License | MIT License, Copyright (c) 2024 Ian Shi |
| Keywords | mrna, genomic foundation model, benchmark |

# mRNABench
This repository contains a workflow for benchmarking the embedding quality of genomic foundation models on (m)RNA-specific tasks. mRNABench provides a catalogue of datasets and training split logic that can be used to evaluate the embedding quality of the catalogued models.

**Jump to:** [Model Catalog](#model-catalog) [Dataset Catalog](#dataset-catalog)

## Setup
Several configurations of the mRNABench are available.

### Datasets Only
If you are interested in the benchmark datasets **only**, you can run:

```bash
pip install mrna-bench
```

### Full Version
The inference-capable version of mRNABench, which can generate embeddings using
Orthrus, DNA-BERT2, NucleotideTransformer, RNA-FM, and HyenaDNA, can be
installed as shown below. Note that this requires PyTorch version 2.2.2 with
CUDA 12.1, and Triton must be uninstalled (due to a DNA-BERT2 issue).
```bash
conda create --name mrna_bench python=3.10
conda activate mrna_bench

pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
# Quote the extras specifier so shells such as zsh do not expand the brackets.
pip install "mrna-bench[base_models]"
pip uninstall triton
```
Inference with other models will require the installation of the model's
dependencies first, which are usually listed on the model's GitHub page (see below).

### Post-install
After installation, please run the following in Python to set where data associated with the benchmarks will be stored.
```python
import mrna_bench as mb

path_to_dir_to_store_data = "DESIRED_PATH"
mb.update_data_path(path_to_dir_to_store_data)
```

## Usage
Datasets can be retrieved using:

```python
import mrna_bench as mb

dataset = mb.load_dataset("go-mf")
data_df = dataset.data_df
```

The mRNABench can also be used to test out common genomic foundation models:
```python
import torch

import mrna_bench as mb
from mrna_bench.embedder import DatasetEmbedder
from mrna_bench.linear_probe import LinearProbe

device = torch.device("cuda")

dataset = mb.load_dataset("go-mf")
model = mb.load_model("Orthrus", "orthrus-large-6-track", device)

embedder = DatasetEmbedder(model, dataset)
embeddings = embedder.embed_dataset()
embeddings = embeddings.detach().cpu().numpy()

prober = LinearProbe(
    dataset=dataset,
    embeddings=embeddings,
    task="multilabel",
    target_col="target",
    split_type="homology"
)

metrics = prober.run_linear_probe()
print(metrics)
```
Also see the `scripts/` folder for example scripts that use Slurm to embed dataset chunks in parallel to reduce runtime, as well as an example of multi-seed linear probing.
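
The idea of embedding dataset chunks in parallel can be sketched as follows. This is a minimal illustration and not the code in `scripts/`: the helper `chunk_bounds` is hypothetical, and the sketch assumes each job in a Slurm array reads its index from the `SLURM_ARRAY_TASK_ID` environment variable and embeds only its own contiguous slice of rows.

```python
import os


def chunk_bounds(n_rows: int, n_chunks: int, chunk_id: int) -> tuple[int, int]:
    """Return the [start, end) row range for one chunk (hypothetical helper).

    Rows are spread as evenly as possible: the first `n_rows % n_chunks`
    chunks receive one extra row.
    """
    if not 0 <= chunk_id < n_chunks:
        raise ValueError("chunk_id must be in [0, n_chunks)")
    base, extra = divmod(n_rows, n_chunks)
    start = chunk_id * base + min(chunk_id, extra)
    end = start + base + (1 if chunk_id < extra else 0)
    return start, end


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID for array jobs; fall back to 0 locally.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    n_chunks = 4  # assumed to match the size of the Slurm array
    start, end = chunk_bounds(n_rows=10, n_chunks=n_chunks, chunk_id=task_id % n_chunks)
    print(f"task {task_id} embeds rows [{start}, {end})")
```

Each task would then embed only rows `start` to `end - 1`, and the per-chunk embeddings would be concatenated in chunk order once all tasks finish.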

## Model Catalog
The current models catalogued are:

| Model Name |  Model Versions         | Description   | Citation | Supported<br>by<br>`base_models` |
| :--------: |  ---------------------- | ------------- | :------: | :------------------------: |
| `Orthrus` | `orthrus_large_6_track`<br> `orthrus_base_4_track` | Mamba-based RNA FM pre-trained using contrastive learning. | [code](https://github.com/bowang-lab/Orthrus) [paper](https://www.biorxiv.org/content/10.1101/2024.10.10.617658v2)| ✅ |
| `AIDO.RNA` | `aido_rna_650m` <br> `aido_rna_650m_cds` <br> `aido_rna_1b600m` <br> `aido_rna_1b600m_cds` | Encoder Transformer-based RNA FM pre-trained using MLM on 42M ncRNA sequences. Version that is domain adapted to CDS is available. | [paper](https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1) | |
| `RNA-FM` | `rna-fm` <br> `mrna-fm` | Transformer-based RNA FM pre-trained using MLM on 23M ncRNA sequences. mRNA-FM trained on CDS using codon tokenizer. | [github](https://github.com/ml4bio/RNA-FM) | ✅ |
| `DNABERT2` | `dnabert2` | Transformer-based DNA FM pre-trained using MLM on multispecies genomic dataset. Uses BPE and other modern architectural improvements for efficiency. | [github](https://github.com/MAGICS-LAB/DNABERT_2) | ✅ |
| `NucleotideTransformer` | `2.5b-multi-species` <br> `2.5b-1000g` <br> `500m-human-ref` <br> `500m-1000g` <br> `v2-50m-multi-species` <br> `v2-100m-multi-species` <br> `v2-250m-multi-species` <br> `v2-500m-multi-species` | Transformer-based DNA FM pre-trained using MLM on a variety of possible datasets at various model sizes. Sequence is tokenized using 6-mers. | [github](https://github.com/instadeepai/nucleotide-transformer) | ✅ |
| `HyenaDNA` | `hyenadna-large-1m-seqlen-hf` <br> `hyenadna-medium-450k-seqlen-hf` <br> `hyenadna-medium-160k-seqlen-hf` <br> `hyenadna-small-32k-seqlen-hf` <br> `hyenadna-tiny-16k-seqlen-d128-hf` | Hyena-based DNA FM pre-trained using NTP on the human reference genome. Available at various model sizes and pretraining sequence contexts. | [github](https://github.com/HazyResearch/hyena-dna) | ✅ |

### Adding a new model
All models should inherit from the template `EmbeddingModel`. Each model file should lazily load its dependencies within its `__init__` method so that each model can be used individually without installing the dependencies of all other models. Models must implement `get_model_short_name(model_version)`, which fetches the internal name for the model. This name must be unique for every model version and must not contain underscores. Models should implement either `embed_sequence` or `embed_sequence_sixtrack` (see code for method signature). New models should be added to `MODEL_CATALOG`.
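
The lazy-loading and naming rules above can be illustrated as follows. This is a sketch under stated assumptions, not mRNABench's implementation: `EmbeddingModelStub` is a hypothetical stand-in for the real `EmbeddingModel` template (whose constructor and method signatures should be taken from the code), `ToyModel` and its nucleotide-frequency "embedding" are invented for illustration, and the standard-library `json` module stands in for a heavy model dependency.

```python
import importlib


class EmbeddingModelStub:
    """Hypothetical stand-in for mRNABench's `EmbeddingModel` template."""

    @staticmethod
    def get_model_short_name(model_version: str) -> str:
        raise NotImplementedError

    def embed_sequence(self, sequence: str) -> list[float]:
        raise NotImplementedError


class ToyModel(EmbeddingModelStub):
    """Toy model whose dependency is imported only on instantiation."""

    def __init__(self, model_version: str) -> None:
        # Lazy import: the dependency is loaded here, not at module import
        # time, so importing this file never requires it to be installed.
        # `json` is a placeholder for a heavy package such as a model library.
        self._dep = importlib.import_module("json")
        self.short_name = self.get_model_short_name(model_version)

    @staticmethod
    def get_model_short_name(model_version: str) -> str:
        # The short name must not contain underscores.
        return model_version.replace("_", "-")

    def embed_sequence(self, sequence: str) -> list[float]:
        # Placeholder "embedding": nucleotide frequencies, not a real model.
        n = max(len(sequence), 1)
        return [sequence.count(base) / n for base in "ACGT"]


model = ToyModel("toy_model_v1")
print(model.short_name)              # toy-model-v1
print(model.embed_sequence("ACGT"))  # [0.25, 0.25, 0.25, 0.25]
```

Because the import happens inside `__init__`, a missing dependency only raises an error when that particular model is instantiated.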

## Dataset Catalog
The current datasets catalogued are:

| Dataset Name | Catalogue Identifier | Description | Tasks | Citation |
|---|---|---|---|---|
| GO Molecular Function | <code>go-mf</code> | Classification of the molecular function of a transcript's product as defined by the GO Resource. | `multilabel` | [website](https://geneontology.org/) |
| Mean Ribosome Load (Sugimoto) | <code>mrl&#8209;sugimoto</code> | Mean Ribosome Load per transcript isoform as measured in Sugimoto et al. 2022. | `regression` | [paper](https://www.nature.com/articles/s41594-022-00819-2) |
| RNA Half-life (Human) | <code>rnahl&#8209;human</code> | RNA half-life of human transcripts collected by Agarwal et al. 2022. | `regression` | [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02811-x) |
| RNA Half-life (Mouse) | <code>rnahl&#8209;mouse</code> | RNA half-life of mouse transcripts collected by Agarwal et al. 2022. | `regression` | [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02811-x) |
| Protein Subcellular Localization | <code>prot&#8209;loc</code> | Subcellular localization of transcript protein product defined in Protein Atlas. | `multilabel` | [website](https://www.proteinatlas.org/) |
| Protein Coding Gene Essentiality | <code>pcg&#8209;ess</code> | Essentiality of PCGs as measured by CRISPR knockdown. Log-fold expression and binary essentiality available on several cell lines. | `regression` `classification`| [paper](https://www.cell.com/cell/fulltext/S0092-8674(24)01203-0)|

### Adding a new dataset
New datasets should inherit from `BenchmarkDataset`. Dataset names cannot contain underscores. Each new dataset should download raw data and process it into a dataframe by overriding `process_raw_data`. This dataframe should store transcripts as rows, using string encoding in the `sequence` column. If homology splitting is required, a column `gene` containing gene names is required. Six-track embedding also requires the columns `cds` and `splice`. The target column can have any name, as it is specified at the time of probing. New datasets should be added to `DATASET_CATALOG`.
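
The reason a `gene` column is needed for homology splitting can be illustrated with a group-aware split, in which all transcripts of a gene land on the same side of the train/test boundary. This is a simplified sketch, not mRNABench's split logic: it groups by gene name only (a fuller homology-aware split would presumably also keep related genes together), and the function `split_by_gene` and the toy rows are hypothetical.

```python
import random


def split_by_gene(rows, test_frac=0.25, seed=0):
    """Split rows into train/test so that no gene appears in both.

    `rows` is a list of dicts with at least the keys "sequence" and "gene",
    mirroring the column names described above.
    """
    missing = [i for i, r in enumerate(rows) if "sequence" not in r or "gene" not in r]
    if missing:
        raise KeyError(f"rows missing 'sequence' or 'gene': {missing}")

    # Shuffle the unique genes (not the rows) and hold out a fraction of them.
    genes = sorted({r["gene"] for r in rows})
    random.Random(seed).shuffle(genes)
    n_test = max(1, round(len(genes) * test_frac))
    test_genes = set(genes[:n_test])

    train = [r for r in rows if r["gene"] not in test_genes]
    test = [r for r in rows if r["gene"] in test_genes]
    return train, test


# Toy rows for illustration only.
rows = [
    {"sequence": "ATGGCC", "gene": "GENE1"},
    {"sequence": "ATGGCA", "gene": "GENE1"},  # second isoform of GENE1
    {"sequence": "ATGTTT", "gene": "GENE2"},
    {"sequence": "ATGAAA", "gene": "GENE3"},
    {"sequence": "ATGCCC", "gene": "GENE4"},
]
train, test = split_by_gene(rows)
shared = {r["gene"] for r in train} & {r["gene"] for r in test}
print(len(train), len(test), shared)  # `shared` is always the empty set
```

Splitting on rows instead of genes could place two isoforms of the same gene in different partitions, which would let near-identical sequences leak from training into evaluation.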

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "mrna-bench",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "mrna, genomic foundation model, benchmark",
    "author": null,
    "author_email": "\"Ruian (Ian) Shi\" <ian.shi@mail.utoronto.ca>",
    "download_url": "https://files.pythonhosted.org/packages/39/84/132453a4e4f0b096ed551540b0e55aee1858b67bc94390af43afc52017c9/mrna_bench-1.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) 2024 Ian Shi\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.\n        ",
    "summary": "Benchmarking suite for mRNA property prediction.",
    "version": "1.0.1",
    "project_urls": {
        "Repository": "https://github.com/IanShi1996/mRNABench"
    },
    "split_keywords": [
        "mrna",
        " genomic foundation model",
        " benchmark"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "618f8cc489a8dad6d5daea3a0bcec725b40c9a4a51319d56d5404e53f4eac549",
                "md5": "515a47a2a7c099759455e1c56cb18a1d",
                "sha256": "300349674b241c9b01cb51bd15c7d707585d1d01914b08ccfbb13395b8a9d7ae"
            },
            "downloads": -1,
            "filename": "mrna_bench-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "515a47a2a7c099759455e1c56cb18a1d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 37216,
            "upload_time": "2025-01-23T23:15:42",
            "upload_time_iso_8601": "2025-01-23T23:15:42.803769Z",
            "url": "https://files.pythonhosted.org/packages/61/8f/8cc489a8dad6d5daea3a0bcec725b40c9a4a51319d56d5404e53f4eac549/mrna_bench-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3984132453a4e4f0b096ed551540b0e55aee1858b67bc94390af43afc52017c9",
                "md5": "252143b889c5a26bd82f87149c28d97c",
                "sha256": "edf7ff0dd1017c4ced0f03df003e63d9c416636ba09eca84c89422ceac3fbbdb"
            },
            "downloads": -1,
            "filename": "mrna_bench-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "252143b889c5a26bd82f87149c28d97c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 27000,
            "upload_time": "2025-01-23T23:15:44",
            "upload_time_iso_8601": "2025-01-23T23:15:44.524969Z",
            "url": "https://files.pythonhosted.org/packages/39/84/132453a4e4f0b096ed551540b0e55aee1858b67bc94390af43afc52017c9/mrna_bench-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-23 23:15:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "IanShi1996",
    "github_project": "mRNABench",
    "github_not_found": true,
    "lcname": "mrna-bench"
}
        