# Genomic Benchmarks 🧬🏋️✔️
In this repository, we collect benchmarks for the classification of genomic sequences. It is shipped as a Python package, together with functions that help to download and manipulate datasets and to train neural network (NN) models. The current state-of-the-art (SOTA) model on Genomic Benchmarks is [HyenaDNA](https://github.com/HazyResearch/hyena-dna); see its metrics in the [experiments](experiments/README.md#hyenadna) folder.
## Install
Genomic Benchmarks can be installed as follows:
```bash
pip install genomic-benchmarks
```
To use it with papermill, TensorFlow (TF) or PyTorch, install the corresponding dependencies:
```bash
# if you want to use jupyter and papermill
# (version specifiers are quoted so the shell does not treat `>` as a redirect)
pip install "jupyter>=1.0.0"
pip install "papermill>=2.3.0"
# if you want to train NN with TF
pip install "tensorflow>=2.6.0"
pip install tensorflow-addons
pip install typing-extensions --upgrade # fixing TF installation issue
# if you want to train NN with torch
pip install "torch>=1.10.0"
pip install torchtext
```
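To see which of these optional dependencies are already present in the current environment, something like the following may help. It is a minimal sketch that relies only on the standard library's `importlib.metadata` (Python 3.8+); the `installed_version` helper and the `OPTIONAL_PACKAGES` list are illustrative names, not part of Genomic Benchmarks.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the pip commands above (illustrative list).
OPTIONAL_PACKAGES = ["jupyter", "papermill", "tensorflow", "torch", "torchtext"]


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


report = {name: installed_version(name) for name in OPTIONAL_PACKAGES}
for name, found in report.items():
    print(f"{name}: {found or 'not installed'}")
```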
For package development, use Python 3.8 (ideally 3.8.9) and follow the installation steps described [here](README_devel.md).
## Usage
Get the list of all datasets with the `list_datasets` function:
```python
>>> from genomic_benchmarks.data_check import list_datasets
>>>
>>> list_datasets()
['demo_coding_vs_intergenomic_seqs', 'demo_human_or_worm', 'dummy_mouse_enhancers_ensembl', 'human_enhancers_cohn', 'human_enhancers_ensembl', 'human_ensembl_regulatory', 'human_nontata_promoters', 'human_ocr_ensembl']
```
You can get basic information about a benchmark with the `info` function:
```python
>>> from genomic_benchmarks.data_check import info
>>>
>>> info("human_nontata_promoters", version=0)
Dataset `human_nontata_promoters` has 2 classes: negative, positive.
All lenghts of genomic intervals equals 251.
Totally 36131 sequences have been found, 27097 for training and 9034 for testing.
          train  test
negative  12355  4119
positive  14742  4915
```
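The counts reported by `info` can be used to check the class balance and the train/test ratio. The following is a minimal sketch with the numbers copied by hand from the output above; the `counts` dictionary and the `class_share` helper are illustrative names, not part of the package.

```python
# Counts copied from the `info("human_nontata_promoters", version=0)` output.
counts = {
    "train": {"negative": 12355, "positive": 14742},
    "test": {"negative": 4119, "positive": 4915},
}


def class_share(split_counts):
    """Return the fraction of sequences that belongs to each class."""
    total = sum(split_counts.values())
    return {name: n / total for name, n in split_counts.items()}


train_total = sum(counts["train"].values())  # 27097
test_total = sum(counts["test"].values())    # 9034
test_fraction = test_total / (train_total + test_total)

print(f"test fraction: {test_fraction:.3f}")
for split, split_counts in counts.items():
    shares = class_share(split_counts)
    print(split, {name: round(share, 3) for name, share in shares.items()})
```

For this dataset, the test set holds about a quarter of all sequences, and both splits are roughly 54% positive.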
The function `download_dataset` downloads the full-sequence form of the requested benchmark (split into train and test sets, with one folder for each class). Unless specified otherwise, the data is stored in the `.genomic_benchmarks` subfolder of your home directory. By default, the dataset is obtained from our cloud cache (`use_cloud_cache=True`).
```python
>>> from genomic_benchmarks.loc2seq import download_dataset
>>>
>>> download_dataset("human_nontata_promoters", version=0)
Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /home/petr/.genomic_benchmarks/human_nontata_promoters.zip... Done.
Unzipping...Done.
PosixPath('/home/petr/.genomic_benchmarks/human_nontata_promoters')
```
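Once a dataset has been downloaded, the folder layout can be inspected with the standard library alone. The sketch below assumes the `<split>/<class>/<file>` layout described above, with one file per sequence, and the default download location; `count_sequences` is a hypothetical helper, not a function of the package.

```python
from pathlib import Path


def count_sequences(dataset_dir):
    """Count files per split and class in a <split>/<class>/<file> layout."""
    counts = {}
    for split_dir in sorted(Path(dataset_dir).iterdir()):
        if not split_dir.is_dir():
            continue
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts


# Assumed default location; adjust it if the data was stored elsewhere.
dataset_dir = Path.home() / ".genomic_benchmarks" / "human_nontata_promoters"
if dataset_dir.exists():
    print(count_sequences(dataset_dir))
```

If the assumptions hold, the printed counts should agree with the numbers reported by `info`.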
Getting a TensorFlow Dataset for the benchmark and displaying samples is straightforward:
```python
>>> from pathlib import Path
>>> import tensorflow as tf
>>>
>>> BATCH_SIZE = 64
>>> SEQ_TRAIN_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters' / 'train'
>>> CLASSES = ['negative', 'positive']
>>>
>>> train_dset = tf.keras.preprocessing.text_dataset_from_directory(
...     directory=SEQ_TRAIN_PATH,
...     batch_size=BATCH_SIZE,
...     class_names=CLASSES)
Found 27097 files belonging to 2 classes.
>>>
>>> list(train_dset)[0][0][0]
<tf.Tensor: shape=(), dtype=string, numpy=b'TCCTGCCTTTCCACTTGCACCAGTTTTCCCACCCCAGCCTCAGGGCGGGGCTGCCTCGTCACTTGTCTCGGGGCAGATCTGCCCTACACACGTTAGCGCCGCGCGCAAAGCAGCCCCGCAGCACCCAGGCGCCTCCTGGCGGCGCCGCGAAGGGGCGGGGCTGTCGGCTGCGCGTTGTGCGCTGTCCCAGGTTGGAAACCAGTGCCCCAGGCGGCGAGGAGAGCGGTGCCTTGCAGGGATGCTGCGGGCGG'>
```
See [How_To_Train_CNN_Classifier_With_TF.ipynb](notebooks/How_To_Train_CNN_Classifier_With_TF.ipynb) for a more detailed description of how to train a CNN classifier with TensorFlow.
Getting a PyTorch Dataset and displaying samples is also easy:
```python
>>> from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters
>>>
>>> dset = HumanNontataPromoters(split='train', version=0)
>>> dset[0]
('CAATCTCACAGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGGCAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCACATGGCAAGAAGGTGCTGACTTCCTTGGGAGATGCCATAAAGCACCTGGATGATCTCAAGGGCACCTTTGCCCAGCTGAGTGAACTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGGTGAGTCCAGGAGATGT', 0)
```
See [How_To_Train_CNN_Classifier_With_Pytorch.ipynb](notebooks/How_To_Train_CNN_Classifier_With_Pytorch.ipynb) for a more detailed description of how to train a CNN classifier with PyTorch.
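Each item of the PyTorch dataset is a `(sequence, label)` tuple, so the raw string has to be turned into numbers before it reaches a model. The following is a minimal sketch in plain Python that assumes a simple integer code per nucleotide. The `NUCLEOTIDE_TO_INDEX` mapping and the `encode_sequence` and `encode_batch` helpers are illustrative and not part of the package; the notebooks may use a different encoding.

```python
# Illustrative mapping; index 4 is used for anything outside A/C/G/T (e.g. N).
NUCLEOTIDE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN_INDEX = 4


def encode_sequence(sequence):
    """Map a DNA string to a list of integer indices, one per nucleotide."""
    return [NUCLEOTIDE_TO_INDEX.get(base, UNKNOWN_INDEX) for base in sequence.upper()]


def encode_batch(samples):
    """Split (sequence, label) pairs into encoded sequences and labels."""
    encoded = [encode_sequence(sequence) for sequence, _ in samples]
    labels = [label for _, label in samples]
    return encoded, labels


# Stand-in samples shaped like `dset[i]`; the real sequences are 251 bases long.
samples = [("CAATCT", 0), ("ACGTN", 1)]
encoded, labels = encode_batch(samples)
print(encoded)  # [[1, 0, 0, 3, 1, 3], [0, 1, 2, 3, 4]]
print(labels)   # [0, 1]
```

A function like `encode_batch` could serve as the starting point for a custom batching step, after converting the lists to tensors.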
## Hugging Face
We also provide these benchmarks through the Hugging Face Hub: https://huggingface.co/katarinagresova
If you are used to working with Hugging Face datasets, you can use this option to access Genomic Benchmarks. See [How_To_Use_Datasets_From_HF.ipynb](notebooks/How_To_Use_Datasets_From_HF.ipynb) for a guide.
## Structure of package
* [datasets](datasets/): Each folder is one benchmark dataset (or a set of benchmarks in subfolders); see [README.md](datasets/README.md) for the format specification
* [docs](docs/): Each folder contains a Python notebook that was used to create the dataset
* [experiments](experiments/): Training of simple neural network models for each benchmark dataset; can be used as a baseline
* [notebooks](notebooks/): Main use cases demonstrated in the form of Jupyter notebooks
* [src/genomic_benchmarks](src/genomic_benchmarks/): Python module for dataset manipulation (downloading, checking, etc.)
* [tests](tests/): Unit tests for `pytest` and `pytest-cov`
## How to contribute
### How to contribute a model
If you beat our current best model on any dataset or just came up with an interesting new idea, let us know about it: make your code publicly available (GitHub repo, Colab...) and fill in the form at
https://forms.gle/pvkkrgHNCNmAAC1TA
### How to contribute a dataset
If you have an interesting genomic dataset, send us [an issue](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/issues) with a description and, if possible, a link to the data (e.g. BED file and FASTQ reference). In the future, we will provide functions to make the import easy.
If you are a hero, read [the specification of our dataset format](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/datasets) and send us a pull request with new `datasets/[YOUR_DATASET_NAME]` and `docs/[YOUR_DATASET_NAME]` folders.
### How to improve code in this package
We welcome new code contributors. If you see a bug, send us [an issue](https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/issues) with a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). Or even better, fix the bug and send us a pull request.
## Citing Genomic Benchmarks
If you use Genomic Benchmarks in your research, please cite it as follows.
### Text
Grešová, Katarína, et al. "Genomic benchmarks: a collection of datasets for genomic sequence classification." BMC Genomic Data 24.1 (2023): 25.
### BibTeX
```bib
@article{grevsova2023genomic,
title={Genomic benchmarks: a collection of datasets for genomic sequence classification},
author={Gre{\v{s}}ov{\'a}, Katar{\'\i}na and Martinek, Vlastimil and {\v{C}}ech{\'a}k, David and {\v{S}}ime{\v{c}}ek, Petr and Alexiou, Panagiotis},
journal={BMC Genomic Data},
volume={24},
number={1},
pages={25},
year={2023},
publisher={Springer}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks",
"name": "genomic-benchmarks",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "bioinformatics, genomics, data",
"author": "RBP Bioinformatics",
"author_email": "ML.Bioinfo.CEITEC@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/45/d5/5c212c1625185f4315e259337170d00474a61413624543b8e64df37107f8/genomic_benchmarks-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "Genomic Benchmarks",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks"
},
"split_keywords": [
"bioinformatics",
" genomics",
" data"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "45d55c212c1625185f4315e259337170d00474a61413624543b8e64df37107f8",
"md5": "94dbb3ea05cbe52b586f904a258f42f1",
"sha256": "1cf72a9e9413b77ee3e6de77d35d10de49f9e7f652a2fa6cd05d7b20388d9170"
},
"downloads": -1,
"filename": "genomic_benchmarks-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "94dbb3ea05cbe52b586f904a258f42f1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 21755,
"upload_time": "2025-07-30T12:24:26",
"upload_time_iso_8601": "2025-07-30T12:24:26.315785Z",
"url": "https://files.pythonhosted.org/packages/45/d5/5c212c1625185f4315e259337170d00474a61413624543b8e64df37107f8/genomic_benchmarks-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-30 12:24:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ML-Bioinfo-CEITEC",
"github_project": "genomic_benchmarks",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "genomic-benchmarks"
}