# ProteinGym
## Table of Contents
- [Overview](#overview)
- [Fitness prediction performance](#fitness-prediction-performance)
- [Resources](#resources)
- [How to contribute?](#how-to-contribute)
- [Usage and reproducibility](#usage-and-reproducibility)
- [Acknowledgements](#acknowledgements)
- [Releases](#releases)
- [License](#license)
- [Reference](#reference)
- [Links](#links)
## Overview
ProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants curated to enable thorough comparisons of various mutation effect predictors in different regimes. Both the DMS assays and clinical variants are divided into 1) a substitution benchmark which currently consists of the experimental characterisation of ~2.7M missense variants across 217 DMS assays and 2,525 clinical proteins, and 2) an indel benchmark that includes ~300k mutants across 74 DMS assays and 1,555 clinical proteins.
Each processed file in each benchmark corresponds to a single DMS assay or clinical protein, and contains the following variables:
- mutant (str): describes the set of substitutions to apply to the reference sequence to obtain the mutated sequence (e.g., A1P:D2N means the amino acid 'A' at position 1 should be replaced by 'P', and 'D' at position 2 by 'N'). Present in the ProteinGym substitution benchmark only (not indels).
- mutated_sequence (str): represents the full amino acid sequence for the mutated protein.
- DMS_score (float): corresponds to the experimental measurement in the DMS assay. Across all assays, the higher the DMS_score, the higher the fitness of the mutated protein. This column is not present in the clinical files, since clinical variants are classified as benign/pathogenic and do not have continuous scores.
- DMS_score_bin (int): indicates whether the DMS_score is above the fitness cutoff (1 = fit, or pathogenic for clinical variants; 0 = not fit, or benign for clinical variants).
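The `mutant` notation above can be applied mechanically to a reference sequence. A minimal sketch (the `apply_mutant` helper is hypothetical, not part of the ProteinGym codebase):

```python
def apply_mutant(reference: str, mutant: str) -> str:
    """Apply colon-separated substitutions (e.g. 'A1P:D2N') to `reference`.

    Each token is <wild-type AA><1-based position><mutant AA>; the wild-type
    amino acid is checked against the reference sequence before substituting.
    """
    seq = list(reference)
    for sub in mutant.split(":"):
        wt, pos, mut = sub[0], int(sub[1:-1]), sub[-1]
        if seq[pos - 1] != wt:
            raise ValueError(f"expected {wt} at position {pos}, found {seq[pos - 1]}")
        seq[pos - 1] = mut
    return "".join(seq)

print(apply_mutant("ADKL", "A1P:D2N"))  # -> PNKL
```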
Additionally, we provide two reference files for each benchmark that give further details on each assay and contain in particular:
- The UniProt_ID of the corresponding protein, along with taxon and MSA depth category
- The target sequence (target_seq) used in the assay
- For the assays, details on how the DMS_score was created from the raw files and how it was binarized
To download the benchmarks, please see `DMS benchmark - Substitutions` and `DMS benchmark - Indels` in the "Resources" section below.
## Fitness prediction performance
The [benchmarks](https://github.com/OATML-Markslab/ProteinGym/tree/main/benchmarks) folder provides detailed performance files for all baselines on the DMS and clinical benchmarks.
We report the following metrics:
- For DMS benchmarks in the zero-shot setting: Spearman, NDCG, AUC, MCC and Top-K recall
- For DMS benchmarks in the supervised setting: Spearman and MSE
- For clinical benchmarks: AUC
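As an illustration of the primary zero-shot DMS metric, Spearman rank correlation compares the ordering induced by model scores with the ordering of experimental DMS_score values. A self-contained, tie-aware sketch in plain Python (not the repository's own implementation):

```python
from statistics import mean

def _ranks(xs):
    # Average ranks, with ties sharing the mean of their rank positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman rho = Pearson correlation of the rank vectors.
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

print(round(spearman([0.1, 0.4, 0.2], [1.0, 3.0, 2.0]), 3))  # -> 1.0
```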
Metrics are aggregated as follows:
1. Aggregating by UniProt ID (to avoid biasing results towards proteins for which several DMS assays are available in ProteinGym)
2. Aggregating by different functional categories, and taking the mean across those categories.
These files are named, e.g., `DMS_substitutions_Spearman_DMS_level.csv`, `DMS_substitutions_Spearman_Uniprot_level.csv` and `DMS_substitutions_Spearman_Uniprot_Selection_Type_level.csv`, for the per-assay results and the two aggregation steps respectively.
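The two aggregation steps can be sketched on toy numbers (all values and identifiers below are illustrative; the repository's performance scripts implement the official procedure):

```python
from collections import defaultdict
from statistics import mean

# (UniProt ID, functional category, per-assay Spearman) -- illustrative values
records = [
    ("P01112", "Activity", 0.50),
    ("P01112", "Activity", 0.60),  # second assay for the same protein
    ("P38398", "Binding", 0.40),
]

# Step 1: average per UniProt ID so multi-assay proteins are not over-weighted
per_uniprot = defaultdict(list)
for uid, cat, rho in records:
    per_uniprot[(uid, cat)].append(rho)
uniprot_scores = {k: mean(v) for k, v in per_uniprot.items()}

# Step 2: average per functional category, then take the mean across categories
per_cat = defaultdict(list)
for (uid, cat), rho in uniprot_scores.items():
    per_cat[cat].append(rho)
final = mean(mean(v) for v in per_cat.values())
print(round(final, 3))  # -> 0.475
```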
For other deep dives (performance split by taxa, MSA depth, mutational depth and more), see the summary files such as `benchmarks/DMS_zero_shot/substitutions/Spearman/Summary_performance_DMS_substitutions_Spearman.csv` (and the corresponding DMS_indels/clinical_substitutions/clinical_indels and supervised counterparts). These are also the files hosted on the website.
We also include, as on the website, a bootstrapped standard error of these aggregated metrics to reflect the variance in the final numbers with respect to the individual assays.
To calculate the DMS substitution benchmark metrics:
1. Download the model scores from the website
2. Run `./scripts/scoring_DMS_zero_shot/performance_substitutions.sh`
And for indels, follow step #1 and run `./scripts/scoring_DMS_zero_shot/performance_substitutions_indels.sh`.
### ProteinGym benchmarks - Leaderboard
The full ProteinGym benchmarks performance files are also accessible via our dedicated website: https://www.proteingym.org/.
It includes leaderboards for the substitution and indel benchmarks, as well as detailed DMS-level performance files for all baselines.
The current version of the substitution benchmark includes the following baselines:
Model name | Model type | Reference
--- | --- | --- |
Site Independent | Alignment-based model | [Hopf, T.A., Ingraham, J., Poelwijk, F.J., Schärfe, C.P., Springer, M., Sander, C., & Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135.](https://www.nature.com/articles/nbt.3769)
EVmutation | Alignment-based model | [Hopf, T.A., Ingraham, J., Poelwijk, F.J., Schärfe, C.P., Springer, M., Sander, C., & Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135.](https://www.nature.com/articles/nbt.3769)
WaveNet | Alignment-based model | [Shin, J., Riesselman, A.J., Kollasch, A.W., McMahon, C., Simon, E., Sander, C., Manglik, A., Kruse, A.C., & Marks, D.S. (2021). Protein design and variant prediction using autoregressive generative models. Nature Communications, 12.](https://www.nature.com/articles/s41467-021-22732-w)
DeepSequence | Alignment-based model | [Riesselman, A.J., Ingraham, J., & Marks, D.S. (2018). Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15, 816-822.](https://www.nature.com/articles/s41592-018-0138-4)
GEMME | Alignment-based model | [Laine, É., Karami, Y., & Carbone, A. (2019). GEMME: A Simple and Fast Global Epistatic Model Predicting Mutational Effects. Molecular Biology and Evolution, 36, 2604 - 2619.](https://pubmed.ncbi.nlm.nih.gov/31406981/)
EVE | Alignment-based model | [Frazer, J., Notin, P., Dias, M., Gomez, A.N., Min, J.K., Brock, K.P., Gal, Y., & Marks, D.S. (2021). Disease variant prediction with deep generative models of evolutionary data. Nature.](https://www.nature.com/articles/s41586-021-04043-8)
Unirep | Protein language model | [Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., & Church, G.M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 1-8](https://www.nature.com/articles/s41592-019-0598-1)
ESM-1b | Protein language model | Original model: [Rives, A., Goyal, S., Meier, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., & Fergus, R. (2019). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 118](https://www.biorxiv.org/content/10.1101/622803v4); Extensions: [Brandes, N., Goldman, G., Wang, C.H. et al. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet 55, 1512–1522 (2023).](https://doi.org/10.1038/s41588-023-01465-0)
ESM-1v | Protein language model | [Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., & Rives, A. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS.](https://proceedings.neurips.cc/paper/2021/hash/f51338d736f95dd42427296047067694-Abstract.html)
VESPA | Protein language model | [Marquet, C., Heinzinger, M., Olenyi, T., Dallago, C., Bernhofer, M., Erckert, K., & Rost, B. (2021). Embeddings from protein language models predict conservation and variant effects. Human Genetics, 141, 1629 - 1647.](https://link.springer.com/article/10.1007/s00439-021-02411-y)
RITA | Protein language model | [Hesslow, D., Zanichelli, N., Notin, P., Poli, I., & Marks, D.S. (2022). RITA: a Study on Scaling Up Generative Protein Sequence Models. ArXiv, abs/2205.05789.](https://arxiv.org/abs/2205.05789)
ProtGPT2 | Protein language model | [Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13.](https://www.nature.com/articles/s41467-022-32007-7)
ProGen2 | Protein language model | [Nijkamp, E., Ruffolo, J.A., Weinstein, E.N., Naik, N., & Madani, A. (2022). ProGen2: Exploring the Boundaries of Protein Language Models. ArXiv, abs/2206.13517.](https://arxiv.org/abs/2206.13517)
MSA Transformer | Hybrid - Alignment & PLM |[Rao, R., Liu, J., Verkuil, R., Meier, J., Canny, J.F., Abbeel, P., Sercu, T., & Rives, A. (2021). MSA Transformer. ICML.](http://proceedings.mlr.press/v139/rao21a.html)
Tranception | Hybrid - Alignment & PLM | [Notin, P., Dias, M., Frazer, J., Marchena-Hurtado, J., Gomez, A.N., Marks, D.S., & Gal, Y. (2022). Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. ICML.](https://proceedings.mlr.press/v162/notin22a.html)
TranceptEVE | Hybrid - Alignment & PLM | [Notin, P., Van Niekerk, L., Kollasch, A., Ritter, D., Gal, Y. & Marks, D.S. & (2022). TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction. NeurIPS, LMRL workshop.](https://www.biorxiv.org/content/10.1101/2022.12.07.519495v1?rss=1)
CARP | Protein language model | [Yang, K.K., Fusi, N., Lu, A.X. (2022). Convolutions are competitive with transformers for protein sequence pretraining.](https://doi.org/10.1101/2022.05.19.492714)
MIF | Inverse folding | [Yang, K.K., Yeh, H., Zanichelli, N. (2022). Masked Inverse Folding with Sequence Transfer for Protein Representation Learning.](https://doi.org/10.1101/2022.05.25.493516)
ProteinMPNN | Inverse folding | [J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer, F. Chan,B. Koepnick, H. Nguyen, A. Kang, B. Sankaran,A. K. Bera, N. P. King,D. Baker (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science, Vol 378.](https://www.science.org/doi/10.1126/science.add2187)
ESM-IF1 | Inverse folding | [Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, Alexander Rives (2022). Learning Inverse Folding from Millions of Predicted Structures. ICML](https://www.biorxiv.org/content/10.1101/2022.04.10.487779v2.full.pdf+html)
ProtSSN | Hybrid - Structure & PLM | [Yang Tan, Bingxin Zhou, Lirong Zheng, Guisheng Fan, Liang Hong. (2023). Semantical and Topological Protein Encoding Toward Enhanced Bioactivity and Thermostability.](https://www.biorxiv.org/content/10.1101/2023.12.01.569522v1)
SaProt | Hybrid - Structure & PLM | [Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, Fajie Yuan. (2024). SaProt: Protein Language Modeling with Structure-aware Vocabulary. ICLR](https://www.biorxiv.org/content/10.1101/2023.10.01.560349v5)
Except for WaveNet (which uses alignments only to retrieve a set of homologous protein sequences, and then trains on the unaligned sequences), alignment-based methods cannot score indels, given the fixed coordinate system they are trained on. Similarly, the masked-marginals scoring procedure for ESM-1v and MSA Transformer requires each mutated position to exist in the wild-type sequence. All other model architectures listed above (e.g., Tranception, RITA, ProGen2) are included in the indel benchmark.
For clinical baselines, we used dbNSFP 4.4a as detailed in the manuscript appendix (and in `proteingym/clinical_benchmark_notebooks/clinical_subs_processing.ipynb`).
## Resources
To download and unzip the data, use the following template, replacing {VERSION} with the desired version number (e.g., "v1.1") and {FILENAME} with the specific file you want to download, as listed in the table below. The latest version is v1.1.
For example, you can download & unzip the zero-shot predictions for all baselines for all DMS substitution assays as follows:
```bash
VERSION="v1.1"
FILENAME="DMS_ProteinGym_substitutions.zip"
curl -o ${FILENAME} https://marks.hms.harvard.edu/proteingym/ProteinGym_${VERSION}/${FILENAME}
unzip ${FILENAME} && rm ${FILENAME}
```
Data | Size (unzipped) | Filename
--- | --- | --- |
DMS benchmark - Substitutions | 1.0GB | DMS_ProteinGym_substitutions.zip
DMS benchmark - Indels | 200MB | DMS_ProteinGym_indels.zip
Zero-shot DMS Model scores - Substitutions | 31GB | zero_shot_substitutions_scores.zip
Zero-shot DMS Model scores - Indels | 5.2GB | zero_shot_indels_scores.zip
Supervised DMS Model performance - Substitutions | 2.7MB | DMS_supervised_substitutions_scores.zip
Supervised DMS Model performance - Indels | 0.9MB | DMS_supervised_indels_scores.zip
Multiple Sequence Alignments (MSAs) for DMS assays | 5.2GB | DMS_msa_files.zip
Redundancy-based sequence weights for DMS assays | 200MB | DMS_msa_weights.zip
Predicted 3D structures from inverse-folding models | 84MB | ProteinGym_AF2_structures.zip
Clinical benchmark - Substitutions | 123MB | clinical_ProteinGym_substitutions.zip
Clinical benchmark - Indels | 2.8MB | clinical_ProteinGym_indels.zip
Clinical MSAs | 17.8GB | clinical_msa_files.zip
Clinical MSA weights | 250MB | clinical_msa_weights.zip
Clinical Model scores - Substitutions | 0.9GB | zero_shot_clinical_substitutions_scores.zip
Clinical Model scores - Indels | 0.7GB | zero_shot_clinical_indels_scores.zip
CV folds - Substitutions - Singles | 50MB | cv_folds_singles_substitutions.zip
CV folds - Substitutions - Multiples | 81MB | cv_folds_multiples_substitutions.zip
CV folds - Indels | 19MB | cv_folds_indels.zip
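The download template above extends naturally to batch downloads. A sketch that loops over several files from the table (file list illustrative; `DRY_RUN=1`, the default here, only prints the commands instead of fetching):

```shell
VERSION="v1.1"
BASE_URL="https://marks.hms.harvard.edu/proteingym/ProteinGym_${VERSION}"
FILES="DMS_ProteinGym_substitutions.zip DMS_ProteinGym_indels.zip"

for FILENAME in ${FILES}; do
    URL="${BASE_URL}/${FILENAME}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show what would be fetched
        echo "curl -o ${FILENAME} ${URL}"
    else
        curl -o "${FILENAME}" "${URL}" && unzip "${FILENAME}" && rm "${FILENAME}"
    fi
done
```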
We also host the raw DMS assays (before preprocessing):
Data | Size (unzipped) | Filename
--- | --- | --- |
DMS benchmark: Substitutions (raw) | 500MB | substitutions_raw_DMS.zip
DMS benchmark: Indels (raw) | 450MB | indels_raw_DMS.zip
Clinical benchmark: Substitutions (raw) | 58MB | substitutions_raw_clinical.zip
Clinical benchmark: Indels (raw) | 12.4MB | indels_raw_clinical.zip
## How to contribute?
### New assays
If you would like to suggest new assays for ProteinGym, please raise an issue on this repository with a `new_assay` label. The criteria we typically consider for inclusion are as follows:
1. The corresponding raw dataset needs to be publicly available
2. The assay needs to be protein-related (i.e., excluding UTRs, tRNAs, promoters, etc.)
3. The dataset needs to have a sufficient number of measurements
4. The assay needs to have a sufficiently high dynamic range
5. The assay has to be relevant to fitness prediction
### New baselines
If you would like new baselines to be included in ProteinGym (i.e., website, performance files, detailed scoring files), please follow these steps:
1. Submit a PR to our repo with two things:
- A new subfolder under proteingym/baselines named with your new model name. This subfolder should include a python scoring script similar to [this script](https://github.com/OATML-Markslab/ProteinGym/blob/main/proteingym/baselines/rita/compute_fitness.py), as well as all code dependencies required for the scoring script to run properly
- An example bash script (e.g., under scripts/scoring_DMS_zero_shot) with all relevant hyperparameters for scoring, similar to [this script](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/scoring_DMS_zero_shot/scoring_RITA_substitutions.sh)
2. Raise an issue with a 'new model' label, providing instructions on how to download relevant model checkpoints for scoring, and reporting the performance of your model on the relevant benchmark using our performance scripts (e.g., [for zero-shot DMS benchmarks](https://github.com/OATML-Markslab/ProteinGym/blob/main/proteingym/performance_DMS_benchmarks.py)). Please note that our DMS performance scripts correct for various biases (e.g., number of assays per protein family and function groupings) and thus the resulting aggregated performance is not the same as the arithmetic average across assays.
At this point we are only considering new baselines satisfying the following conditions:
1. The model is able to score all mutants in the relevant benchmark (to ensure all models are compared exactly on the same set of mutants everywhere);
2. The corresponding model is open source (we should be able to reproduce scores if needed).
At this stage, we are only considering requests where model scores for all mutants in a given benchmark (substitution or indel) are provided by the requester; however, we plan to regularly score new baselines ourselves for methods with wide community adoption and/or suggestions with many upvotes.
### Notes
12 December 2023: The code for training and evaluating supervised models is currently shared in https://github.com/OATML-Markslab/ProteinNPT. We are in the process of integrating the code into this repo.
## Usage and reproducibility
If you would like to compute all performance metrics for the various benchmarks, please follow these steps:
1. Download locally all relevant files as per instructions above (see Resources)
2. Update the paths for all files downloaded in the prior step in the [config script](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/zero_shot_config.sh)
3. If adding a new model, adjust the [config.json](https://github.com/OATML-Markslab/ProteinGym/blob/main/config.json) file accordingly and add the model scores to the relevant path (e.g., [DMS_output_score_folder_subs](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/zero_shot_config.sh#L19))
4. If focusing on DMS benchmarks, run the [merge script](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/scoring_DMS_zero_shot/merge_all_scores.sh). This will create a single file for each DMS assay, with scores for all model baselines
5. Run the relevant performance script (e.g., [scripts/scoring_DMS_zero_shot/performance_substitutions.sh](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/scoring_DMS_zero_shot/performance_substitutions.sh))
## Acknowledgements
Our codebase leveraged code from the following repositories to compute baselines:
Model | Repo
--- | ---
UniRep | https://github.com/churchlab/UniRep
UniRep | https://github.com/chloechsu/combining-evolutionary-and-assay-labelled-data
EVE | https://github.com/OATML-Markslab/EVE
GEMME | https://hub.docker.com/r/elodielaine/gemme
ESM | https://github.com/facebookresearch/esm
EVmutation | https://github.com/debbiemarkslab/EVcouplings
ProGen2 | https://github.com/salesforce/progen
HMMER | https://github.com/EddyRivasLab/hmmer
MSA Transformer | https://github.com/rmrao/msa-transformer
ProtGPT2 | https://huggingface.co/nferruz/ProtGPT2
ProteinMPNN | https://github.com/dauparas/ProteinMPNN
RITA | https://github.com/lightonai/RITA
Tranception | https://github.com/OATML-Markslab/Tranception
VESPA | https://github.com/Rostlab/VESPA
CARP | https://github.com/microsoft/protein-sequence-models
MIF | https://github.com/microsoft/protein-sequence-models
Foldseek | https://github.com/steineggerlab/foldseek
ProtSSN | https://github.com/tyang816/ProtSSN
SaProt | https://github.com/westlake-repl/SaProt
We would like to thank the GEMME team for providing model scores on an earlier version of the benchmark (ProteinGym v0.1), and the ProtSSN and SaProt teams for integrating their model in the ProteinGym repo.
Special thanks to the teams of experimentalists who developed and performed the assays that ProteinGym is built on. If you are using ProteinGym in your work, please consider citing the corresponding papers. To facilitate this, we have prepared a file (assays.bib) containing the BibTeX entries for all these papers.
## Releases
1. [ProteinGym_v1.0](https://zenodo.org/records/13932633): Initial release.
2. [ProteinGym_v1.1](https://zenodo.org/records/13936340): Updates to reference file, and addition of ProtSSN and SaProt baselines.
## License
This project is available under the MIT license found in the LICENSE file in this GitHub repository.
## Reference
If you use ProteinGym in your work, please cite the following paper:
```bibtex
@inproceedings{NEURIPS2023_cac723e5,
author = {Notin, Pascal and Kollasch, Aaron and Ritter, Daniel and van Niekerk, Lood and Paul, Steffanie and Spinner, Han and Rollins, Nathan and Shaw, Ada and Orenbuch, Rose and Weitzman, Ruben and Frazer, Jonathan and Dias, Mafalda and Franceschi, Dinko and Gal, Yarin and Marks, Debora},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {64331--64379},
publisher = {Curran Associates, Inc.},
title = {ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/cac723e5ff29f65e3fcbb0739ae91bee-Paper-Datasets_and_Benchmarks.pdf},
volume = {36},
year = {2023}
}
```
## Links
- Website: https://www.proteingym.org/
- NeurIPS proceedings: [link to abstract](https://papers.nips.cc/paper_files/paper/2023/hash/cac723e5ff29f65e3fcbb0739ae91bee-Abstract-Datasets_and_Benchmarks.html)
- Preprint: [link to abstract](https://www.biorxiv.org/content/10.1101/2023.12.07.570727v1)
Raw data
{
"_id": null,
"home_page": "https://github.com/OATML-Markslab/ProteinGym",
"name": "proteingym",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "OATML-Markslab",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/3e/3a/50f7db67969cccc84cbfee30cb37d8fef01215128ecc9607feb6a5e5fdb4/proteingym-1.1.tar.gz",
"platform": null,
"description": "# ProteinGym\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Fitness prediction performance](#fitness-prediction-performance)\n- [Resources](#resources)\n- [How to contribute?](#how-to-contribute)\n- [Usage and reproducibility](#usage-and-reproducibility)\n- [Acknowledgements](#acknowledgements)\n- [Releases](#releases)\n- [License](#license)\n- [Reference](#reference)\n- [Links](#links)\n\n## Overview\n\nProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants curated to enable thorough comparisons of various mutation effect predictors in different regimes. Both the DMS assays and clinical variants are divided into 1) a substitution benchmark which currently consists of the experimental characterisation of ~2.7M missense variants across 217 DMS assays and 2,525 clinical proteins, and 2) an indel benchmark that includes \u223c300k mutants across 74 DMS assays and 1,555 clinical proteins.\n\nEach processed file in each benchmark corresponds to a single DMS assay or clinical protein, and contains the following variables:\n- mutant (str): describes the set of substitutions to apply on the reference sequence to obtain the mutated sequence (eg., A1P:D2N implies the amino acid 'A' at position 1 should be replaced by 'P', and 'D' at position 2 should be replaced by 'N'). Present in the the ProteinGym substitution benchmark only (not indels).\n- mutated_sequence (str): represents the full amino acid sequence for the mutated protein.\n- DMS_score (float): corresponds to the experimental measurement in the DMS assay. Across all assays, the higher the DMS_score value, the higher the fitness of the mutated protein. 
This column is not present in the clinical files, since they are classified as benign/pathogenic, and do not have continuous scores\n- DMS_score_bin (int): indicates whether the DMS_score is above the fitness cutoff (1 is fit (pathogenic for clinical variants), 0 is not fit (benign for clinical variants))\n\nAdditionally, we provide two reference files for each benchmark that give further details on each assay and contain in particular:\n- The UniProt_ID of the corresponding protein, along with taxon and MSA depth category\n- The target sequence (target_seq) used in the assay\n- For the assays, details on how the DMS_score was created from the raw files and how it was binarized \n\nTo download the benchmarks, please see `DMS benchmark - Substitutions` and `DMS benchmark - Indels` in the \"Resources\" section below.\n\n## Fitness prediction performance\n\nThe [benchmarks](https://github.com/OATML-Markslab/ProteinGym/tree/main/benchmarks) folder provides detailed performance files for all baselines on the DMS and clinical benchmarks.\n\nWe report the following metrics:\n- For DMS benchmarks in the zero-shot setting: Spearman, NDCG, AUC, MCC and Top-K recall\n- For DMS benchmarks in the supervised setting: Spearman and MSE\n- For clinical benchmarks: AUC\n\nMetrics are aggregated as follows:\n1. Aggregating by UniProt ID (to avoid biasing results towards proteins for which several DMS assays are available in ProteinGym)\n2. Aggregating by different functional categories, and taking the mean across those categories.\n\nThese files are named e.g. `DMS_substitutions_Spearman_DMS_level.csv`, `DMS_substitutions_Spearman_Uniprot_level` and `DMS_substitutions_Spearman_Uniprot_Selection_Type_level` respectively for these different steps.\n\nFor other deep dives (performance split by taxa, MSA depth, mutational depth and more), these are all contained in the `benchmarks/DMS_zero_shot/substitutions/Spearman/Summary_performance_DMS_substitutions_Spearman.csv` folder (resp. 
DMS_indels/clinical_substitutions/clinical_indels & their supervised counterparts). These files are also what are hosted on the website.\n\nWe also include, as on the website, a bootstrapped standard error of these aggregated metrics to reflect the variance in the final numbers with respect to the individual assays.\n\nTo calculate the DMS substitution benchmark metrics:\n1. Download the model scores from the website\n2. Run `./scripts/scoring_DMS_zero_shot/performance_substitutions.sh`\n\nAnd for indels, follow step #1 and run `./scripts/scoring_DMS_zero_shot/performance_substitutions_indels.sh`.\n\n### ProteinGym benchmarks - Leaderboard\n\nThe full ProteinGym benchmarks performance files are also accessible via our dedicated website: https://www.proteingym.org/.\nIt includes leaderboards for the substitution and indel benchmarks, as well as detailed DMS-level performance files for all baselines.\nThe current version of the substitution benchmark includes the following baselines:\n\nModel name | Model type | Reference\n--- | --- | --- |\nSite Independent | Alignment-based model | [Hopf, T.A., Ingraham, J., Poelwijk, F.J., Sch\u00e4rfe, C.P., Springer, M., Sander, C., & Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135.](https://www.nature.com/articles/nbt.3769)\nEVmutation | Alignment-based model | [Hopf, T.A., Ingraham, J., Poelwijk, F.J., Sch\u00e4rfe, C.P., Springer, M., Sander, C., & Marks, D.S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35, 128-135.](https://www.nature.com/articles/nbt.3769)\nWaveNet | Alignment-based model | [Shin, J., Riesselman, A.J., Kollasch, A.W., McMahon, C., Simon, E., Sander, C., Manglik, A., Kruse, A.C., & Marks, D.S. (2021). Protein design and variant prediction using autoregressive generative models. 
Nature Communications, 12.](https://www.nature.com/articles/s41467-021-22732-w)\nDeepSequence | Alignment-based model | [Riesselman, A.J., Ingraham, J., & Marks, D.S. (2018). Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15, 816-822.](https://www.nature.com/articles/s41592-018-0138-4)\nGEMME | Alignment-based model | [Laine, \u00c9., Karami, Y., & Carbone, A. (2019). GEMME: A Simple and Fast Global Epistatic Model Predicting Mutational Effects. Molecular Biology and Evolution, 36, 2604 - 2619.](https://pubmed.ncbi.nlm.nih.gov/31406981/)\nEVE | Alignment-based model | [Frazer, J., Notin, P., Dias, M., Gomez, A.N., Min, J.K., Brock, K.P., Gal, Y., & Marks, D.S. (2021). Disease variant prediction with deep generative models of evolutionary data. Nature.](https://www.nature.com/articles/s41586-021-04043-8)\nUnirep | Protein language model | [Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., & Church, G.M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 1-8](https://www.nature.com/articles/s41592-019-0598-1)\nESM-1b | Protein language model | Original model: [Rives, A., Goyal, S., Meier, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., & Fergus, R. (2019). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 118](https://www.biorxiv.org/content/10.1101/622803v4); Extensions: [Brandes, N., Goldman, G., Wang, C.H. et al. Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet 55, 1512\u20131522 (2023).](https://doi.org/10.1038/s41588-023-01465-0)\nESM-1v | Protein language model | [Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., & Rives, A. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. 
NeurIPS.](https://proceedings.neurips.cc/paper/2021/hash/f51338d736f95dd42427296047067694-Abstract.html)\nVESPA | Protein language model | [Marquet, C., Heinzinger, M., Olenyi, T., Dallago, C., Bernhofer, M., Erckert, K., & Rost, B. (2021). Embeddings from protein language models predict conservation and variant effects. Human Genetics, 141, 1629 - 1647.](https://link.springer.com/article/10.1007/s00439-021-02411-y)\nRITA | Protein language model | [Hesslow, D., Zanichelli, N., Notin, P., Poli, I., & Marks, D.S. (2022). RITA: a Study on Scaling Up Generative Protein Sequence Models. ArXiv, abs/2205.05789.](https://arxiv.org/abs/2205.05789)\nProtGPT2 | Protein language model | [Ferruz, N., Schmidt, S., & H\u00f6cker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13.](https://www.nature.com/articles/s41467-022-32007-7)\nProGen2 | Protein language model | [Nijkamp, E., Ruffolo, J.A., Weinstein, E.N., Naik, N., & Madani, A. (2022). ProGen2: Exploring the Boundaries of Protein Language Models. ArXiv, abs/2206.13517.](https://arxiv.org/abs/2206.13517)\nMSA Transformer | Hybrid - Alignment & PLM |[Rao, R., Liu, J., Verkuil, R., Meier, J., Canny, J.F., Abbeel, P., Sercu, T., & Rives, A. (2021). MSA Transformer. ICML.](http://proceedings.mlr.press/v139/rao21a.html)\nTranception | Hybrid - Alignment & PLM | [Notin, P., Dias, M., Frazer, J., Marchena-Hurtado, J., Gomez, A.N., Marks, D.S., & Gal, Y. (2022). Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. ICML.](https://proceedings.mlr.press/v162/notin22a.html)\nTranceptEVE | Hybrid - Alignment & PLM | [Notin, P., Van Niekerk, L., Kollasch, A., Ritter, D., Gal, Y. & Marks, D.S. & (2022). TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction. 
NeurIPS, LMRL workshop.](https://www.biorxiv.org/content/10.1101/2022.12.07.519495v1?rss=1)\nCARP | Protein language model | [Yang, K.K., Fusi, N., Lu, A.X. (2022). Convolutions are competitive with transformers for protein sequence pretraining.](https://doi.org/10.1101/2022.05.19.492714)\nMIF | Inverse folding | [Yang, K.K., Yeh, H., Zanichelli, N. (2022). Masked Inverse Folding with Sequence Transfer for Protein Representation Learning.](https://doi.org/10.1101/2022.05.25.493516)\nProteinMPNN | Inverse folding | [J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, A. Courbet, R. J. de Haas, N. Bethel, P. J. Y. Leung, T. F. Huddy, S. Pellock, D. Tischer, F. Chan,B. Koepnick, H. Nguyen, A. Kang, B. Sankaran,A. K. Bera, N. P. King,D. Baker (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science, Vol 378.](https://www.science.org/doi/10.1126/science.add2187)\nESM-IF1 | Inverse folding | [Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, Alexander Rives (2022). Learning Inverse Folding from Millions of Predicted Structures. ICML](https://www.biorxiv.org/content/10.1101/2022.04.10.487779v2.full.pdf+html)\nProtSSN | Hybrid - Structure & PLM | [Yang Tan, Bingxin Zhou, Lirong Zheng, Guisheng Fan, Liang Hong. (2023). Semantical and Topological Protein Encoding Toward Enhanced Bioactivity and Thermostability.](https://www.biorxiv.org/content/10.1101/2023.12.01.569522v1)\nSaProt | Hybrid - Structure & PLM | [Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, Fajie Yuan. (2024). SaProt: Protein Language Modeling with Structure-aware Vocabulary. 
ICLR](https://www.biorxiv.org/content/10.1101/2023.10.01.560349v5)\n\nExcept for the WaveNet model (which only uses alignments to recover a set of homologous protein sequences to train on, but then trains on non-aligned sequences), all alignment-based methods are unable to score indels given the fixed coordinate system they are trained on. Similarly, the masked-marginals scoring procedure used for ESM-1v and MSA Transformer requires each mutated position to exist in the wild-type sequence. All the other model architectures listed above (eg., Tranception, RITA, ProGen2) are included in the indel benchmark.\n\nFor clinical baselines, we used dbNSFP 4.4a as detailed in the manuscript appendix (and in `proteingym/clinical_benchmark_notebooks/clinical_subs_processing.ipynb`).\n\n## Resources\n\nTo download and unzip the data, use the following template, replacing {VERSION} with the desired version number (e.g., \"v1.1\") and {FILENAME} with the specific file you want to download, as listed in the table below. 
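Several archives can also be fetched in one loop built from the same URL template; the sketch below uses file names from the table and, as a precaution, only prints the URLs it would fetch (uncomment the curl/unzip line to actually download):

```shell
# Sketch: build download URLs for a few ProteinGym archives from the template.
# Uncomment the curl/unzip line to actually download and unpack each one.
VERSION="v1.1"
BASE="https://marks.hms.harvard.edu/proteingym/ProteinGym_${VERSION}"
for FILENAME in DMS_ProteinGym_substitutions.zip DMS_ProteinGym_indels.zip DMS_msa_files.zip; do
  echo "${BASE}/${FILENAME}"
  # curl -o "${FILENAME}" "${BASE}/${FILENAME}" && unzip "${FILENAME}" && rm "${FILENAME}"
done
```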
The latest version is v1.1.\nFor example, you can download & unzip all processed DMS substitution assays as follows:\n```\nVERSION=\"v1.1\"\nFILENAME=\"DMS_ProteinGym_substitutions.zip\"\ncurl -o ${FILENAME} https://marks.hms.harvard.edu/proteingym/ProteinGym_${VERSION}/${FILENAME}\nunzip ${FILENAME} && rm ${FILENAME}\n```\n\nData | Size (unzipped) | Filename\n--- | --- | --- |\nDMS benchmark - Substitutions | 1.0GB | DMS_ProteinGym_substitutions.zip\nDMS benchmark - Indels | 200MB | DMS_ProteinGym_indels.zip\nZero-shot DMS Model scores - Substitutions | 31GB | zero_shot_substitutions_scores.zip\nZero-shot DMS Model scores - Indels | 5.2GB | zero_shot_indels_scores.zip\nSupervised DMS Model performance - Substitutions | 2.7MB | DMS_supervised_substitutions_scores.zip\nSupervised DMS Model performance - Indels | 0.9MB | DMS_supervised_indels_scores.zip\nMultiple Sequence Alignments (MSAs) for DMS assays | 5.2GB | DMS_msa_files.zip\nRedundancy-based sequence weights for DMS assays | 200MB | DMS_msa_weights.zip\nPredicted 3D structures from inverse-folding models | 84MB | ProteinGym_AF2_structures.zip\nClinical benchmark - Substitutions | 123MB | clinical_ProteinGym_substitutions.zip\nClinical benchmark - Indels | 2.8MB | clinical_ProteinGym_indels.zip\nClinical MSAs | 17.8GB | clinical_msa_files.zip\nClinical MSA weights | 250MB | clinical_msa_weights.zip\nClinical Model scores - Substitutions | 0.9GB | zero_shot_clinical_substitutions_scores.zip\nClinical Model scores - Indels | 0.7GB | zero_shot_clinical_indels_scores.zip\nCV folds - Substitutions - Singles | 50MB | cv_folds_singles_substitutions.zip\nCV folds - Substitutions - Multiples | 81MB | cv_folds_multiples_substitutions.zip\nCV folds - Indels | 19MB | cv_folds_indels.zip\n\nWe also host the raw DMS assays (before preprocessing):\n\nData | Size (unzipped) | Filename\n--- | --- | --- |\nDMS benchmark: Substitutions (raw) | 500MB | substitutions_raw_DMS.zip\nDMS benchmark: 
Indels (raw) | 450MB | indels_raw_DMS.zip\nClinical benchmark: Substitutions (raw) | 58MB | substitutions_raw_clinical.zip\nClinical benchmark: Indels (raw) | 12.4MB | indels_raw_clinical.zip\n\n## How to contribute?\n\n### New assays\nIf you would like to suggest new assays to be part of ProteinGym, please raise an issue on this repository with a `new_assay` label. The criteria we typically consider for inclusion are as follows:\n1. The corresponding raw dataset needs to be publicly available\n2. The assay needs to be protein-related (ie., exclude UTR, tRNA, promoter, etc.)\n3. The dataset needs to have a sufficient number of measurements\n4. The assay needs to have a sufficiently high dynamic range\n5. The assay has to be relevant to fitness prediction\n\n### New baselines\nIf you would like new baselines to be included in ProteinGym (ie., website, performance files, detailed scoring files), please follow these steps:\n1. Submit a PR to our repo with two things:\n - A new subfolder under proteingym/baselines named with your new model name. This subfolder should include a python scoring script similar to [this script](https://github.com/OATML-Markslab/ProteinGym/blob/main/proteingym/baselines/rita/compute_fitness.py), as well as all code dependencies required for the scoring script to run properly\n - An example bash script (e.g., under scripts/scoring_DMS_zero_shot) with all relevant hyperparameters for scoring, similar to [this script](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/scoring_DMS_zero_shot/scoring_RITA_substitutions.sh)\n2. Raise an issue with a `new model` label, providing instructions on how to download relevant model checkpoints for scoring, and reporting the performance of your model on the relevant benchmark using our performance scripts (e.g., [for zero-shot DMS benchmarks](https://github.com/OATML-Markslab/ProteinGym/blob/main/proteingym/performance_DMS_benchmarks.py)). 
Please note that our DMS performance scripts correct for various biases (e.g., number of assays per protein family and function groupings) and thus the resulting aggregated performance is not the same as the arithmetic average across assays.\n\nAt this point we are only considering new baselines satisfying the following conditions:\n1. The model is able to score all mutants in the relevant benchmark (to ensure all models are compared exactly on the same set of mutants everywhere);\n2. The corresponding model is open source (we should be able to reproduce scores if needed).\n\nAt this stage, we are only considering requests for which all model scores for all mutants in a given benchmark (substitution or indel) are provided by the requester; but we are planning on regularly scoring new baselines ourselves for methods with wide adoption by the community and/or suggestions with many upvotes.\n\n### Notes\n12 December 2023: The code for training and evaluating supervised models is currently shared in https://github.com/OATML-Markslab/ProteinNPT. We are in the process of integrating the code into this repo.\n\n## Usage and reproducibility\n\nIf you would like to compute all performance metrics for the various benchmarks, please follow these steps:\n1. Download locally all relevant files as per the instructions above (see Resources)\n2. Update the paths for all files downloaded in the prior step in the [config script](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/zero_shot_config.sh)\n3. If adding a new model, adjust the [config.json](https://github.com/OATML-Markslab/ProteinGym/blob/main/config.json) file accordingly and add the model scores to the relevant path (e.g., [DMS_output_score_folder_subs](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/zero_shot_config.sh#L19))\n4. If focusing on DMS benchmarks, run the [merge script](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/scoring_DMS_zero_shot/merge_all_scores.sh). 
This will create a single file for each DMS assay, with scores for all model baselines.\n5. Run the relevant performance script (eg., [scripts/scoring_DMS_zero_shot/performance_substitutions.sh](https://github.com/OATML-Markslab/ProteinGym/blob/main/scripts/scoring_DMS_zero_shot/performance_substitutions.sh))\n\n## Acknowledgements\n\nOur codebase leveraged code from the following repositories to compute baselines:\n\nModel | Repo\n--- | ---\nUniRep | https://github.com/churchlab/UniRep\nUniRep | https://github.com/chloechsu/combining-evolutionary-and-assay-labelled-data\nEVE | https://github.com/OATML-Markslab/EVE\nGEMME | https://hub.docker.com/r/elodielaine/gemme\nESM | https://github.com/facebookresearch/esm\nEVmutation | https://github.com/debbiemarkslab/EVcouplings\nProGen2 | https://github.com/salesforce/progen\nHMMER | https://github.com/EddyRivasLab/hmmer\nMSA Transformer | https://github.com/rmrao/msa-transformer\nProtGPT2 | https://huggingface.co/nferruz/ProtGPT2\nProteinMPNN | https://github.com/dauparas/ProteinMPNN\nRITA | https://github.com/lightonai/RITA\nTranception | https://github.com/OATML-Markslab/Tranception\nVESPA | https://github.com/Rostlab/VESPA\nCARP | https://github.com/microsoft/protein-sequence-models\nMIF | https://github.com/microsoft/protein-sequence-models\nFoldseek | https://github.com/steineggerlab/foldseek\nProtSSN | https://github.com/tyang816/ProtSSN\nSaProt | https://github.com/westlake-repl/SaProt\n\nWe would like to thank the GEMME team for providing model scores on an earlier version of the benchmark (ProteinGym v0.1), and the ProtSSN and SaProt teams for integrating their models into the ProteinGym repo.\n\nSpecial thanks to the teams of experimentalists who developed and performed the assays that ProteinGym is built on. If you are using ProteinGym in your work, please consider citing the corresponding papers. 
To facilitate this, we have prepared a file (assays.bib) containing the bibtex entries for all these papers.\n\n## Releases\n\n1. [ProteinGym_v1.0](https://zenodo.org/records/13932633): Initial release.\n2. [ProteinGym_v1.1](https://zenodo.org/records/13936340): Updates to reference file, and addition of ProtSSN and SaProt baselines.\n\n## License\nThis project is available under the MIT license found in the LICENSE file in this GitHub repository.\n\n## Reference\nIf you use ProteinGym in your work, please cite the following paper:\n```bibtex\n@inproceedings{NEURIPS2023_cac723e5,\n author = {Notin, Pascal and Kollasch, Aaron and Ritter, Daniel and van Niekerk, Lood and Paul, Steffanie and Spinner, Han and Rollins, Nathan and Shaw, Ada and Orenbuch, Rose and Weitzman, Ruben and Frazer, Jonathan and Dias, Mafalda and Franceschi, Dinko and Gal, Yarin and Marks, Debora},\n booktitle = {Advances in Neural Information Processing Systems},\n editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},\n pages = {64331--64379},\n publisher = {Curran Associates, Inc.},\n title = {ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design},\n url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/cac723e5ff29f65e3fcbb0739ae91bee-Paper-Datasets_and_Benchmarks.pdf},\n volume = {36},\n year = {2023}\n}\n```\n\n## Links\n- Website: https://www.proteingym.org/\n- NeurIPS proceedings: [link to abstract](https://papers.nips.cc/paper_files/paper/2023/hash/cac723e5ff29f65e3fcbb0739ae91bee-Abstract-Datasets_and_Benchmarks.html)\n- Preprint: [link to abstract](https://www.biorxiv.org/content/10.1101/2023.12.07.570727v1)\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction",
"version": "1.1",
"project_urls": {
"Homepage": "https://github.com/OATML-Markslab/ProteinGym"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bd227d197666f313df0f7913c9c93a9ba36fbf7dd3fb55cbac066ac64bece28b",
"md5": "15e2e02ff8ef89cc1fbe93f5967dc84b",
"sha256": "430827aa896598dbde7e462b757f9cf3f8257a1ee07c4765bb21b0654e11b93e"
},
"downloads": -1,
"filename": "proteingym-1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "15e2e02ff8ef89cc1fbe93f5967dc84b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 20714,
"upload_time": "2024-10-15T21:51:07",
"upload_time_iso_8601": "2024-10-15T21:51:07.269052Z",
"url": "https://files.pythonhosted.org/packages/bd/22/7d197666f313df0f7913c9c93a9ba36fbf7dd3fb55cbac066ac64bece28b/proteingym-1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3e3a50f7db67969cccc84cbfee30cb37d8fef01215128ecc9607feb6a5e5fdb4",
"md5": "5fac121bb609891a9709583486e7d88d",
"sha256": "19529d0d8727c698d347b0a4c632592c6aa33f6cf5a4d4e5238a7645cd349c3e"
},
"downloads": -1,
"filename": "proteingym-1.1.tar.gz",
"has_sig": false,
"md5_digest": "5fac121bb609891a9709583486e7d88d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 26363,
"upload_time": "2024-10-15T21:51:08",
"upload_time_iso_8601": "2024-10-15T21:51:08.307231Z",
"url": "https://files.pythonhosted.org/packages/3e/3a/50f7db67969cccc84cbfee30cb37d8fef01215128ecc9607feb6a5e5fdb4/proteingym-1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-15 21:51:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "OATML-Markslab",
"github_project": "ProteinGym",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "proteingym"
}