ProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants curated to enable thorough comparisons of various mutation effect predictors in different regimes. Both the DMS assays and clinical variants are divided into 1) a substitution benchmark which currently consists of the experimental characterisation of ~2.7M missense variants across 217 DMS assays and 2,525 clinical proteins, and 2) an indel benchmark that includes ~300k mutants across 74 DMS assays and 1,555 clinical proteins.
Each processed file in each benchmark corresponds to a single DMS assay or clinical protein, and contains the following variables:
Additionally, we provide two reference files for each benchmark that give further details on each assay and contain in particular:
To download the benchmarks, please see "DMS benchmark: Substitutions" and "DMS benchmark: Indels" in the "Resources" section below.
The benchmarks folder provides detailed performance files for all baselines on the DMS and clinical benchmarks.
Metrics for DMS assays (both supervised and zero-shot): Spearman, NDCG, AUC, MCC and Top-K recall
Metrics for clinical benchmark: AUC
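As a rough illustration of two of the DMS metrics listed above, the sketch below computes a Spearman rank correlation and a top-K recall using only the Python standard library. The function names, the 10% cutoff for top-K recall, and the toy scores are assumptions made for this example; this is not the repository's scoring code.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def top_k_recall(predicted, measured, fraction=0.1):
    """Share of the truly best mutants that also appear in the predicted top set.

    The 10% default is an assumption of this sketch.
    """
    k = max(1, int(len(measured) * fraction))
    indices = range(len(measured))
    top_true = set(sorted(indices, key=lambda i: measured[i], reverse=True)[:k])
    top_pred = set(sorted(indices, key=lambda i: predicted[i], reverse=True)[:k])
    return len(top_true & top_pred) / k


# Toy data (hypothetical): experimental DMS scores and model scores for 5 mutants.
dms_scores = [0.1, 0.5, 0.9, 0.3, 0.7]
model_scores = [-2.0, -1.0, -0.2, -1.5, -0.1]
print(spearman(model_scores, dms_scores))
print(top_k_recall(model_scores, dms_scores, fraction=0.4))
```

Rank-based metrics such as Spearman only depend on the ordering of the model scores, which is why models with very different score scales can be compared on the same assay.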
Metrics are aggregated as follows:
These files are named e.g. DMS_substitutions_Spearman_DMS_level.csv, DMS_substitutions_Spearman_Uniprot_level and DMS_substitutions_Spearman_Uniprot_Selection_Type_level respectively for these different steps.
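The file names above suggest three levels: per assay, per UniProt ID, and per UniProt ID and selection type. A minimal sketch of such a hierarchical average is shown below. The exact grouping and averaging rules are assumptions inferred from the file names, and the row layout, function name, and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows):
    """Aggregate assay-level metric values in two steps.

    rows: iterable of (uniprot_id, selection_type, value), one per DMS assay.
    Returns (uniprot_level, selection_type_level, overall).
    """
    # Step 1: average assays that share a UniProt ID (within a selection type),
    # so proteins with many assays do not dominate the final number.
    per_protein = defaultdict(list)
    for uniprot_id, selection_type, value in rows:
        per_protein[(uniprot_id, selection_type)].append(value)
    uniprot_level = {key: mean(vals) for key, vals in per_protein.items()}

    # Step 2: average the protein-level values within each selection type.
    per_type = defaultdict(list)
    for (_, selection_type), value in uniprot_level.items():
        per_type[selection_type].append(value)
    selection_type_level = {t: mean(vals) for t, vals in per_type.items()}

    # Final number: unweighted mean across selection types.
    overall = mean(list(selection_type_level.values()))
    return uniprot_level, selection_type_level, overall


# Hypothetical assay-level Spearman values.
assay_rows = [
    ("P1", "Activity", 0.4),
    ("P1", "Activity", 0.6),
    ("P2", "Activity", 0.3),
    ("P3", "Stability", 0.8),
]
uniprot_level, selection_type_level, overall = aggregate(assay_rows)
print(selection_type_level, overall)
```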
Other deep dives (performance split by taxa, MSA depth, mutational depth and more) are all contained in the benchmarks/DMS_zero_shot/substitutions/Spearman/Summary_performance_DMS_substitutions_Spearman.csv file (resp. DMS_indels/clinical_substitutions/clinical_indels and their supervised counterparts). These files are also the ones hosted on the website.
We also include, as on the website, a bootstrapped standard error of these aggregated metrics to reflect the variance in the final numbers with respect to the individual assays.
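One common way to obtain such a standard error is to resample the assay-level values with replacement and take the standard deviation of the resampled means. The sketch below illustrates that idea; the resampling unit, the number of resamples, and the function name are assumptions of this example and may differ from what the ProteinGym scripts do.

```python
import random
from statistics import mean, stdev


def bootstrap_standard_error(values, n_resamples=1000, seed=0):
    """Estimate the standard error of the mean of `values` by bootstrapping.

    Each resample draws len(values) items with replacement; the standard
    deviation of the resampled means is the bootstrap standard error.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(values)
    resampled_means = [mean(rng.choices(values, k=n)) for _ in range(n_resamples)]
    return stdev(resampled_means)


# Hypothetical assay-level Spearman values for one model.
assay_spearman = [0.45, 0.31, 0.52, 0.38, 0.60, 0.27, 0.49]
print(mean(assay_spearman), bootstrap_standard_error(assay_spearman))
```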
To calculate the DMS substitution benchmark metrics:
./scripts/scoring_DMS_zero_shot/performance_substitutions.sh
And for indels, follow step #1 and run ./scripts/scoring_DMS_zero_shot/performance_substitutions_indels.sh.
The full ProteinGym benchmarks performance files are also accessible via our dedicated website: https://www.proteingym.org/. It includes leaderboards for the substitution and indel benchmarks, as well as detailed DMS-level performance files for all baselines. The current version of the substitution benchmark includes the following baselines:
Model | Type | Reference |
---|---|---|
ESM-1v | Protein language model | Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., & Rives, A. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. NeurIPS. |
VESPA | Protein language model | Marquet, C., Heinzinger, M., Olenyi, T., Dallago, C., Bernhofer, M., Erckert, K., & Rost, B. (2021). Embeddings from protein language models predict conservation and variant effects. Human Genetics, 141, 1629-1647. |
RITA | Protein language model | Hesslow, D., Zanichelli, N., Notin, P., Poli, I., & Marks, D.S. (2022). RITA: a Study on Scaling Up Generative Protein Sequence Models. ArXiv, abs/2205.05789. |
ProtGPT2 | Protein language model | Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13. |
ProGen2 | Protein language model | Nijkamp, E., Ruffolo, J.A., Weinstein, E.N., Naik, N., & Madani, A. (2022). ProGen2: Exploring the Boundaries of Protein Language Models. ArXiv, abs/2206.13517. |
MSA Transformer | Hybrid | Rao, R., Liu, J., Verkuil, R., Meier, J., Canny, J.F., Abbeel, P., Sercu, T., & Rives, A. (2021). MSA Transformer. ICML. |
Tranception | Hybrid | Notin, P., Dias, M., Frazer, J., Marchena-Hurtado, J., Gomez, A.N., Marks, D.S., & Gal, Y. (2022). Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. ICML. |
TranceptEVE | Hybrid | Notin, P., Van Niekerk, L., Kollasch, A., Ritter, D., Gal, Y. & Marks, D.S. (2022). TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction. NeurIPS, LMRL workshop. |
CARP | Protein language model | Yang, K.K., Fusi, N., Lu, A.X. (2022). Convolutions are competitive with transformers for protein sequence pretraining. |
MIF | Inverse folding | Yang, K.K., Yeh, H., Zanichelli, N. (2022). Masked Inverse Folding with Sequence Transfer for Protein Representation Learning. |
Except for the WaveNet model (which only uses alignments to recover a set of homologous protein sequences, and then trains on the non-aligned sequences), all alignment-based methods are unable to score indels given the fixed coordinate system they are trained on. Similarly, the masked-marginals procedure used to score mutants with ESM-1v and MSA Transformer requires the position to exist in the wild-type sequence. All the other model architectures listed above (e.g., Tranception, RITA, ProGen2) are included in the indel benchmark.
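To make the constraint on masked-marginals scoring concrete, here is a toy sketch of the heuristic as described by Meier et al. (2021): for each mutated position, the score adds the log-probability of the mutant residue minus that of the wild-type residue, both predicted with that position masked in the wild-type sequence. The log-probability table stands in for a real model's output, and the function names, the ":" separator for multiple substitutions, and the example strings are assumptions of this sketch.

```python
import math
import re


def parse_substitution(mutant):
    """Split e.g. 'K2R' into (wild-type residue, 1-based position, mutant residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutant)
    if match is None:
        raise ValueError(f"not a single substitution: {mutant!r}")
    wt, pos, mt = match.groups()
    return wt, int(pos), mt


def masked_marginal_score(wild_type, mutant, masked_log_probs):
    """Sum of log p(mutant residue) - log p(wild-type residue) over mutated positions.

    masked_log_probs[i] maps residue -> log-probability at 0-based position i,
    as predicted when position i is masked in the wild-type sequence.
    """
    total = 0.0
    for substitution in mutant.split(":"):
        wt, pos, mt = parse_substitution(substitution)
        if not 1 <= pos <= len(wild_type) or wild_type[pos - 1] != wt:
            raise ValueError(f"{substitution!r} does not match the wild-type sequence")
        log_probs = masked_log_probs[pos - 1]
        total += log_probs[mt] - log_probs[wt]
    return total


# Hypothetical 3-residue wild type with made-up per-position probabilities.
toy_wild_type = "MKV"
toy_log_probs = [
    {"M": math.log(0.9), "L": math.log(0.1)},
    {"K": math.log(0.6), "R": math.log(0.3), "E": math.log(0.1)},
    {"V": math.log(0.7), "I": math.log(0.3)},
]
print(masked_marginal_score(toy_wild_type, "K2R", toy_log_probs))
```

An insertion has no wild-type position to mask, so it cannot be expressed in this form, whereas a model that assigns a likelihood to a full sequence can score it directly.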
For clinical baselines, we used dbNSFP 4.4a as detailed in the manuscript appendix (and in proteingym/clinical_benchmark_notebooks/clinical_subs_processing.ipynb).
To download and unzip the data, run the following commands for each of the data sources you would like to download, e.g., for all the baseline scores on the DMS substitution benchmark:
curl -o scores_all_models_proteingym_substitutions.zip https://marks.hms.harvard.edu/proteingym/scores_all_models_proteingym_substitutions.zip
unzip scores_all_models_proteingym_substitutions.zip
rm scores_all_models_proteingym_substitutions.zip
We also host the raw DMS assays and clinical data (before preprocessing):
Data | Size (unzipped) | Link |
---|---|---|
DMS benchmark: Substitutions (raw) | 500MB | https://marks.hms.harvard.edu/proteingym/substitutions_raw_DMS.zip |
DMS benchmark: Indels (raw) | 450MB | https://marks.hms.harvard.edu/proteingym/indels_raw_DMS.zip |
Clinical benchmark: Substitutions (raw) | 58MB | https://marks.hms.harvard.edu/proteingym/substitutions_raw_clinical.zip |
Clinical benchmark: Indels (raw) | 12.4MB | https://marks.hms.harvard.edu/proteingym/indels_raw_clinical.zip |
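For scripted downloads, the curl/unzip/rm sequence can be mirrored in Python. The sketch below uses the archive names from the table above; the dictionary keys and function names are hypothetical, and the download function performs a network request when called, so nothing is fetched on import.

```python
import urllib.request
import zipfile
from pathlib import Path

BASE_URL = "https://marks.hms.harvard.edu/proteingym"

# Archive file names taken from the table above; the short keys are made up here.
RAW_ARCHIVES = {
    "dms_substitutions_raw": "substitutions_raw_DMS.zip",
    "dms_indels_raw": "indels_raw_DMS.zip",
    "clinical_substitutions_raw": "substitutions_raw_clinical.zip",
    "clinical_indels_raw": "indels_raw_clinical.zip",
}


def archive_url(name):
    """Build the download URL for one of the archives listed above."""
    return f"{BASE_URL}/{RAW_ARCHIVES[name]}"


def extract_and_remove(zip_path, destination):
    """Unzip `zip_path` into `destination`, then delete the archive (like `rm`)."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
    Path(zip_path).unlink()


def download_raw_archive(name, destination="."):
    """Download one archive and unpack it; requires network access."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    zip_path = destination / RAW_ARCHIVES[name]
    urllib.request.urlretrieve(archive_url(name), str(zip_path))
    extract_and_remove(zip_path, destination)


# Example (commented out to avoid an accidental 12.4MB download):
# download_raw_archive("clinical_indels_raw", destination="data")
```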
If you would like to suggest new assays to be part of ProteinGym, please raise an issue on this repository with a "new_assay" label. The criteria we typically consider for inclusion are as follows:
If you would like new baselines to be included in ProteinGym (i.e., website, performance files, detailed scoring files), please follow the steps below:
At this point we are only considering new baselines satisfying the following conditions:
At this stage, we are only considering requests for which all model scores for all mutants in a given benchmark (substitution or indel) are provided by the requester. However, we plan to regularly score new baselines ourselves for methods with wide adoption by the community and/or suggestions with many upvotes.
12 December 2023: The code for training and evaluating supervised models is currently shared in https://github.com/OATML-Markslab/ProteinNPT. We are in the process of moving the code to this repo.
If you would like to compute all performance metrics for the various benchmarks, please follow the steps below:
Our codebase leveraged code from the following repositories to compute baselines:
We would like to also thank the teams of experimentalists who developed and performed the assays that ProteinGym is built on. If you are using ProteinGym in your work, please consider citing the corresponding papers. To facilitate this, we have prepared a file (assays.bib) containing the bibtex entries for all these papers.
This project is available under the MIT license found in the LICENSE file in this GitHub repository.
If you use ProteinGym in your work, please cite the following paper: ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction
Website: https://www.proteingym.org/