Lamb, Kieran D. and Hughes, Joseph and Lytras, Spyros and Young, Francesca and Koci, Orges and Herzig, James and Lovell, Simon C. and Grove, Joe and Yuan, Ke and Robertson, David L. (2024) From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences. bioRxiv.
AI Summary:
The study demonstrates that protein language models (PLMs) can capture features of protein three-dimensional structure from amino acid sequences alone, without requiring multiple sequence alignments. The PLM, ESM-2, has learned the sequence context within which variation occurs, capturing evolutionary constraint.
Available under License Creative Commons Attribution Non-commercial No Derivatives.
Protein language models (PLMs) capture features of protein three-dimensional structure from amino acid sequences alone, without requiring multiple sequence alignments (MSA). The concepts of grammar and semantics from natural language have the potential to capture functional properties of proteins. Here, we investigate how these representations enable assessment of variation due to mutation. Applied to SARS-CoV-2’s spike protein using in silico deep mutational scanning (DMS), we demonstrate that the PLM ESM-2 has learned the sequence context within which variation occurs, capturing evolutionary constraint. This recapitulates what conventionally requires MSA data to predict. Unlike other state-of-the-art methods, which require protein structures or multiple sequences for training, we show what can be accomplished using an unmodified pretrained PLM. We demonstrate that the grammaticality and semantic scores represent novel metrics. Applied to SARS-CoV-2 variants across the pandemic, we show that ESM-2 representations encode the evolutionary history between variants, as well as the distinct nature of variants of concern upon their emergence, associated with shifts in receptor binding and antigenicity. PLM likelihoods can also identify epistatic interactions among sites in the protein. Our results affirm that PLMs are broadly useful for variant-effect prediction, including unobserved changes, and can be applied to understand novel viral pathogens, with the potential to be applied to any protein sequence, pathogen or otherwise.
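To make the idea of in silico deep mutational scanning concrete, the following is a minimal sketch and not the authors' implementation. It assumes a PLM has already produced, for each position, a probability for every amino acid (`position_probs`, a hypothetical stand-in for ESM-2 output) and a fixed-length embedding per sequence. It also assumes one common pair of definitions that the abstract itself does not spell out: grammaticality as the model probability of the mutant residue at its position, and semantic change as the L1 distance between the reference and mutant embeddings. All function and parameter names are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def in_silico_dms(sequence, position_probs):
    """Score every single-residue substitution of `sequence`.

    `position_probs[i]` is assumed to map each amino acid to the model's
    probability of that residue at position i (hypothetical PLM output).
    """
    scores = {}
    for i, wild_type in enumerate(sequence):
        probs = position_probs[i]
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # skip the identity "mutation"
            label = f"{wild_type}{i + 1}{aa}"  # 1-based label, e.g. M1A
            scores[label] = {
                # Assumed definition: probability of the mutant residue.
                "grammaticality": probs[aa],
                # Log-likelihood ratio of mutant versus wild-type residue.
                "log_ratio": math.log(probs[aa]) - math.log(probs[wild_type]),
            }
    return scores


def semantic_change(ref_embedding, mutant_embedding):
    """Assumed definition: L1 distance between two sequence embeddings."""
    return sum(abs(a - b) for a, b in zip(ref_embedding, mutant_embedding))


def toy_position_probs(sequence, wild_type_prob=0.62):
    """Toy stand-in for a PLM: favours the wild-type residue everywhere."""
    other = (1.0 - wild_type_prob) / (len(AMINO_ACIDS) - 1)
    return [
        {aa: (wild_type_prob if aa == wt else other) for aa in AMINO_ACIDS}
        for wt in sequence
    ]
```

In practice the toy probabilities would be replaced by per-position likelihoods from a pretrained model, and the embeddings by its sequence representations; each of the 19 alternative residues at every position then receives a grammaticality and a semantic score.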
Title | From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences |
---|---|
Creators | Lamb, Kieran D. and Hughes, Joseph and Lytras, Spyros and Young, Francesca and Koci, Orges and Herzig, James and Lovell, Simon C. and Grove, Joe and Yuan, Ke and Robertson, David L. |
Identification Number | 10.1101/2024.07.05.602129 |
Date | 18 September 2024 |
Divisions | College of Medical Veterinary and Life Sciences > School of Cancer Sciences; College of Medical Veterinary and Life Sciences > School of Infection & Immunity |
Additional Information | The authors acknowledge funding from the UK Medical Research Council (MRC, MC_UU_12014/12, MC_UU_00034/5, MR/V01157X/1 and a Doctoral Training Programme in Precision Medicine studentship for KDL, MR/N013166/1), the Wellcome Trust (220977/Z/20/Z), the UK Research and Innovation (UKRI) to the G2P-UK consortium (MR/W005611/1) and G2P2 consortium (MR/Y004205), and the COVID-19 Genomics UK Consortium (COG-UK), which was supported by funding from the MRC, part of UKRI, the UK National Institute of Health and Care Research (MC_PC_19027) and Genome Research Limited, operating as the Wellcome Sanger Institute. |
URI | https://pub.demo35.eprints-hosting.org/id/eprint/154 |
---|
Item Type | Article |
---|---|
Depositing User | Unnamed user with email ejo1f20@soton.ac.uk |
Date Deposited | 11 Jun 2025 16:35 |
Revision | 12 |
Last Modified | 12 Jun 2025 12:00 |