Lamb, Kieran D. and Hughes, Joseph and Lytras, Spyros and Young, Francesca and Koci, Orges and Herzig, James and Lovell, Simon C. and Grove, Joe and Yuan, Ke and Robertson, David L. (2024) From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences. bioRxiv.

Abstract

Protein language models (PLMs) capture features of protein three-dimensional structure from amino acid sequences alone, without requiring multiple sequence alignments (MSAs). The concepts of grammar and semantics from natural language have the potential to capture functional properties of proteins. Here, we investigate how these representations enable assessment of variation due to mutation. Applying in silico deep mutational scanning (DMS) to the SARS-CoV-2 spike protein, we demonstrate that the PLM ESM-2 has learned the sequence context within which variation occurs, capturing evolutionary constraint. This recapitulates what conventionally requires MSA data to predict. Whereas other state-of-the-art methods require protein structures or multiple sequences for training, we show what can be accomplished using an unmodified pretrained PLM. We demonstrate that the grammaticality and semantic scores represent novel metrics. Applying the model to SARS-CoV-2 variants from across the pandemic, we show that ESM-2 representations encode the evolutionary history between variants, as well as the distinct nature of variants of concern upon their emergence, associated with shifts in receptor binding and antigenicity. PLM likelihoods can also identify epistatic interactions among sites in the protein. Our results affirm that PLMs are broadly useful for variant-effect prediction, including for unobserved changes, and can be applied to understand novel viral pathogens, with the potential to extend to any protein sequence, pathogen or otherwise.
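As a rough illustration of what an in silico deep mutational scan involves, the sketch below enumerates every single-residue substitution of a sequence and scores each one from per-position log-probabilities. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the toy sequence, and the log-likelihood-ratio score are illustrative choices, and `uniform_log_probs` is a hypothetical placeholder for the per-position output of a PLM such as ESM-2. The abstract does not define the grammaticality or semantic scores, so neither is reproduced here.

```python
import math

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_log_probs(sequence):
    """Hypothetical stand-in for a PLM: every residue is equally likely
    at every position. A real scan would use model-derived values."""
    log_p = math.log(1.0 / len(AMINO_ACIDS))
    return [{aa: log_p for aa in AMINO_ACIDS} for _ in sequence]


def in_silico_dms(sequence, log_probs):
    """Score every single-residue substitution of `sequence`.

    `log_probs[i][aa]` is the log-probability of residue `aa` at
    position i. Returns a dict mapping labels such as 'M1A'
    (wild type, 1-based position, mutant) to
    log p(mutant) - log p(wild type), an assumed scoring choice.
    """
    scores = {}
    for pos, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant == wild_type:
                continue  # skip the unchanged residue
            label = f"{wild_type}{pos + 1}{mutant}"
            scores[label] = log_probs[pos][mutant] - log_probs[pos][wild_type]
    return scores


# Toy example: a five-residue sequence yields 5 * 19 single mutants.
toy_sequence = "MFVFL"
dms_scores = in_silico_dms(toy_sequence, uniform_log_probs(toy_sequence))
```

With the uniform placeholder every substitution scores zero. Swapping in model-derived log-probabilities is what would make the scan informative about which substitutions the model regards as plausible in context.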

Texts
349407.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.
