Learning the protein language: Evolution, structure, and function

Cell Syst. 2021 Jun 16;12(6):654-669.e3. doi: 10.1016/j.cels.2021.05.017.

Abstract

Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, and we can evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology, suggesting new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
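
To make the two core operations in the abstract concrete (embedding a sequence into a vector representation, and scoring a variant's fitness from model likelihoods), the sketch below implements a toy transformer masked language model in PyTorch. It is an illustrative stand-in, not the model described in this paper: the architecture, hyperparameters, and the mutation_score log-odds heuristic are all placeholder assumptions for demonstration.

    # Illustrative sketch only (not the authors' model): a toy transformer
    # masked language model over the 20 amino acids, showing how such a model
    # (i) embeds a sequence into per-residue vectors and (ii) scores a point
    # mutation via the log-likelihood ratio at a masked position.
    import torch
    import torch.nn as nn

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    PAD, MASK = 20, 21                       # special token ids (assumed)
    VOCAB = len(AMINO_ACIDS) + 2
    AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    class ToyProteinLM(nn.Module):
        def __init__(self, d_model=64, nhead=4, nlayers=2):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
            self.pos = nn.Embedding(512, d_model)    # learned positions
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, nlayers)
            self.lm_head = nn.Linear(d_model, VOCAB)

        def forward(self, tokens):
            pos = torch.arange(tokens.size(1), device=tokens.device)
            h = self.encoder(self.embed(tokens) + self.pos(pos))
            return h, self.lm_head(h)        # per-residue states, logits

    def encode(seq):
        return torch.tensor([[AA_TO_ID[aa] for aa in seq]])

    def mutation_score(model, seq, pos, alt):
        """Log-odds of the variant vs. the wild-type residue at a masked
        position: a common masked-LM proxy for variant fitness effects."""
        tokens = encode(seq)
        tokens[0, pos] = MASK
        with torch.no_grad():
            _, logits = model(tokens)
        logp = logits[0, pos].log_softmax(-1)
        return (logp[AA_TO_ID[alt]] - logp[AA_TO_ID[seq[pos]]]).item()

    model = ToyProteinLM().eval()            # untrained; for illustration only
    seq = "MKTAYIAKQRQISFVKSHFSRQ"
    states, _ = model(encode(seq))
    embedding = states.mean(dim=1)           # pooled fixed-length representation
    print(embedding.shape)                   # torch.Size([1, 64])
    print(mutation_score(model, seq, pos=3, alt="W"))

In the transfer-learning setting the abstract describes, the pooled embedding from a pretrained model of this kind would serve as a fixed-length feature vector for a downstream property predictor trained on a smaller labeled dataset.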

Keywords: contact prediction; deep neural networks; inductive bias; language models; natural language processing; protein sequences; proteins; transfer learning; transmembrane region prediction.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Databases, Protein
  • Language*
  • Machine Learning
  • Proteins* / chemistry

Substances

  • Proteins