虎嗅

Do Proteins Show "Emergence" Too? Biohub Chief Scientist: The Next AlphaFold Is Here, Trained on 6.8 Billion Evolutionary Sequences to Become the Strongest Biological Language Model in the History of Protein Science

Original title (Chinese): 蛋白质也有"涌现"?Biohub首席科学家:下一个AlphaFold在这里,用68亿条进化序列,训练出蛋白质科学史上最强生物语言模型

Summary of Key Points

This news article focuses on what it calls the "ChatGPT moment" of protein science: ESM Cambrian (ESMC), developed by Alex Rives' team and described as the most powerful open-source protein foundation model to date, has been officially released for anyone to use. By incorporating metagenomic data, ESMC overcomes the diminishing returns seen in earlier model training and validates the "scaling law" for proteins, meaning that larger models trained on more data become significantly more capable. ESMC has made breakthroughs in antibody design, protein structure prediction, and the discovery of new gene editing systems. It is also connected to Biohub's $500 million "virtual cell" initiative, which aims to build models that predict cell behavior from a combination of AI and experimental data, ultimately contributing to disease treatment.

1. The Principle of "Larger Is Better" in Protein Science: The Application of the Scaling Law

The "scaling law" can be summarized as follows: the more parameters a model has and the more data it is trained on, the more capable it becomes, sometimes with qualitative leaps (much as GPT-4 improved on GPT-3). Alex believed this principle would hold for protein science as early as 2018. Proteins are chains of amino acids, and by learning to predict amino acids in a chain from the rest of the sequence, the model picks up information about the protein's structure and function. Unlike text sampled from a natural language model, a sequence sampled from a protein model may still be a valid protein rather than gibberish, because the rules for combining amino acids are fixed. The key point is that the context of an amino acid, meaning the residues that surround it, determines its structure and function, and the model can "understand" this by analyzing those contexts.
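
The claim that context carries information about a residue can be illustrated with a deliberately tiny sketch. It is not how ESMC works internally (ESMC is a large neural network); it only counts which residue appears between a given pair of neighbors and uses that to guess a hidden position. The sequences, function names, and the choice of a one-residue window are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Made-up toy sequences in one-letter amino acid codes; not real training data.
TOY_SEQUENCES = [
    "MKTAYIAKQR",
    "MKTAYLAKQR",
    "MKSAYIAKQK",
    "MRTAYIAKQR",
]


def build_context_counts(sequences):
    """Count which residue appears between each (left, right) neighbor pair."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(1, len(seq) - 1):
            counts[(seq[i - 1], seq[i + 1])][seq[i]] += 1
    return counts


def predict_hidden(counts, left, right):
    """Return the residue most often seen between `left` and `right`, or None."""
    seen = counts.get((left, right))
    if not seen:
        return None
    return seen.most_common(1)[0][0]


counts = build_context_counts(TOY_SEQUENCES)
# Hide the residue between 'A' and 'I' and guess it from its neighbors.
print(predict_hidden(counts, "A", "I"))  # prints Y
```

A real protein language model replaces the count table with learned parameters and uses the whole sequence as context, which is where model size and data volume start to matter.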

2. Metagenomic Data: A Counterintuitive Method to Break Through Bottlenecks

The previous version of ESM (ESM2) ran into diminishing returns: as the model grew larger and more computing power was spent, its performance improved more and more slowly. The solution was to add metagenomic data, an approach that runs counter to traditional biological research methods:

  • Traditional Research: Focused on specific issues (e.g., studying the function of a particular gene) with controlled experiments and repeated verifications.
  • Metagenomic Data: Samples from many environments (hydrothermal vents, Antarctic ice, the deep sea, the human gut) are pooled and sequenced directly. Any protein sequence is considered useful. This approach yields large, diverse datasets, but they are also messy (sequences of unknown origin, some of them incomplete). Even so, adding metagenomic data made ESMC's scaling curve more consistent, which indicates that the earlier bottleneck was insufficient data, not inadequate computing power.
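
One way to make "diminishing returns" versus a "consistent scaling curve" concrete is to fit a power law to loss as a function of compute. The sketch below assumes the form loss = a * compute^(-b), a common convention in the scaling-law literature rather than something the article specifies, and every number in it is synthetic. It compares a curve that follows the power law cleanly with one that carries an extra constant term, standing in for a model held back by limited data.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    variance = sum((lx - mean_x) ** 2 for lx in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic illustration only: compute budgets and losses are invented.
compute = [1e18, 1e19, 1e20, 1e21]
steady_loss = [5.0 * c ** -0.05 for c in compute]   # clean power law
stalled_loss = [loss + 0.5 for loss in steady_loss]  # extra floor slows progress

_, b_steady = fit_power_law(compute, steady_loss)
_, b_stalled = fit_power_law(compute, stalled_loss)
print(f"steady exponent:  {b_steady:.3f}")
print(f"stalled exponent: {b_stalled:.3f}")
```

The stalled curve yields a smaller fitted exponent: each tenfold increase in compute buys less improvement. In the article's account, adding metagenomic data moved ESMC from the second kind of curve toward the first.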

3. ESMC's Strengths: Outperforming AlphaFold in Antibody Design and Discovering New Gene Editing Systems

ESMC's achievements are numerous, with particularly notable advances in antibody design and structure/function discovery:

  • Antibody Design: Antibodies are crucial for disease treatment (about a quarter of new drugs are antibodies), but designing full-length antibodies has been challenging. Antibodies evolve toward diversity in order to combat many different viruses, so traditional methods that depend on finding similar sequences work poorly for them. ESMC avoids the need for multiple sequence alignments by searching directly over the protein features the model has learned, which captures the essential characteristics of antibodies and gives high success rates in identifying effective ones (e.g., scFv single-chain antibodies).
  • Structure and Function Discovery: ESMC has been used to build a map of 6.8 billion protein sequences and to predict the structures of 1.1 billion proteins. It has also learned known functional motifs on its own (e.g., "nucleophilic elbows") and has surfaced functionally related proteins that are only distantly related in evolutionary terms (e.g., new gene editing systems). These findings all come from the model's own learning process, without any manual input.
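
Searching over learned features instead of aligned sequences can be sketched as a nearest-neighbor lookup in embedding space. The example below is a minimal illustration under stated assumptions: the four-dimensional vectors and the protein names are hypothetical stand-ins for embeddings that would, in practice, come from a protein language model, and cosine similarity is one common choice of metric, not necessarily the one the team uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, library, k=2):
    """Return the k library names whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in library.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; real model embeddings have far more dimensions.
LIBRARY = {
    "protein_A": [0.9, 0.1, 0.0, 0.2],
    "protein_B": [0.8, 0.2, 0.1, 0.1],
    "protein_C": [0.0, 0.9, 0.8, 0.1],
    "protein_D": [0.1, 0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0, 0.1]
print(nearest(query, LIBRARY))  # prints ['protein_A', 'protein_B']
```

Because the comparison happens between vectors, two proteins can be retrieved as neighbors even when their sequences are too different to align, which is the situation described above for antibodies and distantly related proteins.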

4. From Proteins to Virtual Cells: What Biohub's $500 Million Initiative Aims To Achieve

The team's vision extends beyond proteins to "virtual cells": AI models that simulate cell behavior and predict the effects of new interventions (e.g., drugs):

  • Current Status: Existing virtual cell models can only fit existing data; they cannot predict new scenarios (e.g., how a new drug will affect cells).
  • Goal: A model that can predict the outcomes of experiments that have not yet been run, just as current models can predict protein structures. For example, given a new drug as input, the model would predict how cells respond.
  • $500 Million Initiative: $400 million will be used for internal data generation and technology development, and $100 million for external collaborations. The strategy is to "scale up biological experiments" by observing cells under various conditions (e.g., with different drugs, changing environments) to collect sufficient data to understand cellular behavior.

5. Future Challenges and the Need for Collaborative Use of ESMC

Despite ESMC's strength, there are still challenges:

  • Computing Power: Alex highlights that computing power is a significant and often overlooked bottleneck. Doubling the computing power would significantly improve ESMC's performance, provided the data volume grows along with it.
  • Data Potential: Approximately 100 billion protein sequences are available, and this is just the beginning; the point of diminishing returns has not yet been reached.
  • Community Call: ESMC is open-source under an MIT license, encouraging researchers worldwide to use it for their research. The team's goal is not to develop drugs but to create tools that can advance science and ultimately lead to cures.

In summary, this news article shows how AI is reshaping protein science, from predicting protein structures to designing drugs and simulating cellular behavior, with the potential to enable further medical advances. The open-source release of ESMC allows more people to take part in that work.