Summary of Key Points
AI has advanced far more slowly in biology than in programming. The fundamental bottleneck is not a lack of reasoning capability in large models but the outdated infrastructure for biological data. That infrastructure was designed for manual human operation. It resembles an old city laid out in the horse-drawn carriage era, poorly suited to modern cars (AI agents). Research by Anthropic suggests that the solution lies in building stable, user-friendly tools for these agents. For example, their collaboration with NCBI produced gget virus, a tool that significantly improves the accuracy and reliability of agents when retrieving biological data.
The Biggest Bottleneck for Biological AI Agents: Outdated Data Infrastructure
Picture the biological data infrastructure as an old city that was never planned for cars: its streets are narrow and full of turns, so modern cars (agents) cannot navigate it smoothly. There are three main issues:
1. Disorganized Formats and Diverse Databases: Biological data comes in many idiosyncratic file formats and is spread across different databases with no unified entry point.
2. Customized Tools: The tools used to manipulate this data are tailored for specific scenarios, making them unsuitable for general use by agents.
3. Lack of Clear Feedback: In software development, a changed line of code can be tested quickly to check whether it is correct. In biology, there is no clear “reward signal” telling an agent that it has performed correctly (for example, it is difficult to verify the accuracy of a data retrieval immediately).
In contrast, the infrastructure in software development is designed for “cars,” with standardized APIs (like clear lanes) and version control (like traffic rules), allowing agents to function smoothly.
Karpathy’s Criticism: Biology and Web Development Face the Same Problem
A few months ago, AI expert Karpathy mentioned that writing the code for a web application was relatively easy, but setting up things like authentication and payments required so much clicking in the browser that it took him a week. He complained, “The code is the easiest part; the troublesome part is the clicking.”
Biology has exactly the same problem: the tools for handling biological data are designed for manual human operation. For instance, virologists searching a database for virus sequences have to select dozens of criteria by hand, a click-driven workflow that agents struggle to automate.
The “Click Tax” in Virology: A Problem Both Humans and Agents Find Troublesome
Take the Ebola outbreak as an example. When Ebola emerged in Congo, scientists needed to quickly compare new strains with historical data to determine the effectiveness of existing diagnostics and treatments. This process involved manually selecting dozens of criteria in the NCBI Virus database (such as the host being human, the sampling location being in Africa, and the sequence length being sufficient). It was not only tedious but also prone to errors.
This need for manual clicking is like paying a “click tax” for scientific research—both humans and agents find it frustrating. Agents struggle with web pages containing dropdown menus and checkboxes and cannot remember all the selection criteria.
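To make the contrast with clicking concrete, the criteria from the Ebola example can be written down once as data and applied the same way every time. The following is only a minimal sketch: the `SequenceFilter` class, its field names, the sample records, and the 15,000-base length threshold are all assumptions made for illustration, and none of them reflect the NCBI Virus interface or any real tool's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SequenceFilter:
    """Hypothetical filter spec; field names are illustrative, not a real API."""
    virus: str
    host: Optional[str] = None
    region: Optional[str] = None
    min_length: int = 0

    def matches(self, record: dict) -> bool:
        """Return True when a metadata record satisfies every criterion that is set."""
        if record["virus"] != self.virus:
            return False
        if self.host is not None and record["host"] != self.host:
            return False
        if self.region is not None and record["region"] != self.region:
            return False
        return record["length"] >= self.min_length


# Made-up metadata records standing in for a database export.
RECORDS = [
    {"id": "rec1", "virus": "Ebola", "host": "Homo sapiens", "region": "Africa", "length": 18950},
    {"id": "rec2", "virus": "Ebola", "host": "Homo sapiens", "region": "Africa", "length": 900},
    {"id": "rec3", "virus": "Ebola", "host": "Pteropodidae", "region": "Africa", "length": 18900},
    {"id": "rec4", "virus": "Ebola", "host": "Homo sapiens", "region": "Europe", "length": 18940},
]

# The threshold below is an assumed stand-in for "sequence length being sufficient".
ebola_query = SequenceFilter(virus="Ebola", host="Homo sapiens", region="Africa", min_length=15000)
selected = [r["id"] for r in RECORDS if ebola_query.matches(r)]
print(selected)
```

Because the query is a plain value rather than a sequence of clicks, it can be logged, reviewed, and rerun unchanged, which is the property a dropdown-driven web page does not offer.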
How Unreliable Are Agent-Based Data Retrievals?
The Anthropic team conducted a test (VirBench) where multiple AI models (such as GPT-5.5 and Claude Opus) were tasked with retrieving 120 virus sequences. The results showed:
1. Low Accuracy: Even the best model had an accuracy of only 91.3%, while the worst was 16.9%.
2. Inconsistent Results: Running the same model three times yielded vastly different results. For example, when searching for Ebola sequences, the reference answer contained 266 entries, but Claude Sonnet returned 106, 15, and 5 entries on its three attempts.
3. Misleading Conclusions: Using incorrect data can lead to absurd conclusions—for instance, misdating the common ancestor of a virus to 1922 or incorrectly determining the effectiveness of antibody treatments.
The root of these problems is that agents lack a reliable path to the data and can only operate by guessing, which produces results that look plausible but are actually wrong.
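One way to put numbers on "low accuracy" and "inconsistent results" is to compare each returned set against a reference set and against the other runs. The sketch below is an assumption about how such a check could look, not a description of how VirBench scores models. The toy ID sets are made up; only the counts 266, 106, 15, and 5 come from the Ebola example above.

```python
from typing import Set, Tuple


def precision_recall(returned: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Fraction of returned IDs that are correct, and fraction of reference IDs found."""
    hits = len(returned & reference)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two runs: 1.0 means identical results, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_case_recall(n_returned: int, n_reference: int) -> float:
    """Upper bound on recall, reached only if every returned entry is correct."""
    return min(n_returned, n_reference) / n_reference


# Toy example with made-up IDs.
reference = {"A", "B", "C", "D"}
run_1 = {"A", "B", "X"}
run_2 = {"A"}
print(precision_recall(run_1, reference))
print(jaccard(run_1, run_2))

# Counts taken from the Ebola example in the text.
REFERENCE_COUNT = 266
RUN_COUNTS = [106, 15, 5]
bounds = [round(best_case_recall(n, REFERENCE_COUNT), 3) for n in RUN_COUNTS]
print(bounds)  # [0.398, 0.056, 0.019]
```

Even under the most generous assumption that every returned entry was correct, the three runs could have recovered at most about 40%, 6%, and 2% of the reference set.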
The Solution: Providing a Stable Interface for Data
Anthropic and NCBI collaborated to develop gget virus, a tool that provides a stable interface that agents can directly use. It can:
- Coordinate APIs from multiple databases and automatically handle selection criteria.
- Produce standardized results with detailed logs for easy verification.
- Handle batch retrieval and paginated results so that no entries are silently dropped.
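The pagination and logging points can be illustrated with a small sketch. This is not the gget virus implementation or its API: `fetch_all`, `make_fake_backend`, and the token scheme are hypothetical names invented here, and the backend is an in-memory stand-in rather than a real database call. The idea shown is simply that a wrapper follows every page, logs what it received, and fails loudly instead of returning a partial result.

```python
import logging
from typing import Callable, List, Optional, Tuple

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("retrieval")

# (IDs on this page, token for the next page or None when finished)
Page = Tuple[List[str], Optional[str]]


def fetch_all(fetch_page: Callable[[Optional[str]], Page], max_pages: int = 1000) -> List[str]:
    """Follow page tokens until the backend reports that no pages remain."""
    ids: List[str] = []
    token: Optional[str] = None
    for page_number in range(1, max_pages + 1):
        page_ids, token = fetch_page(token)
        ids.extend(page_ids)
        log.info("page %d: %d ids (running total %d)", page_number, len(page_ids), len(ids))
        if token is None:
            return ids
    # Refuse to hand back a truncated list that looks complete.
    raise RuntimeError(f"gave up after {max_pages} pages; result would be incomplete")


def make_fake_backend(all_ids: List[str], page_size: int) -> Callable[[Optional[str]], Page]:
    """In-memory stand-in for a paginated database API (hypothetical)."""
    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(all_ids) else None
        return all_ids[start:end], next_token
    return fetch_page


# 266 synthetic IDs, matching the size of the Ebola reference set in the text.
all_ids = [f"seq{i:03d}" for i in range(266)]
result = fetch_all(make_fake_backend(all_ids, page_size=100))
print(len(result))  # 266
```

An agent that reads only the first page of such a backend would return 100 entries and have no signal that 166 were missing, which is one plausible way a retrieval ends up far short of the reference count.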
The effects were immediate: the accuracy of all agents increased to over 90% (with GPT-5.5 reaching 99.7%), and the results became more consistent when run repeatedly.
The authors emphasize that scientific agents need a “boring but reliable” foundation. While models can be creative (e.g., generating hypotheses), the underlying data access and retrieval logic must be stable. Even as models become more advanced in the future, such reliable infrastructure will remain essential; otherwise, agents would have to constantly navigate an unreliable “labyrinth,” which is both time-consuming and costly.
Conclusion
For AI to help humans solve biological problems, it’s not enough to rely solely on large models. The biological data infrastructure must first be transformed to fit the needs of agents. gget virus is just the first step; more such tools are needed to make AI a reliable assistant for scientists.