🤖 AI Summary
The capability of large language models (LLMs) to troubleshoot high-consequence virology laboratory protocols, and what that capability implies for dual-use technology governance, has not been systematically assessed. Method: We introduce the Virology Capabilities Test (VCT), a multimodal question-answering benchmark for virology comprising 322 expert-crafted questions that span fundamental, tacit, and visual knowledge; we further propose a multimodal LLM evaluation framework integrating text, flowchart, and experimental image understanding. Contribution/Results: This work quantifies LLM performance on dual-use virology troubleshooting: OpenAI's o3 reaches 43.8% accuracy, compared with an average of 22.1% for expert virologists with internet access answering questions in their own sub-areas, and outperforms 94% of those experts. These findings provide empirical evidence for addressing expert-level LLM troubleshooting of dual-use virology work within existing frameworks for handling dual-use technologies in the life sciences.
📝 Abstract
We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of $322$ multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of $22.1\%$ on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches $43.8\%$ accuracy, outperforming $94\%$ of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.