🤖 AI Summary
This study systematically evaluates whether large language models (LLMs) can substantially enhance the performance of biological novices on high-stakes, dual-use computational biology tasks. Through a multi-model, multi-benchmark human uplift study, the authors compare novices’ accuracy on eight biosecurity-relevant task sets with and without LLM assistance, benchmarking against both internet-assisted experts and standalone model performance. LLM support increased novices’ accuracy by a factor of 4.16 relative to internet-only controls, and LLM-assisted novices surpassed internet-assisted experts on three of the four benchmarks with available expert baselines. Notably, 89.6% of participants reported little difficulty obtaining sensitive dual-use information despite safeguards, highlighting both the transformative potential and the safety risks of human-LLM collaboration. The work offers a quantitative assessment of LLMs’ real-world uplift effect for non-experts in high-risk biological contexts.
📝 Abstract
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
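To make the headline "4.16 times more accurate (95% CI [2.63, 6.87])" figure concrete, the sketch below shows one simple way such an uplift ratio and confidence interval could be estimated: the ratio of mean accuracies between an LLM-assisted group and an internet-only control group, with a percentile bootstrap CI. This is an illustrative assumption, not the paper's actual analysis, and all participant accuracies in the example are hypothetical.

```python
# Illustrative sketch only: estimating an accuracy "uplift" ratio between an
# LLM-assisted group and an internet-only control group, with a percentile
# bootstrap 95% CI. The data below are hypothetical, not from the study.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-participant accuracies (fraction of tasks solved correctly).
llm_group = np.array([0.55, 0.62, 0.48, 0.70, 0.58, 0.64])      # LLM-assisted novices
control_group = np.array([0.12, 0.18, 0.10, 0.20, 0.15, 0.14])  # internet-only novices

def uplift_ratio(treated: np.ndarray, control: np.ndarray) -> float:
    """Ratio of mean accuracies; values > 1 mean the treated group did better."""
    return treated.mean() / control.mean()

point_estimate = uplift_ratio(llm_group, control_group)

# Percentile bootstrap over participants (one simple choice among many).
boot = []
for _ in range(10_000):
    t = rng.choice(llm_group, size=llm_group.size, replace=True)
    c = rng.choice(control_group, size=control_group.size, replace=True)
    boot.append(uplift_ratio(t, c))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"uplift ratio = {point_estimate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The paper may well use a different estimator (e.g., a regression-based or task-level model); the point of the sketch is only to show how a ratio-of-accuracies uplift statistic and its uncertainty can be derived from per-participant results.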