🤖 AI Summary
This study systematically evaluates whether large language models (LLMs) can substantially enhance the performance of biological novices on high-stakes, dual-use computational biology tasks. Through a multi-model, multi-benchmark human uplift study, the authors compare novices’ accuracy on eight biosecurity-relevant task sets with and without LLM assistance, benchmarking against both internet-assisted experts and standalone model performance. LLM support increased novices’ accuracy by a factor of 4.16 relative to internet-only controls, and LLM-assisted novices surpassed internet-assisted experts on three of the four benchmarks with available expert baselines. Notably, 89.6% of participants reported little difficulty obtaining sensitive dual-use information despite safeguards, highlighting both the transformative potential and the safety risks of human-LLM collaboration. The work offers a quantitative assessment of LLMs’ real-world uplift effect for non-experts in high-risk biological contexts.
📝 Abstract
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
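To make the headline "4.16 times more accurate (95% CI [2.63, 6.87])" figure concrete, the sketch below shows one simple way such an uplift ratio and confidence interval could be estimated: the ratio of mean accuracies between an LLM-assisted group and an internet-only control group, with a percentile bootstrap CI. This is an illustrative assumption, not the paper's actual analysis, and all participant accuracies in the example are hypothetical.

```python
# Illustrative sketch only: estimating an accuracy "uplift" ratio between an
# LLM-assisted group and an internet-only control group, with a percentile
# bootstrap 95% CI. The data below are hypothetical, not from the study.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-participant accuracies (fraction of tasks solved correctly).
llm_group = np.array([0.55, 0.62, 0.48, 0.70, 0.58, 0.64])      # LLM-assisted novices
control_group = np.array([0.12, 0.18, 0.10, 0.20, 0.15, 0.14])  # internet-only novices

def uplift_ratio(treated: np.ndarray, control: np.ndarray) -> float:
    """Ratio of mean accuracies; values > 1 mean the treated group did better."""
    return treated.mean() / control.mean()

point_estimate = uplift_ratio(llm_group, control_group)

# Percentile bootstrap over participants (one simple choice among many).
boot = []
for _ in range(10_000):
    t = rng.choice(llm_group, size=llm_group.size, replace=True)
    c = rng.choice(control_group, size=control_group.size, replace=True)
    boot.append(uplift_ratio(t, c))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"uplift ratio = {point_estimate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The paper may well use a different estimator (e.g., a regression-based or task-level model); the point of the sketch is only to show how a ratio-of-accuracies uplift statistic and its uncertainty can be derived from per-participant results.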