🤖 AI Summary
This study addresses the absence of standardized benchmarks for evaluating the biological understanding and biosafety risks of nucleotide foundation models (NFMs) in viral genomics. The authors propose ViroBench, the first large-scale benchmark specifically designed for viral genomes. It establishes an evaluation framework that covers both biological understanding and biosecurity risk across 4 task types and 18 scenarios, and it is used to assess 66 models. Through extensive experiments, including multi-architecture comparisons, ablation studies, cross-clade and temporal generalization tests, and functional validation of generated sequences, the work shows that current models generalize poorly out of distribution and that statistical likelihood is decoupled from biological function in generated sequences. Notably, taxonomic diversity in pretraining data proves more important than model scale: a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. All data and code are publicly released to support reproducible research.
📝 Abstract
Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models along two critical dimensions, biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three main conclusions. First, NFMs exhibit performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Second, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Third, controlled ablation studies show that taxonomic diversity in pretraining data outweighs parameter scale: a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.
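To make "phylogenetic and temporal shifts" concrete, the following is a minimal sketch of how held-out evaluation splits of this kind are commonly built. It is not taken from the ViroBench code: the `Record` type, the function names `temporal_split` and `clade_holdout_split`, the cutoff year, and the toy records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata for one viral genome (illustrative only)."""
    accession: str
    clade: str
    year: int


def temporal_split(records, cutoff_year):
    """Train on genomes collected before cutoff_year, test on the rest."""
    train = [r for r in records if r.year < cutoff_year]
    test = [r for r in records if r.year >= cutoff_year]
    return train, test


def clade_holdout_split(records, held_out_clades):
    """Train on all clades except the held-out ones, test on the held-out ones."""
    held = set(held_out_clades)
    train = [r for r in records if r.clade not in held]
    test = [r for r in records if r.clade in held]
    return train, test


# Toy placeholder records, not real accessions or real clades.
records = [
    Record("toy_001", "cladeA", 2015),
    Record("toy_002", "cladeA", 2021),
    Record("toy_003", "cladeB", 2018),
    Record("toy_004", "cladeC", 2022),
]

train_t, test_t = temporal_split(records, cutoff_year=2020)
train_c, test_c = clade_holdout_split(records, held_out_clades={"cladeC"})
```

In a setup like this, a model that scores well on a random split but poorly on `test_t` or `test_c` is showing the kind of weak extrapolation the abstract describes.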