🤖 AI Summary
This study addresses the lack of systematic evaluation benchmarks for cross-scale modeling in biomolecular systems. The authors propose BioMol-LLM-Bench, the first comprehensive benchmark encompassing 26 tasks across four difficulty levels, integrated with external computational tools, to rigorously evaluate 13 prominent large language models. Their analysis reveals that chain-of-thought reasoning yields limited benefits, while hybrid Mamba-Attention architectures demonstrate superior performance in long-sequence modeling. Although supervised fine-tuning enhances domain-specific capabilities, it concurrently compromises generalization. The findings indicate that current models perform adequately on classification tasks but remain notably deficient in handling complex regression problems.
📝 Abstract
Modeling biomolecular systems across molecular scales remains a central challenge in scientific research. Large language models (LLMs) are increasingly applied to biomolecular discovery, yet systematic evaluation across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities remain limited. We reveal a systematic gap between LLM performance and mechanistic understanding through a proposed cross-scale biomolecular benchmark, BioMol-LLM-Bench: a unified framework comprising 26 downstream tasks that span 4 distinct difficulty levels, with external computational tools integrated for a more comprehensive evaluation. Evaluation of 13 representative models reveals 4 main findings: chain-of-thought data provides limited benefit and may even reduce performance on biological tasks; hybrid Mamba-Attention architectures are more effective for long biomolecular sequences; supervised fine-tuning improves specialization at the cost of generalization; and current LLMs perform well on classification tasks but remain weak on challenging regression tasks. Together, these findings provide practical guidance for future LLM-based modeling of molecular systems.