🤖 AI Summary
This study addresses the lack of systematic evaluation benchmarks for cross-scale modeling in biomolecular systems. The authors propose BioMol-LLM-Bench, the first comprehensive benchmark encompassing 26 tasks across four difficulty levels, integrated with external computational tools, to rigorously evaluate 13 prominent large language models. Their analysis reveals that chain-of-thought reasoning yields limited benefits, while hybrid Mamba-Attention architectures demonstrate superior performance in long-sequence modeling. Although supervised fine-tuning enhances domain-specific capabilities, it concurrently compromises generalization. The findings indicate that current models perform adequately on classification tasks but remain notably deficient in handling complex regression problems.
📝 Abstract
Modeling biomolecular systems across molecular scales remains a central challenge in scientific research. Large language models (LLMs) are increasingly applied to biomolecular discovery, yet systematic evaluation across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities remain limited. We reveal a systematic gap between LLM performance and mechanistic understanding through a proposed cross-scale biomolecular benchmark, BioMol-LLM-Bench: a unified framework comprising 26 downstream tasks that span 4 distinct difficulty levels, with external computational tools integrated for a more comprehensive evaluation. Evaluation of 13 representative models reveals 4 main findings: chain-of-thought data provides limited benefit and may even reduce performance on biological tasks; hybrid Mamba-Attention architectures are more effective for long biomolecular sequences; supervised fine-tuning improves specialization at the cost of generalization; and current LLMs perform well on classification tasks but remain weak on challenging regression tasks. Together, these findings provide practical guidance for future LLM-based modeling of molecular systems.