A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
The development of scientific large language models (Sci-LLMs) is hindered by the multimodality, cross-scale nature, and domain specificity of scientific data, necessitating a systematic framework for their trustworthy evolution. Method: We propose a data-centric paradigm for Sci-LLM development: (i) constructing a hierarchical scientific knowledge model and a unified scientific data taxonomy; (ii) shifting evaluation from static benchmarking to process-oriented assessment; (iii) designing a technical pathway integrating multimodal processing, domain-adaptive representation learning, and cross-scale modeling; and (iv) introducing a closed-loop mechanism that couples semi-automatic annotation with expert validation. Contribution/Results: We systematically catalog 270+ training datasets and 190+ benchmarks across disciplines, mapping the application landscape in each domain. We further propose, for the first time, an evolutionary roadmap toward closed-loop autonomous scientific discovery systems, establishing both theoretical foundations and practical guidelines for building reliable, continuously evolving AI research partners.

📝 Abstract
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
Problem

Research questions and friction points this paper is trying to address.

Surveying scientific large language models' data foundations and agent applications
Addressing multimodal, cross-scale challenges in scientific corpora
Developing trustworthy AI systems for accelerating scientific discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-centric synthesis for model-data co-evolution
Unified taxonomy addressing multimodal scientific data challenges
Closed-loop autonomous agent systems for discovery
Authors

Ming Hu (Shanghai Artificial Intelligence Laboratory)
Chenglong Ma (Fudan University; Shanghai Innovation Institute): multi-modal models, generative models, medical image analysis
Wei Li (Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University)
Wanghan Xu (Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University)
Jiamin Wu (Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong)
Jucheng Hu (University College London): Efficient Machine Learning, AI4Science, LLM, MLLM, VLM
Tianbin Li (Shanghai Artificial Intelligence Laboratory): Machine Learning, Computer Vision, General Intelligence
Guohang Zhuang (Shanghai Artificial Intelligence Laboratory)
Jiaqi Liu (Shanghai Artificial Intelligence Laboratory, UNC-Chapel Hill)
Yingzhou Lu (Stanford University)
Ying Chen (Shanghai Artificial Intelligence Laboratory)
Chaoyang Zhang (Shanghai Artificial Intelligence Laboratory)
Cheng Tan (Shanghai Artificial Intelligence Laboratory)
Jie Ying (Shanghai Artificial Intelligence Laboratory)
Guocheng Wu (Shanghai Artificial Intelligence Laboratory)
Shujian Gao (Shanghai Artificial Intelligence Laboratory)
Pengcheng Chen (Shanghai Artificial Intelligence Laboratory)
Jiashi Lin (Shanghai Artificial Intelligence Laboratory)
Haitao Wu (Shanghai Artificial Intelligence Laboratory)
Lulu Chen (Virginia Tech): Machine Learning, Data Mining, Bioinformatics
Fengxiang Wang (National University of Defense Technology): Computer Vision, Remote Sensing
Yuanyuan Zhang (Purdue University)
Xiangyu Zhao (Shanghai Artificial Intelligence Laboratory)
Feilong Tang (Shanghai Artificial Intelligence Laboratory, Monash University)
Encheng Su (Technical University of Munich): medical image, LLM, deep learning