🤖 AI Summary
The development of scientific large language models (Sci-LLMs) is hindered by the multimodality, cross-scale nature, and domain specificity of scientific data, necessitating a systematic framework for their trustworthy evolution.
Method: We propose a data-centric paradigm for Sci-LLM development: (i) constructing a hierarchical model of scientific knowledge and a unified taxonomy of scientific data; (ii) shifting evaluation from static benchmarking to process-oriented assessment; (iii) designing a technical pathway that integrates multimodal processing, domain-adaptive representation learning, and cross-scale modeling; and (iv) introducing a closed-loop mechanism that couples semi-automatic annotation with expert validation.
Contribution/Results: We systematically catalog 270+ training datasets and 190+ benchmarks across disciplines, mapping the application landscape in each domain. We further propose, for the first time, an evolutionary roadmap toward closed-loop autonomous scientific discovery systems, establishing both theoretical foundations and practical guidelines for building reliable, continually evolving AI research partners.
📝 Abstract
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct data demands: heterogeneous, multi-scale, uncertainty-laden corpora that require representations which preserve domain invariance and enable cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and point to emerging solutions, including semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems in which autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as true partners in accelerating scientific discovery.