🤖 AI Summary
The development of scientific large language models (Sci-LLMs) is hindered by the multimodality, cross-scale nature, and domain specificity of scientific data, necessitating a systematic framework for their trustworthy evolution.
Method: We propose a data-centric paradigm for Sci-LLM development: (i) constructing a hierarchical model of scientific knowledge and a unified taxonomy of scientific data; (ii) shifting evaluation from static benchmarking to process-oriented assessment; (iii) designing a technical pathway that integrates multimodal processing, domain-adaptive representation learning, and cross-scale modeling; and (iv) introducing a closed-loop mechanism that couples semi-automatic annotation with expert validation.
Contribution/Results: We systematically catalog 270+ training datasets and 190+ benchmarks across disciplines, mapping the application landscape in each domain. We further propose, for the first time, an evolutionary roadmap toward closed-loop autonomous scientific discovery systems, establishing both theoretical foundations and practical guidelines for building reliable, continually evolving AI research partners.
📝 Abstract
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct data demands: heterogeneous, multi-scale, uncertainty-laden corpora that require representations which preserve domain invariance and enable cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and point to emerging solutions, including semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems in which autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as true partners in accelerating scientific discovery.