Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
A rigorous, operational definition and a systematic evaluation framework for Scientific General Intelligence (SGI) remain lacking. Method: We propose an operational definition of SGI and introduce SGI-Bench, the first multimodal benchmark covering the full scientific discovery pipeline. It features four scientist-aligned tasks grounded in the Practical Inquiry Model (PIM): deep inquiry, creative hypothesis generation, dry/wet experimental execution, and cross-disciplinary reasoning. Our methodology integrates retrieval-augmented generation (RAG), multi-stage task modeling, expert annotation, execution-based validation, and a novel test-time reinforcement learning (TTRL) paradigm that enhances hypothesis novelty in an unsupervised manner. Results: Experiments reveal severe bottlenecks in current LLMs, particularly in deep inquiry (10-20% accuracy) and wet-lab experimental fidelity. TTRL boosts hypothesis novelty by 37% without requiring ground-truth references. This work establishes foundational benchmarks, definitions, and methodologies for evaluating and advancing SGI capabilities.

📝 Abstract
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal substantial gaps: low exact match (10-20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution-result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
Problem

Research questions and friction points this paper is trying to address.

Defining and evaluating Scientific General Intelligence in AI
Assessing LLMs' performance in scientist-aligned workflows
Addressing gaps in deep research, idea generation, and experimental reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Practical Inquiry Model operationalizes scientific workflows
SGI-Bench enables cross-disciplinary evaluation of LLMs
Test-Time Reinforcement Learning enhances hypothesis novelty
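The retrieval-augmented novelty reward behind the last point can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names are hypothetical, a toy bag-of-words embedding stands in for a real retrieval encoder, and candidate selection is shown as simple reranking, whereas the paper applies reinforcement-learning updates at inference time.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real retrieval encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_reward(hypothesis, retrieved):
    """Score a hypothesis by its distance from retrieved prior work:
    1 minus its maximum similarity to any retrieved snippet, so hypotheses
    that echo the literature score near 0 and unseen ideas score near 1."""
    h = embed(hypothesis)
    return 1.0 - max((cosine(h, embed(d)) for d in retrieved), default=0.0)

# Illustrative stand-ins for snippets returned by a retrieval step.
prior_work = [
    "graphene anodes improve lithium battery capacity",
    "deep learning predicts protein folding structures",
]

# Candidate hypotheses sampled at inference time; here we merely rerank
# by reward instead of updating the policy as TTRL does.
candidates = [
    "graphene anodes improve lithium battery capacity",
    "engineered microbial consortia treat wastewater while generating power",
]
best = max(candidates, key=lambda c: novelty_reward(c, prior_work))
print(best)
```

Because the reward is computed purely against retrieved text, no ground-truth reference answer is needed, which is what allows the optimization to run unsupervised at test time.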
Wanghan Xu
Shanghai Artificial Intelligence Laboratory
Yuhao Zhou
Shanghai Artificial Intelligence Laboratory
Yifan Zhou
Shanghai Artificial Intelligence Laboratory
Qinglong Cao
Shanghai Artificial Intelligence Laboratory
Shuo Li
Shanghai Artificial Intelligence Laboratory
Jia Bu
Shanghai Artificial Intelligence Laboratory
Bo Liu
Shanghai Artificial Intelligence Laboratory
Yixin Chen
Shanghai Artificial Intelligence Laboratory
Xuming He
Shanghai Artificial Intelligence Laboratory
Xiangyu Zhao
Shanghai Artificial Intelligence Laboratory
Xiang Zhuang
Ph.D. student, Zhejiang University
Fengxiang Wang
National University of Defense Technology
Computer Vision, Remote Sensing
Zhiwang Zhou
Shanghai Artificial Intelligence Laboratory
Qiantai Feng
Shanghai Artificial Intelligence Laboratory
Wenxuan Huang
CUHK & ECNU
Artificial General Intelligence, MLLM, LLM, AIGC, Model Acceleration
Jiaqi Wei
Ph.D. student, Zhejiang University
NLP, LLM, AI for Science
Hao Wu
Shanghai Artificial Intelligence Laboratory
Yuejin Yang
Shanghai Artificial Intelligence Laboratory
Guangshuai Wang
Shanghai Artificial Intelligence Laboratory
Sheng Xu
Shanghai Artificial Intelligence Laboratory
Ziyan Huang
Shanghai Artificial Intelligence Laboratory
Xinyao Liu
University of Science and Technology of China
Computer Vision, Large Language Model
Jiyao Liu
Shanghai Artificial Intelligence Laboratory
Cheng Tang
Shanghai Artificial Intelligence Laboratory
Wei Li
Shanghai Artificial Intelligence Laboratory