🤖 AI Summary
A rigorous, operational definition and a systematic evaluation framework for Scientific General Intelligence (SGI) remain lacking. Method: We propose an operational definition of SGI and introduce SGI-Bench, the first multimodal benchmark covering the full scientific discovery pipeline, featuring four scientist-aligned tasks grounded in the Practical Inquiry Model (PIM): deep research, idea generation, dry/wet experiments, and experimental reasoning. Our methodology integrates retrieval-augmented generation (RAG), multi-stage task modeling, expert annotation, execution-based validation, and a novel test-time reinforcement learning (TTRL) paradigm that enhances hypothesis novelty in an unsupervised manner. Results: Experiments reveal severe bottlenecks in current LLMs, particularly in deep research (10-20% exact-match accuracy) and wet-lab protocol fidelity. TTRL boosts hypothesis novelty by 37% without requiring ground-truth references. This work establishes foundational definitions, benchmarks, and methodologies for evaluating and advancing SGI capabilities.
📝 Abstract
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal clear gaps: low exact-match accuracy (10-20%) in deep research despite step-level alignment; generated ideas lacking feasibility and detail; high code executability but low execution-result accuracy in dry experiments; low sequence fidelity in wet-lab protocols; and persistent challenges in multimodal comparative reasoning. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
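The abstract does not spell out how a retrieval-augmented novelty reward might be computed at inference time, so the following is only a minimal, self-contained sketch of the general idea, not the paper's actual TTRL implementation. All function names (`bow_cosine`, `novelty_reward`, `select_most_novel`) are hypothetical, and the bag-of-words similarity stands in for whatever retrieval and embedding model the authors use: each candidate hypothesis is scored by how dissimilar it is to its closest match in a retrieved corpus of prior work, and the test-time step keeps the highest-scoring candidate.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts
    (a crude stand-in for a learned embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_reward(hypothesis: str, corpus: list[str]) -> float:
    """Reward = 1 minus similarity to the closest retrieved document,
    so hypotheses overlapping least with prior work score highest."""
    return 1.0 - max(bow_cosine(hypothesis, doc) for doc in corpus)

def select_most_novel(candidates: list[str], corpus: list[str]) -> str:
    """Test-time selection step: keep the candidate with the best
    novelty reward; no ground-truth reference answer is needed."""
    return max(candidates, key=lambda h: novelty_reward(h, corpus))

# Toy illustration with an invented corpus and candidates.
prior_work = ["graphene improves battery anode capacity"]
candidates = [
    "graphene improves battery anode capacity slightly",
    "engineered microbial consortia fix nitrogen in saline soils",
]
best = select_most_novel(candidates, prior_work)
```

In a full TTRL loop this reward would presumably update the generator's policy rather than merely rank candidates, but the reference-free reward signal shown here is the core ingredient the abstract describes.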