🤖 AI Summary
A rigorous, operational definition and a systematic evaluation framework for Scientific General Intelligence (SGI) remain lacking. Method: We propose an operational definition of SGI and introduce SGI-Bench, the first multimodal benchmark covering the full scientific discovery pipeline, featuring four scientist-aligned tasks grounded in the Practical Inquiry Model (PIM): deep research, idea generation, dry/wet experiments, and experimental reasoning. Our methodology integrates retrieval-augmented generation (RAG), multi-stage task modeling, expert annotation, execution-based validation, and a novel test-time reinforcement learning (TTRL) paradigm that enhances hypothesis novelty in an unsupervised manner. Results: Experiments reveal severe bottlenecks in current LLMs, particularly in deep research (10-20% exact-match accuracy) and wet-lab protocol fidelity. TTRL boosts hypothesis novelty by 37% without requiring ground-truth references. This work establishes foundational definitions, benchmarks, and methodologies for evaluating and advancing SGI capabilities.
📝 Abstract
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal clear gaps: low exact-match accuracy (10-20%) in deep research despite step-level alignment; generated ideas lacking feasibility and detail; high code executability but low execution-result accuracy in dry experiments; low sequence fidelity in wet-lab protocols; and persistent challenges in multimodal comparative reasoning. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
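The abstract does not spell out how a retrieval-augmented novelty reward might be computed at inference time, so the following is only a minimal, self-contained sketch of the general idea, not the paper's actual TTRL implementation. All function names (`bow_cosine`, `novelty_reward`, `select_most_novel`) are hypothetical, and the bag-of-words similarity stands in for whatever retrieval and embedding model the authors use: each candidate hypothesis is scored by how dissimilar it is to its closest match in a retrieved corpus of prior work, and the test-time step keeps the highest-scoring candidate.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts
    (a crude stand-in for a learned embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_reward(hypothesis: str, corpus: list[str]) -> float:
    """Reward = 1 minus similarity to the closest retrieved document,
    so hypotheses overlapping least with prior work score highest."""
    return 1.0 - max(bow_cosine(hypothesis, doc) for doc in corpus)

def select_most_novel(candidates: list[str], corpus: list[str]) -> str:
    """Test-time selection step: keep the candidate with the best
    novelty reward; no ground-truth reference answer is needed."""
    return max(candidates, key=lambda h: novelty_reward(h, corpus))

# Toy illustration with an invented corpus and candidates.
prior_work = ["graphene improves battery anode capacity"]
candidates = [
    "graphene improves battery anode capacity slightly",
    "engineered microbial consortia fix nitrogen in saline soils",
]
best = select_most_novel(candidates, prior_work)
```

In a full TTRL loop this reward would presumably update the generator's policy rather than merely rank candidates, but the reference-free reward signal shown here is the core ingredient the abstract describes.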