Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
A standardized benchmark for evaluating the scientific-discovery capabilities of large language models (LLMs) has been absent. Method: This paper introduces Auto-Bench, the first automated, science-oriented evaluation benchmark, which formalizes scientific discovery as an interactive causal-structure exploration task. It proposes a dynamic, intervention-based evaluation paradigm grounded in oracle feedback and constructs realistic simulation environments spanning chemistry and the social sciences. Auto-Bench integrates causal graph modeling, multi-turn oracle querying, and reinforcement-based reasoning to systematically assess leading LLMs, including GPT-4, Gemini, and Qwen. Contribution/Results: Experiments reveal that state-of-the-art LLMs significantly underperform human scientists in hypothesis-driven exploration, causal intervention reasoning, and iterative knowledge updating, highlighting a fundamental gap in their ability to close the knowledge loop and emulate human scientific practice.
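
The interaction paradigm is easiest to picture as a query loop: the agent picks a variable to intervene on, the oracle reports what changes under a hidden causal graph, and the agent revises its hypothesized structure. Below is a minimal sketch of such a loop; the Oracle class, the discover strategy, and all names are illustrative assumptions, not Auto-Bench's actual interface.

```python
# Minimal sketch of an intervention-driven causal-discovery loop.
# The Oracle hides a ground-truth DAG; the agent queries it by forcing
# one variable at a time (a do-intervention) and observing which other
# variables respond. Everything here is illustrative, not Auto-Bench's API.

class Oracle:
    def __init__(self, edges, variables):
        self.edges = set(edges)            # hidden ground truth: (cause, effect) pairs
        self.variables = list(variables)

    def intervene(self, var):
        """Return every variable that responds when `var` is forced (do(var))."""
        affected, frontier = set(), [var]
        while frontier:
            node = frontier.pop()
            for cause, effect in self.edges:
                if cause == node and effect not in affected:
                    affected.add(effect)
                    frontier.append(effect)
        return affected

def discover(oracle, budget):
    """Spend one intervention per variable and record which variables respond."""
    hypothesis = set()
    for var in oracle.variables[:budget]:   # naive strategy: sweep every variable
        for effect in oracle.intervene(var):
            hypothesis.add((var, effect))   # a descendant, not necessarily a direct edge
    return hypothesis

oracle = Oracle(edges={("A", "B"), ("B", "C")}, variables="ABC")
print(discover(oracle, budget=3))           # {('A', 'B'), ('A', 'C'), ('B', 'C')}
```

Note that a single intervention exposes descendants rather than direct parents, so the naive sweep above over-reports edges; distinguishing direct from indirect effects is exactly the kind of multi-turn reasoning such a benchmark probes.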

📝 Abstract
Given the remarkable performance of Large Language Models (LLMs), an important question arises: can LLMs conduct human-like scientific research, discover new knowledge, and act as AI scientists? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. In response to these limitations, we introduce a novel benchmark, Auto-Bench, that encompasses the aspects necessary to evaluate LLMs for scientific discovery in both the natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, including generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of the underlying interactions, such as chemical and social interactions, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as problem complexity increases, which suggests an important gap between machine and human intelligence that the future development of LLMs needs to take into account.
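
The abstract does not spell out the scoring rule, but causal-discovery benchmarks commonly compare the recovered edge set against the hidden ground truth. As one plausible instantiation (an assumption for illustration, not necessarily Auto-Bench's official metric), the structural Hamming distance counts the edge additions and deletions separating the two graphs:

```python
# Illustrative scoring: structural Hamming distance (SHD) between the
# agent's recovered edge set and the hidden ground-truth graph. This is
# an assumed metric for illustration, not Auto-Bench's stated score.

def structural_hamming_distance(true_edges, predicted_edges):
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    missing = true_edges - predicted_edges    # edges the agent failed to recover
    extra = predicted_edges - true_edges      # edges the agent wrongly asserted
    return len(missing) + len(extra)

truth = {("A", "B"), ("B", "C")}
guess = {("A", "B"), ("A", "C"), ("B", "C")}  # the naive sweep's output above
print(structural_hamming_distance(truth, guess))  # 1: spurious edge A -> C
```

Under this metric the naive sweep from the earlier sketch scores 1, penalized for mistaking a descendant for a direct effect.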
Problem

Research questions and friction points this paper is trying to address.

How to evaluate LLMs at scientific discovery
The lack of a standardized benchmark for AI scientists
How to assess causal graph understanding in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated benchmark for LLMs
Causal graph discovery method
Interactive oracle refinement process