NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing scientific law discovery benchmarks face a trilemma among scientific relevance, scalability, and memorization resistance, while oversimplifying discovery as static function fitting and ignoring its inherently interactive, exploratory nature. This paper introduces NewtonBench, a benchmark comprising 324 tasks spanning 12 physics domains. It pioneers the "metaphysical shift" mechanism, which systematically alters canonical laws to generate dynamic, scientifically grounded tasks that are both scalable and memorization-resistant. Crucially, it moves evaluation from static function fitting to interactive model exploration by LLM-based agents, integrating code interpreters to enable controlled simulation, active intervention, and dynamic observation. Experiments reveal that state-of-the-art models exhibit only nascent and fragile discovery capabilities, with performance degrading markedly as system complexity and noise increase. Surprisingly, tool augmentation can cause high-capability models to prematurely converge on suboptimal solutions.
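The paper's actual task generator is not reproduced here; as a rough illustration of what a "metaphysical shift" means, the sketch below (hypothetical names and parameter choices, Python) systematically alters Newton's law of gravitation and exposes the shifted law only through a simulator interface, so an agent must probe it experimentally rather than recall the textbook form.

```python
import random

def make_shifted_gravity(seed=0):
    """Return a simulator for a 'metaphysically shifted' gravitational law.

    Illustrative only: the canonical F = G * m1 * m2 / r**2 is systematically
    altered (here, a perturbed constant and a shifted distance exponent), so
    the hidden law stays scientifically plausible but cannot be memorized.
    """
    rng = random.Random(seed)
    G = 6.674e-11 * rng.uniform(0.5, 2.0)   # rescaled constant (hypothetical choice)
    p = rng.choice([1.5, 2.5, 3.0])         # shifted exponent instead of 2
    def simulate(m1, m2, r, noise=0.0):
        force = G * m1 * m2 / r ** p
        return force * (1.0 + rng.gauss(0.0, noise))  # optional observational noise
    return simulate

# The agent only ever sees input/output pairs from the simulator:
sim = make_shifted_gravity(seed=42)
print(sim(5.0, 3.0, 2.0))              # noiseless observation
print(sim(5.0, 3.0, 2.0, noise=0.05))  # noisy observation
```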

📝 Abstract
Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiment reveals a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
Problem

Research questions and friction points this paper is trying to address.

Evaluating scientific law discovery faces relevance-scalability-memorization tradeoffs
Existing benchmarks oversimplify discovery as static function fitting
Current methods fail to capture interactive exploration of complex systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metaphysical shifts generate scalable and memorization-resistant problems
Interactive model discovery replaces static function fitting
Simulated complex systems require experimental probing for hidden principles (see the sketch after this list)
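Interactive discovery replaces one-shot curve fitting with an experiment-design/observation loop. The sketch below is a crude stand-in for what an LLM agent with a code interpreter might do, not the paper's protocol; it assumes the hypothetical make_shifted_gravity simulator from the earlier sketch and recovers the hidden distance exponent from a controlled radius sweep.

```python
import math

def estimate_exponent(sim, m1=1.0, m2=1.0, radii=(1.0, 2.0, 4.0, 8.0)):
    """Recover the hidden distance exponent by actively choosing experiments.

    Hold the masses fixed, sweep r, then fit a line in log-log space;
    the negative slope is the exponent of the hidden power law.
    """
    points = [(math.log(r), math.log(sim(m1, m2, r))) for r in radii]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / \
            sum((x - mean_x) ** 2 for x, _ in points)
    return -slope

sim = make_shifted_gravity(seed=42)   # hypothetical simulator from the earlier sketch
print(f"estimated exponent: {estimate_exponent(sim):.2f}")
```

A real agent in this setting would additionally have to decide which variables to intervene on, cope with observational noise, and propose a full symbolic form of the law, which is where the benchmark's difficulty lies.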