DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether large language models (LLMs) genuinely reason about physical mechanisms or merely recall memorized science, and evaluates their capacity for scientific discovery under unfamiliar physical laws. It introduces DiscoverPhysics, an interactive benchmark of 22 virtual worlds, generated on demand by an N-body simulator, whose physics deviates from reality. Agents must design experiments, analyze the resulting trajectories, formulate a natural-language explanation, and implement the inferred law of motion in Python. Across eleven frontier models, even the strongest agents pass only about half of the worlds and consistently fail on those that require uncovering latent structure. Open-source models lag substantially behind commercial ones, and good predictive accuracy does not guarantee high explanation quality. The work offers a systematic assessment of LLMs' long-horizon reasoning and conceptual understanding in open-ended scientific discovery.

📝 Abstract

Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator. The agent proposes several rounds of experiments to the simulator, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles, and an LLM-judged explanation score that follows an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in designing informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality, and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
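To make the trajectory-MSE axis concrete, the following is a minimal illustrative sketch, not the benchmark's actual code. It assumes a Yukawa-type screened potential V(r) = -G m_i m_j exp(-r/λ)/r as one possible reading of "screened gravity", a velocity-Verlet integrator, and a plain mean-squared error over positions. All function names, parameter values, and the initial conditions are hypothetical.

```python
from functools import partial

import numpy as np


def screened_gravity_accel(pos, masses, G=1.0, screening_length=2.0, eps=0.05):
    """Accelerations under an assumed Yukawa-screened gravity law.

    pos: (N, D) positions, masses: (N,). eps is a softening length that
    keeps the force finite during close encounters.
    """
    diff = pos[None, :, :] - pos[:, None, :]           # diff[i, j] = pos[j] - pos[i]
    r = np.sqrt((diff ** 2).sum(axis=-1) + eps ** 2)   # softened pair distances
    # |a_i| from particle j, obtained from -dV/dr of the screened potential.
    mag = G * masses[None, :] * np.exp(-r / screening_length) * (
        1.0 / r ** 2 + 1.0 / (screening_length * r)
    )
    np.fill_diagonal(mag, 0.0)                         # no self-interaction
    return (mag[:, :, None] * diff / r[:, :, None]).sum(axis=1)


def simulate(pos, vel, masses, accel_fn, dt=0.01, steps=100):
    """Integrate with velocity Verlet; returns positions of shape (steps+1, N, D)."""
    traj = np.empty((steps + 1,) + pos.shape)
    traj[0] = pos
    acc = accel_fn(pos, masses)
    for t in range(steps):
        vel_half = vel + 0.5 * dt * acc
        pos = pos + dt * vel_half
        acc = accel_fn(pos, masses)
        vel = vel_half + 0.5 * dt * acc
        traj[t + 1] = pos
    return traj


def trajectory_mse(pred, true):
    """Mean squared position error over all time steps, particles, and axes."""
    return float(np.mean((pred - true) ** 2))


# Hypothetical held-out initial conditions (three particles in 2D).
pos0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.5]])
vel0 = np.array([[0.0, -0.2], [0.0, 0.6], [-0.5, 0.0]])
masses = np.array([1.0, 0.5, 0.8])

# "True" world: screened gravity. Candidate submission: plain Newtonian
# gravity, recovered here as the limit of an infinite screening length.
newtonian_accel = partial(screened_gravity_accel, screening_length=np.inf)

traj_true = simulate(pos0, vel0, masses, screened_gravity_accel)
traj_newton = simulate(pos0, vel0, masses, newtonian_accel)

mse_exact = trajectory_mse(traj_true, traj_true)
mse_newton = trajectory_mse(traj_newton, traj_true)
print(f"MSE of the exact law: {mse_exact}")
print(f"MSE of the Newtonian guess: {mse_newton}")
```

In this toy setting, a submission that falls back on familiar Newtonian gravity yields a nonzero MSE against the screened world, while the correct law reproduces the trajectories exactly. A low MSE alone would still say nothing about whether the accompanying explanation is conceptually right, which is why the abstract pairs it with a rubric-based explanation score.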

Problem

Research questions and friction points this paper is trying to address.

scientific reasoning

physics discovery

out-of-distribution generalization

interactive benchmark

latent structure discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive benchmark

scientific discovery

latent physics

hypothesis refinement

LLM reasoning
