PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing benchmarks cannot rigorously evaluate the scientific reasoning of large language model (LLM)-based agents, in particular how strongly they depend on prior knowledge and how they adapt to environmental complexity. Method: We introduce the first benchmark for scientific discovery in interactive physical environments. It offers fine-grained, controllable modulation of the prior knowledge available to the agent and disentangles performance along hypothesis generation, active exploration, and constrained data acquisition. The benchmark combines high-fidelity physics simulation, structured data collection, and standardized evaluation protocols to ensure reproducibility and quantifiability. Contribution/Results: Experiments show substantial performance differences across LLMs as prior knowledge availability and task complexity vary, demonstrating the benchmark's discriminative power and scalability. PhysGym thus provides a principled, extensible framework for evaluating and advancing LLM agents on scientific reasoning tasks.

📝 Abstract
Evaluating the scientific discovery capabilities of large language model (LLM)-based agents, particularly how they cope with varying environmental complexity and utilize prior knowledge, requires specialized benchmarks that are currently lacking. To address this gap, we introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning in interactive physics environments. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent, which allows researchers to dissect agent performance along axes such as problem complexity and prior knowledge level. The benchmark comprises a suite of interactive simulations in which agents must actively probe environments, gather data sequentially under constraints, and formulate hypotheses about the underlying physical laws. PhysGym provides standardized evaluation protocols and metrics for assessing hypothesis accuracy and model fidelity. We demonstrate the benchmark's utility with results from baseline LLMs, showing its ability to differentiate capabilities under varying priors and task complexity.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' scientific discovery in physics environments
Controlling prior knowledge levels for performance analysis
Evaluating hypothesis accuracy in interactive physics simulations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled prior knowledge for LLM agents
Interactive physics simulation benchmark suite
Standardized evaluation of hypothesis accuracy
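To make the interaction protocol concrete (actively probing an environment under a constrained data budget, then hypothesizing the underlying law), here is a minimal toy sketch in the spirit of the benchmark. The `PendulumEnv` class, the `discover_g` agent, and all parameter names are hypothetical illustrations, not part of PhysGym's actual API:

```python
import math
import random

class PendulumEnv:
    """Toy interactive environment (hypothetical, not PhysGym's API).

    The agent may query the pendulum's period for a chosen length under a
    fixed experiment budget, and must then hypothesize the underlying law.
    """
    G = 9.81  # hidden ground-truth constant the agent tries to recover

    def __init__(self, budget=10, noise=0.01, seed=0):
        self.budget = budget      # constrained data acquisition
        self.noise = noise        # relative measurement noise
        self.rng = random.Random(seed)

    def query(self, length):
        # Each probe consumes one unit of the experiment budget.
        if self.budget <= 0:
            raise RuntimeError("experiment budget exhausted")
        self.budget -= 1
        period = 2 * math.pi * math.sqrt(length / self.G)
        return period * (1 + self.rng.gauss(0, self.noise))

def discover_g(env, lengths):
    """Naive 'agent': probe several lengths, then estimate g from T^2 = 4*pi^2*L/g."""
    samples = [(L, env.query(L)) for L in lengths]
    # Least-squares slope of T^2 vs. L through the origin equals 4*pi^2/g.
    slope = sum(L * T**2 for L, T in samples) / sum(L * L for L, _ in samples)
    return 4 * math.pi**2 / slope

env = PendulumEnv(budget=5)
g_hat = discover_g(env, lengths=[0.5, 1.0, 1.5, 2.0])
```

A full benchmark task would additionally score the hypothesized symbolic law against the hidden simulator dynamics, and the amount of prior knowledge (e.g., whether the agent is told the system is a pendulum at all) would be the controlled variable.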