🤖 AI Summary
Existing benchmarks for evaluating scientific discovery struggle to assess agents’ capabilities in active information gathering, measurement planning, and model inference under physical and cost constraints. To address this gap, this work proposes the MaD Physics benchmark, which comprises three interactive environments, each grounded in a distinct physical law, with altered versions of those laws included. Agents are evaluated on their ability to collect data within a limited measurement budget, infer the underlying physical law, and predict future system states. MaD Physics combines resource-constrained active measurement with scientific reasoning, uses altered physical laws to mitigate contamination from existing knowledge, and supports evaluation of multimodality and in-context learning. Experiments with four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash) reveal shortcomings in structured exploration and data collection, pointing to directions for improving scientific reasoning capabilities.
📝 Abstract
Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena that improve our understanding. Existing benchmarks for evaluating scientific discovery agents focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark that evaluates the ability of agents to make informative measurements and draw conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget. It must then infer the underlying physical law and use it to predict the future state of the system. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.
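The trial structure described in the abstract (measure until the budget runs out, then infer the law and predict a future state) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the benchmark's actual interface: the `ToyEnvironment` class, the altered power law x(t) = a·t^p, the cost model in which a measurement with noise level `sigma` costs `1/sigma`, and the fixed, evenly spaced measurement schedule standing in for an agent's plan.

```python
import math
import random


class ToyEnvironment:
    """Hypothetical environment (not the MaD Physics API).

    The hidden law is x(t) = a * t**p, with p altered away from a familiar value.
    """

    def __init__(self, a=4.9, p=2.3, budget=100.0, seed=0):
        self._a, self._p = a, p
        self.budget = budget
        self._rng = random.Random(seed)

    def cost(self, sigma):
        # Assumed cost model: more precise (lower-noise) measurements cost more.
        return 1.0 / sigma

    def measure(self, t, sigma):
        """Return a noisy reading of x(t), or None if the budget cannot cover it."""
        c = self.cost(sigma)
        if c > self.budget:
            return None
        self.budget -= c
        # Multiplicative noise keeps readings positive for the log-log fit.
        return self.true_state(t) * math.exp(self._rng.gauss(0.0, sigma))

    def true_state(self, t):
        return self._a * t ** self._p


def fit_power_law(samples):
    """Least-squares fit of log x = log a + p * log t; returns (a, p)."""
    if len(samples) < 2:
        raise ValueError("need at least two measurements to fit two parameters")
    xs = [math.log(t) for t, _ in samples]
    ys = [math.log(x) for _, x in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((u - mx) ** 2 for u in xs)
    sxy = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    p = sxy / sxx
    return math.exp(my - p * mx), p


def run_trial(env, sigma=0.1, t_future=5.0):
    """Measure on a fixed schedule until the budget is spent, then fit and predict."""
    samples, t = [], 0.5
    while (reading := env.measure(t, sigma)) is not None:
        samples.append((t, reading))
        t += 0.25
    a_hat, p_hat = fit_power_law(samples)
    prediction = a_hat * t_future ** p_hat
    truth = env.true_state(t_future)
    return len(samples), p_hat, abs(prediction - truth) / truth


if __name__ == "__main__":
    n, p_hat, rel_error = run_trial(ToyEnvironment())
    print(f"measurements={n}, fitted exponent={p_hat:.3f}, relative error={rel_error:.3f}")
```

With the default settings, the budget of 100 covers ten measurements at `sigma=0.1`. Choosing a smaller `sigma` buys cleaner readings but fewer of them, which is the quality-versus-quantity trade-off the abstract refers to. An actual agent would also choose when and how precisely to measure instead of following a fixed schedule.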