🤖 AI Summary
This work addresses the core challenge of automating scientific discovery by proposing KeplerAgent, a novel framework that explicitly models the multi-stage reasoning process of scientists using a large language model (LLM) agent. KeplerAgent first infers structural properties of a physical system through symmetry analysis and then leverages these physically grounded constraints to guide a symbolic regression engine, enabling stepwise and interpretable equation discovery. By integrating physical priors with data-driven symbolic regression, the method substantially outperforms existing LLM-based and traditional approaches across multiple physics benchmarks. Notably, it maintains high symbolic accuracy and robustness even in the presence of noisy observational data, demonstrating its potential for reliable, interpretable scientific discovery from real-world measurements.
📝 Abstract
Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM-based and traditional baselines.
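To make the "priors restrict the candidate space" idea concrete, here is a minimal sketch of how inferred physical properties could be mapped to a restricted operator library before handing off to a symbolic regression engine. All names (`build_library`, the prior keys, the operator strings) are illustrative assumptions, not KeplerAgent's actual interface; PySR, for instance, accepts such operator lists via its `binary_operators`/`unary_operators` arguments.

```python
def build_library(priors):
    """Hypothetical mapping from inferred physical priors to a
    restricted set of candidate operators for symbolic regression.
    `priors` is a dict of boolean flags produced upstream
    (e.g. by symmetry analysis); keys here are illustrative."""
    library = {"+", "*"}  # always allow linear combinations and products
    if priors.get("even_symmetry"):
        # f(-x) = f(x): keep only even building blocks, drop odd ones
        library |= {"cos", "square"}
    else:
        library |= {"sin", "cos", "square", "cube"}
    if priors.get("scale_invariant"):
        # Scale invariance suggests power laws; periodic terms are excluded
        library |= {"pow"}
        library -= {"sin", "cos"}
    return sorted(library)

# Example: an even-symmetric system never receives odd operators like sin
print(build_library({"even_symmetry": True}))
```

The point of the sketch is only the control flow: structural conclusions from physics tools become hard constraints on the regression engine's search space, rather than hints in a prompt.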