Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current AI agents struggle to achieve strong statistical fit and physical consistency at the same time in scientific modeling. This work introduces Stargazer, a scalable dynamic benchmark built on radial-velocity (RV) time-series data. It comprises 120 model-fitting tasks across three difficulty tiers, including 20 real archival cases, and it scores agents on adherence to physical constraints as well as on statistical fit. Combining astrophysical simulations, tiered task design, and an evaluation framework for large language model-based agents, the study assesses eight frontier agents. The agents often attain good statistical fits but frequently fail to recover the correct physical parameters of the system. Increasing test-time compute yields only marginal gains, and heavy token usage often reflects recursive failure loops rather than meaningful exploration. The authors suggest that this simulation-driven evaluation design may generalize to model-fitting problems in other scientific domains.

📝 Abstract

The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/Gudmorning2025/Stargazer and https://gudmorning2025.github.io/Stargazer, respectively.
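To make the distinction between statistical fit and physical parameter recovery concrete, here is a minimal sketch of an RV model-fitting task scored on both axes. It is not Stargazer's code or scoring rule: the circular-orbit model with zero systemic velocity, the period grid search, the function names (`make_rv`, `fit_at_period`, `grid_fit`, `score`), and the thresholds (reduced chi-square below 2, 5% relative tolerance) are all assumptions chosen for illustration.

```python
import math
import random


def make_rv(times, period, k_amp, phase, sigma, seed=0):
    """Simulate RVs (m/s) for a circular orbit with Gaussian noise (assumed model)."""
    rng = random.Random(seed)
    return [
        k_amp * math.sin(2 * math.pi * t / period + phase) + rng.gauss(0.0, sigma)
        for t in times
    ]


def fit_at_period(times, rvs, period):
    """Linear least squares for rv = a*sin(wt) + b*cos(wt) at a fixed trial period.

    Returns the semi-amplitude K = hypot(a, b) and the residual sum of squares.
    """
    w = 2 * math.pi / period
    s = [math.sin(w * t) for t in times]
    c = [math.cos(w * t) for t in times]
    ss = sum(x * x for x in s)
    cc = sum(x * x for x in c)
    sc = sum(x * y for x, y in zip(s, c))
    sy = sum(x * y for x, y in zip(s, rvs))
    cy = sum(x * y for x, y in zip(c, rvs))
    det = ss * cc - sc * sc  # determinant of the 2x2 normal equations
    a = (sy * cc - sc * cy) / det
    b = (ss * cy - sc * sy) / det
    rss = sum((y - a * si - b * ci) ** 2 for y, si, ci in zip(rvs, s, c))
    return math.hypot(a, b), rss


def grid_fit(times, rvs, sigma, periods):
    """Scan trial periods and keep the one with the lowest reduced chi-square."""
    dof = len(times) - 3  # three fitted parameters: a, b, and the period
    best = None
    for p in periods:
        k_amp, rss = fit_at_period(times, rvs, p)
        chi2_red = rss / (sigma ** 2 * dof)
        if best is None or chi2_red < best["chi2_red"]:
            best = {"period": p, "k_amp": k_amp, "chi2_red": chi2_red}
    return best


def score(fit, truth, rel_tol=0.05, chi2_max=2.0):
    """Score a fit on two separate axes (illustrative thresholds, not the paper's)."""
    stat_ok = fit["chi2_red"] < chi2_max
    phys_ok = all(
        abs(fit[key] - truth[key]) <= rel_tol * truth[key]
        for key in ("period", "k_amp")
    )
    return stat_ok, phys_ok


# Synthetic high-SNR single-planet case: 40 epochs spread over 100 days.
truth = {"period": 12.3, "k_amp": 50.0}
rng = random.Random(1)
times = sorted(rng.uniform(0.0, 100.0) for _ in range(40))
sigma = 2.0
rvs = make_rv(times, truth["period"], truth["k_amp"], phase=0.7, sigma=sigma)

periods = [2.0 + 0.01 * i for i in range(4801)]  # trial periods from 2 to 50 days
fit = grid_fit(times, rvs, sigma, periods)
stat_ok, phys_ok = score(fit, truth)
print(fit, stat_ok, phys_ok)
```

Keeping the two checks separate is the point of the sketch: a fit can pass the chi-square test while landing on an alias period or a wrong amplitude, which is the kind of gap the abstract describes. Real RV tasks also involve eccentric orbits, multiple planets, and instrument offsets, none of which this toy model covers.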

Problem

Research questions and friction points this paper is trying to address.

AI agents

model-fitting

astrophysical constraints

radial-velocity data

physical parameter recovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

model-fitting

AI agents

astrophysical constraints