🤖 AI Summary
Existing space situational awareness systems detect spacecraft maneuvers but struggle to interpret their physical and tactical implications. To address this gap, this work proposes AstroMind, a benchmark that combines high-fidelity astrodynamics simulations with realistic observational constraints. Its evaluation jointly scores physical consistency and semantic correctness across three core tasks: intent inference, maneuver parameter estimation, and threat assessment. Scenarios include noisy observations and multi-source textual intelligence of varying reliability. Experiments on open-weight large language models show that no single model leads on every axis: Qwen3 (32B) is most accurate on intent inference, QwQ (32B) leads on threat assessment and achieves the lowest median relative error on parsed items, and GPT-OSS (20B) produces the strongest judged reasoning quality and extracts the most scalar values for parameter estimation. Structured reasoning prompts consistently improve the tested 8B models, with larger gains for models that can already track physical constraints.
📝 Abstract
Understanding why a spacecraft maneuvers, rather than simply that it did, is an increasingly important problem for space domain awareness as Earth orbits grow crowded and contested. Current analysis pipelines are built for detection: they are good at picking up that something happened, less good at reasoning about what it means. AstroMind is a physics-grounded benchmark designed to close that gap. It draws on high-fidelity astrodynamics simulations and real observational constraints, converting them into verifiable reasoning problems across three task types: intent inference, maneuver parameter estimation, and threat assessment. Each scenario includes realistic sensing noise and multi-source textual intelligence at varying reliability levels. Evaluation metrics capture both semantic correctness and quantitative consistency under physical constraints. Benchmarking a suite of open-weight models shows that no single model dominates every axis: Qwen3 (32B) leads on intent inference accuracy; QwQ (32B) leads on threat assessment and achieves the lowest median relative error on parsed items; GPT-OSS (20B) produces the strongest judged reasoning quality and extracts the most scalar values for parameter estimation (136 of 241 parsed items). Training data composition and reasoning style matter as much as model size. Structured reasoning prompts help consistently across the tested 8B models, with larger gains for those that can already track physical constraints. AstroMind gives the field a shared test for a problem where getting the physics right and reading the tactical situation correctly are both required; neither is sufficient on its own.