🤖 AI Summary
Large language models (LLMs) often lack deep domain-specific knowledge and strong reasoning in specialized fields such as astronomy, a weakness most pronounced at small-to-medium parameter scales. To address this, we introduce AstroSage-70B: the first 70-billion-parameter LLM designed exclusively for the full spectrum of astronomical domains. Our method embeds interpretable reasoning chains directly into the supervised fine-tuning (SFT) data, enabling a “thinking-as-output” astronomical reasoning paradigm, and combines this with continued pre-training on astronomical literature, multi-stage domain-adaptive training, model merging, rigorous data curation, and careful hyperparameter selection. Evaluated on the AstroMLab-1 benchmark (4,425 questions), AstroSage-70B achieves state-of-the-art performance, outperforming all tested open- and closed-source baselines, including o3, Gemini-2.5-Pro, and Claude-3.7-Sonnet, and substantially alleviating the long-standing trade-off between model scale and domain expertise.
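To make the “thinking-as-output” idea concrete, below is a minimal sketch of what a reasoning-chain-augmented SFT record could look like. The field names, the `<think>...</think>` delimiters, and the example content are illustrative assumptions, not the actual AstroSage training schema.

```python
# Minimal sketch (assumed format) of an SFT record that embeds an
# interpretable reasoning chain before the final answer. Because the
# chain is part of the training target, the model learns it may emit
# a human-readable thought process before answering, or skip straight
# to the answer.

sft_example = {
    "messages": [
        {
            "role": "user",
            "content": "Why do Type Ia supernovae work as standardizable candles?",
        },
        {
            "role": "assistant",
            # The <think> block and its delimiters are hypothetical here;
            # only the general technique (reasoning chains in SFT data)
            # comes from the paper.
            "content": (
                "<think>\n"
                "Type Ia supernovae arise from carbon-oxygen white dwarfs "
                "exploding near the Chandrasekhar mass, so their peak "
                "luminosities are roughly uniform. Residual scatter "
                "correlates with light-curve width (the Phillips relation), "
                "which allows the peak magnitude to be standardized.\n"
                "</think>\n"
                "Because they detonate near a common mass limit, their peak "
                "luminosities are nearly uniform, and the remaining scatter "
                "can be corrected with the light-curve width-luminosity "
                "(Phillips) relation, yielding calibrated distances."
            ),
        },
    ]
}

print(sft_example["messages"][1]["content"])
```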
📝 Abstract
General-purpose large language models, despite their broad capabilities, often struggle with specialized domain knowledge, a limitation particularly pronounced in more accessible, lower-parameter versions. This gap hinders their deployment as effective agents in demanding fields such as astronomy. Building on our prior work with AstroSage-8B, this study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Llama-3.1-70B foundation, AstroSage-70B underwent extensive continued pre-training on a vast corpus of astronomical literature, followed by supervised fine-tuning (SFT) and model merging. Beyond its 70-billion-parameter scale, the model benefits from refined datasets, judiciously chosen learning hyperparameters, and improved training procedures. Notably, we integrated reasoning chains into the SFT dataset, enabling AstroSage-70B either to answer a user query immediately or to first emit a human-readable thought process. Evaluated on the AstroMLab-1 benchmark, comprising 4,425 questions from literature withheld during training, AstroSage-70B achieves state-of-the-art performance on complex astronomical tasks. It surpasses all other tested open-weight and proprietary models, including leading systems such as o3, Gemini-2.5-Pro, Claude-3.7-Sonnet, DeepSeek-R1, and Qwen-3-235B, even those with API costs two orders of magnitude higher. This work demonstrates that domain specialization, when applied to large-scale models, can enable them to outperform generalist counterparts in specialized knowledge areas such as astronomy, thereby advancing the frontier of AI capabilities in the field.
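The abstract states that model merging was applied after fine-tuning but does not specify the method. The sketch below shows one common approach, linear weight interpolation between two same-architecture checkpoints; the function, checkpoint names, and mixing weight `alpha` are hypothetical, not the procedure actually used for AstroSage-70B.

```python
# Sketch of model merging via linear weight interpolation ("model soup"
# style averaging). This is an assumed method for illustration; the
# abstract only states that model merging was used, not how.
import torch
from transformers import AutoModelForCausalLM

def merge_linear(domain_path: str, general_path: str, alpha: float = 0.7):
    """Return a model whose weights are alpha*domain + (1 - alpha)*general."""
    domain = AutoModelForCausalLM.from_pretrained(domain_path, torch_dtype=torch.bfloat16)
    general = AutoModelForCausalLM.from_pretrained(general_path, torch_dtype=torch.bfloat16)
    general_state = general.state_dict()
    merged_state = {}
    for name, w_domain in domain.state_dict().items():
        # Interpolate parameter-by-parameter; both checkpoints must share
        # the same architecture (here, Llama-3.1-70B derivatives).
        merged_state[name] = alpha * w_domain + (1.0 - alpha) * general_state[name]
    domain.load_state_dict(merged_state)
    return domain

# Hypothetical usage: blend a domain-specialized checkpoint with the
# general instruct model to retain instruction-following behavior.
# merged = merge_linear("astrosage-70b-sft", "meta-llama/Llama-3.1-70B-Instruct")
# merged.save_pretrained("astrosage-70b-merged")
```

A typical motivation for this step is that continued pre-training and SFT can erode some general instruction-following ability, and averaging weights with the base instruct model is one way to recover it while keeping the domain expertise.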