🤖 AI Summary
The atmospheric sciences lack a dedicated large language model (LLM) evaluation benchmark. Method: We introduce the first comprehensive, multi-domain benchmark covering hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. It features a templated approach for automatically generating graduate-level multiple-choice questions and a cross-subfield, domain-adapted evaluation framework that integrates domain-knowledge injection with standardized assessment protocols. Contribution/Results: We conduct the first systematic side-by-side comparison of four LLM categories—instruction-tuned, reasoning-enhanced, math-enhanced, and climate-specialized—evaluating accuracy, reasoning-chain quality, and disciplinary coverage. The results reveal pervasive weaknesses in physical consistency and multi-step reasoning across current LLMs, with math-enhanced and domain-specialized models performing best. The benchmark dataset, evaluation framework, and code are fully open-sourced.
📝 Abstract
The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. To address this need, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. We employ a template-based question generation framework that enables scalable, diverse multiple-choice questions curated from graduate-level atmospheric science problems. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis yields insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework. Our source code is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.
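To make the template-based question generation idea concrete, here is a minimal, hypothetical sketch: a question template with symbolic placeholders is instantiated with randomized physical values, and distractors are derived by perturbing the correct answer. All function names, the example template, and the perturbation scheme are illustrative assumptions, not the actual AtmosSci-Bench implementation.

```python
import random

def generate_question(seed):
    """Instantiate one multiple-choice question from a hydrostatics template.

    Hypothetical sketch: the template computes the hydrostatic pressure
    difference p = rho * g * dz under a constant-density approximation,
    then builds distractors by systematically perturbing the answer.
    """
    rng = random.Random(seed)  # seeded RNG for reproducible instances
    rho = rng.choice([1.0, 1.1, 1.2])  # air density, kg/m^3
    dz = rng.choice([100, 200, 500])   # height interval, m
    g = 9.81                           # gravitational acceleration, m/s^2
    answer = rho * g * dz              # correct pressure difference, Pa

    # Distractors: plausible systematic errors (halving, doubling, offset).
    options = [answer, answer * 0.5, answer * 2.0, answer + 100.0]
    rng.shuffle(options)

    stem = (f"Assuming a constant air density of {rho} kg/m^3, what is the "
            f"hydrostatic pressure difference across a {dz} m layer? "
            f"(g = {g} m/s^2)")
    return {
        "stem": stem,
        "options": options,
        "answer_index": options.index(answer),
    }

q = generate_question(seed=42)
```

Because each instance is seeded, the same template yields many distinct but reproducible questions, which is what makes this style of benchmark scalable and resistant to memorization.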