AI Summary
Climate science research is hindered by the explosion of multiscale data, fragmented analytical tools, and the limited physical grounding and complex-reasoning capabilities of existing large language models (LLMs). To address these challenges, this work proposes ClimAgent, a general-purpose autonomous analysis framework tailored to real-world climate science scenarios. ClimAgent enables LLMs to execute end-to-end, cross-subfield climate modeling tasks through a unified tool-calling mechanism and a rigorous multi-step reasoning protocol. The study also introduces ClimaBench, a comprehensive benchmark of professional tasks spanning five major categories, derived from real-world scenarios between 2000 and 2025. Experimental results show that ClimAgent improves solution rigor and practicality by 40.21% over baseline LLM solutions, addressing key limitations in applying LLMs to complex scientific reasoning.
Abstract
Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complexity of analytical tools have created significant bottlenecks, constraining scientific discovery to fragmented and labor-intensive workflows. While the emergence of Large Language Models (LLMs) offers a transformative paradigm for scaling scientific expertise, existing explorations remain largely confined to simple Question-Answering (Q&A) tasks. These approaches often oversimplify real-world challenges, neglecting the intricate physical constraints and the data-driven nature of professional climate science. To bridge this gap, we introduce ClimAgent, a general-purpose autonomous framework designed to execute a wide spectrum of research tasks across diverse climate sub-fields. By integrating a unified tool-use environment with rigorous reasoning protocols, ClimAgent goes beyond simple retrieval to perform end-to-end modeling and analysis. To foster systematic evaluation, we propose ClimaBench, the first comprehensive benchmark for real-world climate discovery. It encompasses challenging problems spanning 5 distinct task categories derived from professional scenarios between 2000 and 2025. Experiments on ClimaBench demonstrate that ClimAgent significantly outperforms state-of-the-art baselines, achieving a 40.21% improvement over original LLM solutions in solution rigor and practicality. Our code is available at https://github.com/usail-hkust/ClimAgent.