🤖 AI Summary
This work addresses the challenge of diagnosing extreme weather events, which requires multi-step logical reasoning, dynamic tool invocation, and integration of expert meteorological knowledge—capabilities inadequately supported by existing approaches. To this end, the authors propose a multi-agent system that embeds domain-specific meteorological expertise and employs a closed-loop “hypothesize–verify–replan” mechanism to iteratively refine diagnoses of anomalous signals. The study further introduces a novel evaluation benchmark structured around atomic-level subtasks, enabling fine-grained, expert-level validation and assessment of dynamic reasoning capabilities. Experimental results demonstrate that the proposed method significantly improves diagnostic accuracy and system robustness in complex extreme weather scenarios, outperforming current baselines through its synergistic combination of structured reasoning and domain knowledge.
📝 Abstract
While deep learning-based weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi-step logical reasoning, dynamic tool invocation, and expert-level prior judgment. Although agents possess inherent advantages in task decomposition and autonomous execution, current architectures are still hampered by critical bottlenecks: inadequate expert knowledge integration, a lack of professional-grade iterative reasoning loops, and the absence of fine-grained validation and evaluation systems for complex workflows under extreme conditions. To this end, we propose HVR-Met, a multi-agent meteorological diagnostic system characterized by the deep integration of expert knowledge. Its central innovation is the "Hypothesis-Verification-Replanning" closed-loop mechanism, which facilitates sophisticated iterative reasoning for anomalous meteorological signals during extreme weather events. To bridge gaps within existing evaluation frameworks, we further introduce a novel benchmark focused on atomic-level subtasks. Experimental evidence demonstrates that the system excels in complex diagnostic scenarios.