🤖 AI Summary
Traditional meteorological evaluation metrics fail to meet domain experts' needs for interpretable, dynamically evolving forecast quality analysis. To address this, we propose a novel multimodal quality assessment paradigm designed specifically for weather radar forecasting, integrating radar image sequences with natural-language descriptions. We introduce RQA-70K, the first large-scale, mixed-annotation dataset for radar quality assessment, and formulate evaluation tasks along two dimensions: (i) single-frame versus sequence-level analysis, and (ii) scalar scoring versus descriptive commentary. Methodologically, we devise a multi-stage iterative training strategy to strengthen the physical consistency and expert-level semantic understanding of multimodal large language models (MLLMs). Extensive experiments demonstrate that our approach significantly outperforms general-purpose MLLMs across all evaluation scenarios, with substantial gains in prediction accuracy, interpretability, and characterization of spatiotemporal evolution.
📝 Abstract
Quality analysis of weather forecasts is an essential topic in meteorology. Although traditional score-based evaluation metrics can quantify certain forecast errors, they still fall far short of meteorological experts in descriptive capability, interpretability, and understanding of dynamic evolution. With the rapid development of Multi-modal Large Language Models (MLLMs), these models have become promising tools for overcoming these challenges. In this work, we introduce RadarQA, an MLLM-based weather forecast analysis method that integrates key physical attributes with detailed assessment reports. We introduce a novel and comprehensive task paradigm for multi-modal quality analysis, encompassing both single-frame and sequence-level inputs under both rating and assessment scenarios. To support training and benchmarking, we design a hybrid annotation pipeline that combines human expert labeling with automated heuristics. Using this pipeline, we construct RQA-70K, a large-scale dataset with varying difficulty levels for radar forecast quality evaluation. We further design a multi-stage training strategy that iteratively improves model performance at each stage. Extensive experiments show that RadarQA outperforms existing general MLLMs across all evaluation settings, highlighting its potential for advancing quality analysis in weather prediction.
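The two-axis task paradigm described above (input scope: single frame vs. sequence; output mode: scalar rating vs. free-text assessment) can be sketched as a small enumeration. This is an illustrative sketch only; the class and field names (`InputScope`, `OutputMode`, `RadarQATask`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class InputScope(Enum):
    # hypothetical names: the abstract distinguishes single-frame
    # analysis from sequence-level analysis
    SINGLE_FRAME = "single_frame"
    SEQUENCE = "sequence"


class OutputMode(Enum):
    # scalar quality score vs. descriptive assessment report
    RATING = "rating"
    ASSESSMENT = "assessment"


@dataclass(frozen=True)
class RadarQATask:
    scope: InputScope
    mode: OutputMode


# Crossing the two axes yields the four evaluation settings
# implied by the abstract.
TASKS = [RadarQATask(s, m) for s in InputScope for m in OutputMode]
```

Under this framing, each training or benchmark sample would be tagged with one of the four `(scope, mode)` combinations, e.g. sequence-level rating or single-frame assessment.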