K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a fine-grained evaluation framework for multimodal large language models (MLLMs) in meteorology, a specialized domain that demands expert reasoning, regional and cultural understanding, and chart interpretation. To bridge this gap, the authors introduce K-MetBench, a diagnostic benchmark derived from Korean national meteorological qualification exams. The benchmark evaluates models along four dimensions: expert-level chart reasoning, logical validity checked against expert-verified rationales, Korean geographical and cultural comprehension, and fine-grained domain-specific analysis. Built from authoritative exam questions and human-annotated data, K-MetBench is used to assess 55 models. The results show that model scale alone cannot compensate for missing local knowledge: Korean models significantly outperform larger global models on region-specific tasks. The evaluation also reveals a wide modality gap in interpreting specialized diagrams and a reasoning gap in which models hallucinate their logic even when the final answer is correct.

📝 Abstract

The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench .
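To make the "reasoning gap" concrete, here is a minimal sketch of how per-dimension accuracy and the share of correct answers backed by an invalid rationale could be tallied. It is an illustration only, not the authors' scoring code: the record fields (`dimension`, `answer_correct`, `rationale_valid`), the dimension labels, and the `summarize` helper are all hypothetical, and the sample records are made up for the example.

```python
from collections import defaultdict


def summarize(records):
    """Return per-dimension accuracy and reasoning gap.

    Assumed (hypothetical) record schema:
      dimension        -- label of the evaluation dimension
      answer_correct   -- True if the model chose the right option
      rationale_valid  -- True if an expert judged the rationale sound
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "correct_bad_rationale": 0})
    for r in records:
        t = totals[r["dimension"]]
        t["n"] += 1
        if r["answer_correct"]:
            t["correct"] += 1
            if not r["rationale_valid"]:
                # Right answer, unsound reasoning: counts toward the gap.
                t["correct_bad_rationale"] += 1

    summary = {}
    for dim, t in totals.items():
        accuracy = t["correct"] / t["n"]
        # Fraction of correct answers whose rationale was judged invalid.
        gap = t["correct_bad_rationale"] / t["correct"] if t["correct"] else 0.0
        summary[dim] = {"accuracy": accuracy, "reasoning_gap": gap}
    return summary


# Made-up records, for illustration only.
records = [
    {"dimension": "chart", "answer_correct": True, "rationale_valid": True},
    {"dimension": "chart", "answer_correct": False, "rationale_valid": False},
    {"dimension": "locality", "answer_correct": True, "rationale_valid": True},
    {"dimension": "locality", "answer_correct": True, "rationale_valid": False},
]

summary = summarize(records)
for dim, stats in summary.items():
    print(dim, stats)
```

Reporting the gap as a fraction of correct answers, rather than of all questions, keeps it separate from plain accuracy: a model can score well on answers and still show a large gap if its rationales do not hold up.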

Problem

Research questions and friction points this paper is trying to address.

expert reasoning

locality

multimodality

meteorology

evaluation benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

multidimensional benchmark

expert reasoning

modality gap