KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

📅 2026-03-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a culturally and institutionally grounded Korean benchmark for evaluating vision-language models. It proposes KMMMU, a native Korean multimodal benchmark spanning nine academic disciplines and nine visual modality categories, with a focus on information-dense questions shaped by Korean-specific conventions, official standards, and professional visual formats. Built from exams natively written in Korean, the 3,466-question dataset includes a 300-item Korean-specific subset and a 627-question hard subset, in contrast to translated or English-centric evaluations. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset, highlighting significant gaps in models’ understanding of localized knowledge and domain-specific conventions.

📝 Abstract

We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.
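The per-discipline accuracies and the Korean-specific gap reported in the abstract could be computed along the lines of the minimal sketch below. It is not the authors' evaluation code: the record fields (`discipline`, `korean_specific`, `answer`, `prediction`), the placeholder discipline names, and the reading of the gap as a percentage-point difference between non-Korean-specific and Korean-specific items are all assumptions made for illustration, and the toy records are not KMMMU data.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: accuracy in percent}, grouping records by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        # Exact-match scoring; a real harness may normalize answers first.
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {g: 100.0 * correct[g] / total[g] for g in total}


def accuracy_gap(records, flag):
    """Accuracy where `flag` is False minus accuracy where it is True,
    in percentage points (assumed definition of the gap)."""
    by_flag = accuracy_by_group(records, flag)
    return by_flag[False] - by_flag[True]


# Toy records with hypothetical field names and placeholder disciplines.
records = [
    {"discipline": "discipline_a", "korean_specific": True,
     "answer": "A", "prediction": "B"},
    {"discipline": "discipline_a", "korean_specific": False,
     "answer": "C", "prediction": "C"},
    {"discipline": "discipline_b", "korean_specific": False,
     "answer": "D", "prediction": "D"},
    {"discipline": "discipline_b", "korean_specific": True,
     "answer": "B", "prediction": "B"},
]

per_discipline = accuracy_by_group(records, "discipline")
gap = accuracy_gap(records, "korean_specific")
```

Grouping by an arbitrary key keeps the same helper usable for disciplines, visual modality categories, or subset flags, which is how a breakdown like the one described above might be produced.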

Problem

Research questions and friction points this paper is trying to address.

multimodal understanding

Korean language

cultural context

evaluation benchmark

domain-specific standards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Korean-native benchmark

multimodal understanding

localized knowledge