KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

📅 2026-03-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the lack of a culturally and institutionally grounded Korean multimodal benchmark for evaluating vision-language models. It proposes KMMMU, a native Korean multimodal benchmark spanning nine academic disciplines and nine visual modality categories, with a focus on information-dense questions framed within Korean-specific norms, official standards, and professional visual formats. Constructed from exam materials natively written in Korean, the dataset includes both a Korean-specific subset and a hard subset to overcome the limitations of translated and English-centric evaluations. Experimental results show that even the strongest open-source model achieves only 42.05% accuracy on the full set, while the best proprietary model scores 52.42% on the hard subset, highlighting significant gaps in models' understanding of localized knowledge and domain-specific conventions.

📝 Abstract
We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.
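The abstract reports accuracy on the full set, on subsets, and per discipline, along with a gap on Korean-specific questions. A minimal sketch of how such numbers could be computed from per-question results is given here. The record fields (`discipline`, `korean_specific`, `answer`, `prediction`), the discipline labels, and the toy data are hypothetical and are not taken from the KMMMU release; the gap is computed here as a simple difference between two subset accuracies, which may not match the paper's exact definition.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of records whose prediction matches the reference answer."""
    if not records:
        return 0.0
    hits = sum(r["prediction"] == r["answer"] for r in records)
    return hits / len(records)


def accuracy_by_group(records, key):
    """Return {group value: accuracy} for the records grouped by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = r[key]
        totals[group] += 1
        if r["prediction"] == r["answer"]:
            correct[group] += 1
    return {g: correct[g] / totals[g] for g in totals}


# Toy results with hypothetical field names and placeholder discipline labels.
records = [
    {"discipline": "discipline_1", "korean_specific": True,  "answer": "A", "prediction": "B"},
    {"discipline": "discipline_1", "korean_specific": False, "answer": "C", "prediction": "C"},
    {"discipline": "discipline_2", "korean_specific": False, "answer": "B", "prediction": "B"},
    {"discipline": "discipline_2", "korean_specific": True,  "answer": "D", "prediction": "D"},
]

full_set = overall_accuracy(records)
per_discipline = accuracy_by_group(records, "discipline")
per_locality = accuracy_by_group(records, "korean_specific")

# Assumed gap definition: accuracy on non-Korean-specific items minus
# accuracy on Korean-specific items.
gap = per_locality[False] - per_locality[True]

print(f"full set: {full_set:.2%}")
print(f"per discipline: {per_discipline}")
print(f"Korean-specific gap: {gap:.2%}")
```

Grouping by an arbitrary key keeps the same function usable for disciplines, visual modality categories, or subset flags, so a low-scoring group stands out without separate code paths.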
Problem

Research questions and friction points this paper is trying to address.

multimodal understanding
Korean language
cultural context
evaluation benchmark
domain-specific standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Korean-native benchmark
multimodal understanding
localized knowledge
domain-specific standards
non-English evaluation