JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This paper addresses the lack of culturally aware evaluation benchmarks for large multimodal models (LMMs) in non-English languages, particularly Japanese. To fill this gap, the authors introduce JMMMU, the first large-scale, multidisciplinary, multimodal understanding benchmark explicitly designed for the Japanese cultural context. The benchmark pairs two complementary subsets: a culture-agnostic (CA) subset of culture-independent subjects translated from MMMU, which enables one-to-one comparison with the English original, and a culture-specific (CS) subset of newly crafted subjects grounded in Japanese culture. Together, the two subsets decouple linguistic competence from cultural understanding. Empirical results show that many LMMs suffer a performance drop on the CA subset relative to English, attributable purely to the change of language, and perform substantially worse on the CS subset. Notably, some models score well on the CA subset but poorly on the CS subset, indicating superficial linguistic adaptation without deep cultural comprehension.
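To make the dual-subset protocol concrete, below is a minimal sketch of how per-subset accuracy and the CA-CS gap could be computed. The record schema (`subset`, `prediction`, `answer`) is hypothetical and purely illustrative, not the paper's released evaluation code.

```python
from collections import defaultdict

def subset_accuracies(results):
    """Compute per-subset accuracy from a list of evaluation records.

    Each record is assumed to carry a 'subset' tag ('CA' or 'CS'),
    the model's 'prediction', and the gold 'answer' (hypothetical schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["subset"]] += 1
        correct[r["subset"]] += int(r["prediction"] == r["answer"])
    return {s: correct[s] / total[s] for s in total}

# Example: a model that handles translated (CA) items but fails
# culture-specific (CS) ones shows a large CA-CS gap.
results = [
    {"subset": "CA", "prediction": "B", "answer": "B"},
    {"subset": "CA", "prediction": "C", "answer": "C"},
    {"subset": "CS", "prediction": "A", "answer": "D"},
    {"subset": "CS", "prediction": "B", "answer": "D"},
]
acc = subset_accuracies(results)
print(acc)                                   # {'CA': 1.0, 'CS': 0.0}
print("CA-CS gap:", acc["CA"] - acc["CS"])   # large gap -> shallow cultural understanding
```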

📝 Abstract
Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks based on the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, in which culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with their English counterparts in MMMU; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when evaluated in Japanese that is purely attributable to language variation. Using the CS subset, we reveal their inadequate understanding of Japanese culture. Further, by combining both subsets, we identify LMMs that perform well on the CA subset but not on the CS subset, exposing a shallow command of the Japanese language that lacks depth in cultural understanding. We hope this work will not only help advance LMM performance in Japanese but also serve as a guideline for creating high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.
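Because the CA subset is a direct translation of culture-independent MMMU subjects, the language-only effect can be isolated by comparing a model's correctness on matched item pairs. Below is a minimal sketch under the assumption that each translated item keeps the ID of its English source; this is an illustrative schema, not the released data format.

```python
def language_gap(en_results, ja_results):
    """Per-item comparison of English MMMU vs. translated JMMMU CA items.

    Both inputs map a shared item ID to whether the model answered
    correctly (assumed schema: {item_id: bool}).
    """
    shared = en_results.keys() & ja_results.keys()
    acc_en = sum(en_results[i] for i in shared) / len(shared)
    acc_ja = sum(ja_results[i] for i in shared) / len(shared)
    return acc_en - acc_ja  # positive value = drop purely from language

en = {"math_001": True, "math_002": True, "physics_001": False}
ja = {"math_001": True, "math_002": False, "physics_001": False}
print(f"drop attributable to language: {language_gap(en, ja):.2%}")
```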
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks rarely evaluate LMMs on expert-level tasks grounded in the Japanese cultural context.
LMMs suffer performance drops in Japanese that stem from both language variation and gaps in cultural knowledge.
Apparent Japanese-language competence can mask a shallow understanding that lacks cultural depth.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed JMMMU, the first large-scale Japanese benchmark for expert-level multimodal tasks
Pairs a culture-agnostic (CA) subset translated from MMMU with a newly crafted culture-specific (CS) subset
Decouples Japanese language ability from Japanese cultural understanding when evaluating LMMs