JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

📅 2025-12-16

🤖 AI Summary

Existing Japanese multimodal benchmarks do not evaluate models purely from visual input, which limits how rigorously visual reasoning can be assessed. Method: JMMMU-Pro is an image-based, multi-discipline Japanese benchmark that extends JMMMU by composing each question's text and image into a single image containing Japanese text, so models must read and reason from pixels alone. It is built with Vibe Benchmark Construction: an image generative model (e.g., Nano Banana Pro) produces candidate question images, and humans verify them and regenerate with adjusted prompts when needed. Contribution/Results: All open-source LMMs evaluated struggle substantially on JMMMU-Pro, indicating that it is a difficult benchmark. The authors present it as a more rigorous tool for assessing the Japanese capabilities of LMMs, one that rewards visual comprehension rather than textual shortcuts.

📝 Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
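The generate, verify, and regenerate cycle described in the abstract can be sketched as a small loop. This is only an illustrative sketch, not the authors' pipeline: `build_item`, `Candidate`, `fake_generate`, and `fake_review` are hypothetical names, the image generator and the human reviewer are stand-in callables, and the retry limit and the way feedback is appended to the prompt are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    """An accepted question image plus the prompt that produced it."""
    question_id: str
    prompt: str
    image: bytes
    attempts: int


def build_item(
    question_id: str,
    base_prompt: str,
    generate: Callable[[str], bytes],          # stand-in for an image generative model
    review: Callable[[bytes], Optional[str]],  # stand-in for a human check; None means accepted
    max_attempts: int = 3,                     # assumed retry limit, not from the paper
) -> Optional[Candidate]:
    """Generate a candidate, let a reviewer verify it, and retry with an adjusted prompt."""
    prompt = base_prompt
    for attempt in range(1, max_attempts + 1):
        image = generate(prompt)
        feedback = review(image)
        if feedback is None:
            return Candidate(question_id, prompt, image, attempt)
        # Assumed adjustment strategy: append the reviewer's note to the original prompt.
        prompt = f"{base_prompt}\nRevision note: {feedback}"
    return None  # no candidate passed review within the attempt limit


# Toy stand-ins so the sketch runs without any model or reviewer.
def fake_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def fake_review(image: bytes) -> Optional[str]:
    # Rejects the first draft, accepts once a revision note is present.
    return None if b"Revision note" in image else "Japanese text is cut off"


item = build_item("q1", "Render the question text and figure on one page", fake_generate, fake_review)
```

With these toy stand-ins the first draft is rejected and the second is accepted. In practice `generate` would call an image model and `review` would collect a human judgment.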

Problem

Research questions and friction points this paper is trying to address.

Develops a Japanese multimodal benchmark for integrated visual-textual understanding

Proposes a scalable method using generative models to create visual question images

Evaluates open-source LMMs' performance on Japanese image-based reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses an image generative model to create visual questions

Combines question text and image into a single image

Human verification ensures benchmark quality and realism

🔎 Similar Papers

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models