🤖 AI Summary
Problem: Existing Japanese multimodal understanding benchmarks such as JMMMU supply the question text separately from the question image, so they cannot rigorously test whether a model can understand a question through visual perception alone. Method: We introduce JMMMU-Pro, an image-based, multi-discipline Japanese multimodal understanding benchmark that extends JMMMU by composing each question image and its question text into a single image with embedded Japanese text; models must perform integrated visual-textual reasoning from pixel inputs alone. To construct it, we propose Vibe Benchmark Construction, in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions across a wide range of backgrounds and layouts, and humans verify the outputs and, when necessary, regenerate them with adjusted prompts. Contribution/Results: Experiments show that all evaluated open-source LMMs struggle substantially on JMMMU-Pro. JMMMU-Pro offers a more rigorous tool for assessing the Japanese capabilities of LMMs, and Vibe Benchmark Construction offers an efficient guideline for building future image-based VQA benchmarks.
📝 Abstract
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
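The generate, verify, and regenerate loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `generate` stands in for a call to an image generative model, `review` stands in for the human verification step, and the names `Candidate`, `build_item`, and `max_attempts` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Candidate:
    """An accepted image-based question (hypothetical record layout)."""
    question_id: str
    prompt: str      # the prompt that produced the accepted image
    image: bytes     # rendered question image
    attempts: int    # number of generation rounds needed


def build_item(
    question_id: str,
    prompt: str,
    generate: Callable[[str], bytes],
    review: Callable[[bytes, str], Tuple[bool, Optional[str]]],
    max_attempts: int = 3,
) -> Optional[Candidate]:
    """Generate an image, have a reviewer check it, and retry with an
    adjusted prompt until it is accepted or the attempt budget runs out.

    `generate` is a placeholder for the image model; `review` is a
    placeholder for the human check and returns (accepted, revised_prompt).
    """
    for attempt in range(1, max_attempts + 1):
        image = generate(prompt)
        accepted, revised_prompt = review(image, prompt)
        if accepted:
            return Candidate(question_id, prompt, image, attempt)
        # Keep the old prompt if the reviewer did not supply a revision.
        prompt = revised_prompt or prompt
    return None  # no acceptable image within the budget
```

Passing the generator and the reviewer as callables keeps the sketch independent of any particular model API, and returning `None` makes items that never pass review explicit so they can be handled separately.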