RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
A critical shortage of Chinese multimodal multi-image understanding benchmarks hinders progress in this domain. Method: We introduce RealBench, the first real-world-oriented Chinese multimodal multi-image benchmark, comprising 9,393 samples and 69,910 user-generated images spanning diverse scenarios, resolutions, and structural configurations. Contribution/Results: RealBench pioneers the incorporation of authentic UGC content, features complex image compositions, and integrates practical application tasks, all of which significantly increase evaluation difficulty. Rigorous human annotation and quality filtering ensure high data fidelity. We systematically evaluate 21 state-of-the-art multimodal large models, including both proprietary and open-source vision and video models. Experimental results reveal that even the best proprietary models achieve only limited performance on RealBench, while open-source models trail proprietary ones by 71.8% on average, underscoring RealBench's substantial challenge and its pivotal role in addressing the long-standing gap in Chinese multimodal multi-image understanding evaluation.

📝 Abstract
While various multimodal multi-image evaluation datasets have emerged, they are primarily based on English, and no Chinese multi-image dataset has yet been available. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9,393 samples and 69,910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Finally, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios, and there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context.
Problem

Research questions and friction points this paper is trying to address.

Lack of Chinese multi-image datasets for real-world multimodal evaluation
Need to assess AI models on diverse Chinese visual scenarios
Performance gap between open-source and closed-source multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

First Chinese multimodal multi-image dataset
Uses real user-generated content for relevance
Comprehensive evaluation with 21 multimodal LLMs
Fei Zhao
National Key Laboratory for Novel Software Technology, Nanjing University
Chengqiang Lu
USTC
Yufan Shen
Zhejiang University
Qimeng Wang
Xiaohongshu Inc.
Yicheng Qian
Xiaohongshu Inc.
Haoxin Zhang
Xiaohongshu Inc.
Yan Gao
Xiaohongshu Inc.
Yi Wu
Xiaohongshu Inc.
Yao Hu
Zhejiang University
Zhen Wu
National Key Laboratory for Novel Software Technology, Nanjing University
Shangyu Xing
Master Student, Nanjing University
Xinyu Dai
Nanjing University