PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

📅 2025-03-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing multilingual vision-language benchmarks suffer from linguistic bias, modality fragmentation, and insufficient safety evaluation. To address these limitations, we introduce PM4Benchβ€”the first multimodal benchmark supporting parallel evaluation across 10 languages, integrated image-text inputs, and multidimensional safety assessment. Its key contributions are: (1) a parallel multilingual design ensuring cross-lingual fairness; (2) a novel visual question answering paradigm that embeds textual content directly within images, thereby strengthening the synergy between OCR and cross-modal reasoning; and (3) a fine-grained taxonomy of safety risks coupled with a consistency-aware evaluation framework. Extensive experiments across 11 state-of-the-art LVLMs reveal substantial cross-lingual performance disparities, with OCR capability identified as the primary bottleneck. The benchmark dataset and evaluation code are publicly released.

πŸ“ Abstract
Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes a vision setting in which text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench.
Problem

Research questions and friction points this paper is trying to address.

Addresses language biases in multilingual LVLM benchmarks
Integrates multimodal inputs for real-world application alignment
Incorporates safety evaluations overlooked in existing benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel corpus design across 10 languages
Vision setting with embedded text and queries
Incorporates safety evaluations for LVLMs
Junyuan Gao
University of Chinese Academy of Sciences
Jiahe Song
SJTU & Shanghai AI Lab, PhD Candidate
Jiang Wu
Shanghai Artificial Intelligence Laboratory
Runchuan Zhu
Peking University
Guanlin Shen
Shanghai Artificial Intelligence Laboratory
Shasha Wang
Shanghai Artificial Intelligence Laboratory
Xingjian Wei
Shanghai AI Lab
Haote Yang
PJLab
Songyang Zhang
Shanghai Artificial Intelligence Laboratory
Weijia Li
Sun Yat-Sen University, Shanghai Artificial Intelligence Laboratory
Bin Wang
Shanghai Artificial Intelligence Laboratory
Dahua Lin
The Chinese University of Hong Kong
Lijun Wu
Shanghai AI Laboratory
Conghui He
Shanghai AI Laboratory