AI Summary
Existing multilingual vision-language benchmarks suffer from linguistic bias, modality fragmentation, and insufficient safety evaluation. To address these limitations, we introduce PM4Bench, the first multimodal benchmark supporting parallel evaluation across 10 languages, integrated image-text inputs, and multidimensional safety assessment. Its key contributions are: (1) a parallel multilingual design ensuring cross-lingual fairness; (2) a novel visual question answering paradigm that embeds textual content directly within images, thereby strengthening the synergy between OCR and cross-modal reasoning; and (3) a fine-grained taxonomy of safety risks coupled with a consistency-aware evaluation framework. Extensive experiments across 11 state-of-the-art LVLMs reveal substantial cross-lingual performance disparities, with OCR capability identified as the primary bottleneck. The benchmark dataset and evaluation code are publicly released.
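One plausible reading of "consistency-aware" is that a model's answers to the same parallel item should agree across languages. The sketch below illustrates such a metric; the function name, input format, and majority-vote scoring rule are illustrative assumptions, not PM4Bench's published protocol.

```python
from collections import Counter

def cross_lingual_consistency(answers_by_language: dict[str, str]) -> float:
    """Fraction of languages whose answer matches the majority answer
    for the same parallel item. 1.0 means fully consistent across languages.
    (Illustrative metric; not PM4Bench's actual scoring rule.)"""
    counts = Counter(answers_by_language.values())
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(answers_by_language)

# Example: one parallel item answered in three languages.
print(cross_lingual_consistency({"en": "A", "zh": "A", "ar": "B"}))  # ~0.67
```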
Abstract
Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes a vision setting in which text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", in line with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-lingual performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench.
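To make the vision setting concrete: instead of passing the question as text alongside an image, the question itself is rendered into the image, so the model must OCR the prompt before reasoning. A minimal sketch of such a compositing step is shown below; the function, layout, and file paths are illustrative assumptions, not PM4Bench's actual construction pipeline.

```python
from PIL import Image, ImageDraw, ImageFont

def embed_question_in_image(base_image_path: str, question: str) -> Image.Image:
    """Render the question text into a white band below the base image,
    so the model must 'see', 'read', and 'think' from a single visual input.
    (Illustrative sketch; layout details are assumptions.)"""
    base = Image.open(base_image_path).convert("RGB")
    font = ImageFont.load_default()

    # Reserve a band under the original image for the rendered question.
    band_height = 80
    canvas = Image.new("RGB", (base.width, base.height + band_height), "white")
    canvas.paste(base, (0, 0))

    draw = ImageDraw.Draw(canvas)
    draw.multiline_text((10, base.height + 10), question, fill="black", font=font)
    return canvas

# Example: the model receives only this composite image plus a generic instruction.
composite = embed_question_in_image(
    "sample.jpg",  # hypothetical input image
    "What is shown in the picture?\nA) A bridge  B) A tunnel",
)
composite.save("vision_setting_sample.png")
```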