🤖 AI Summary
Existing multimodal reasoning benchmarks are heavily English-centric and lack systematic evaluation of high-resource non-English languages, particularly Arabic, across diverse visual and textual modalities.
Method: We introduce ARB, the first Arabic-centric multimodal stepwise reasoning benchmark, covering 11 domains including visual reasoning, document understanding, and OCR. It comprises 1,356 samples paired with 5,119 human-annotated reasoning steps. We propose a structured evaluation framework grounded in these reasoning chains, with a three-dimensional scoring rubric (coherence, faithfulness, and cultural adaptability) and open-source evaluation tools for reproducible diagnostics.
Results: Experiments on 12 state-of-the-art multimodal large language models reveal pervasive cultural misalignment and logical fragmentation in Arabic multimodal reasoning. ARB is publicly released, filling a critical gap in fine-grained, non-English multimodal reasoning evaluation and advancing the development of trustworthy multilingual AI.
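To make the three-dimensional rubric concrete, here is a minimal sketch of how per-step scores could be aggregated into chain-level results. The `StepScore` class, the 0-to-1 scale, and the mean aggregation are illustrative assumptions, not the paper's actual rubric or scale.

```python
from dataclasses import dataclass

@dataclass
class StepScore:
    """Hypothetical rubric scores for one reasoning step (0.0-1.0 scale assumed)."""
    coherence: float              # logical flow from the preceding step
    faithfulness: float           # grounding in the visual/textual evidence
    cultural_adaptability: float  # fit with Arabic linguistic and cultural context

def score_chain(steps: list[StepScore]) -> dict[str, float]:
    """Aggregate per-step scores into chain-level averages (illustrative mean pooling)."""
    n = len(steps)
    return {
        "coherence": sum(s.coherence for s in steps) / n,
        "faithfulness": sum(s.faithfulness for s in steps) / n,
        "cultural_adaptability": sum(s.cultural_adaptability for s in steps) / n,
    }

# Example: a two-step reasoning chain
chain = [StepScore(0.9, 0.8, 1.0), StepScore(0.7, 0.9, 0.6)]
print(score_chain(chain))
```

Scoring each step separately, rather than only the final answer, is what lets a framework like this localize where a model's reasoning breaks down (e.g. a faithful but culturally misaligned step).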
📝 Abstract
As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: https://github.com/mbzuai-oryx/ARB