MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit insufficient capability in jointly resolving inherent ambiguities in natural language and vision; mainstream benchmarks overlook the potential for cross-modal mutual disambiguation and lack multilingual support. To address this, we propose MUCAR—the first benchmark dedicated to multilingual cross-modal ambiguity resolution—comprising two subsets: multilingual text–image mutual disambiguation and doubly ambiguous pair identification. We introduce the first systematic formalization of dual linguistic and visual ambiguity and establish a novel evaluation paradigm centered on cross-modal mutual clarification. Through controlled ambiguity construction, cross-modal consistency annotation, and zero-shot evaluation across 19 state-of-the-art models, we reveal a substantial performance gap between current MLLMs and human annotators—averaging 42.7% lower accuracy—highlighting a fundamental bottleneck in cross-modal collaborative reasoning.
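As a rough illustration of the evaluation paradigm summarized above, the sketch below shows how zero-shot accuracy per model and the gap to a human baseline could be computed. This is a minimal sketch under assumed interfaces, not the authors' released harness; the field name `gold_answer` and both function names are illustrative assumptions.

```python
# Minimal sketch (assumed interfaces, not the authors' code): zero-shot scoring
# of one model on MUCAR-style items and the accuracy gap to human annotators.
from typing import Callable, Iterable


def zero_shot_accuracy(predict: Callable[[dict], str], items: Iterable[dict]) -> float:
    """Fraction of items whose predicted interpretation matches the gold label.

    `predict` maps an item (text plus image reference) to an answer string;
    `gold_answer` is a hypothetical field name used only for illustration.
    """
    items = list(items)
    correct = sum(1 for item in items if predict(item) == item["gold_answer"])
    return correct / len(items)


def gap_to_human(model_accuracy: float, human_accuracy: float) -> float:
    """Accuracy difference, in percentage points, between human annotators and a model."""
    return 100.0 * (human_accuracy - model_accuracy)
```

Averaging `gap_to_human` over all evaluated models would give the kind of aggregate model-versus-human gap quoted in the summary above.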

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
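To make the two subsets concrete, the sketch below shows one plausible record layout per subset. The class and field names are assumptions for illustration only; the released MUCAR data format may differ.

```python
# Hypothetical item schemas for the two MUCAR subsets described in the abstract.
# Field names are illustrative assumptions, not the benchmark's actual format.
from dataclasses import dataclass, field


@dataclass
class MutualDisambiguationItem:
    """Subset 1: an ambiguous text whose correct reading is fixed by the image."""
    language: str                # the multilingual subset spans several languages
    ambiguous_text: str          # text with at least two readings in isolation
    image_path: str              # image whose content supports exactly one reading
    candidate_readings: list[str] = field(default_factory=list)
    gold_reading_index: int = 0  # reading singled out by the visual context


@dataclass
class DualAmbiguityItem:
    """Subset 2: text and image are each ambiguous alone but unambiguous together."""
    ambiguous_text: str          # ambiguous on its own
    ambiguous_image_path: str    # ambiguous on its own
    candidate_readings: list[str] = field(default_factory=list)
    gold_reading_index: int = 0  # the single interpretation after mutual disambiguation
```

Under this assumed layout, a model would be scored on whether it selects the gold reading given the paired text and image.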
Problem

Research questions and friction points this paper is trying to address.

Resolving multilingual cross-modal ambiguity in MLLMs
Addressing linguistic and visual ambiguities in multimodal contexts
Benchmarking models on dual-ambiguity image-text disambiguation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual dataset resolves ambiguity via visuals
Dual-ambiguity dataset pairs ambiguous images with texts
Benchmark evaluates cross-modal multilingual ambiguity resolution
Xiaolong Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Zhaolu Kang
School of Software & Microelectronics, Peking University, Beijing, China
Wangyuxuan Zhai
Beijing Jiaotong University, Beijing, China
Xinyue Lou
Beijing Jiaotong University, Beijing, China
Yunghwei Lai
Institute for AI Industry Research, Tsinghua University
LLM Agent | AI Healthcare
Ziyue Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Yawen Wang
The University of Texas at Arlington
Gear Dynamics | Noise and Vibration
Kaiyu Huang
Beijing Jiaotong University, Beijing, China
Yile Wang
Shenzhen University
Natural Language Processing
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China