CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal multi-hop reasoning (CMR) benchmarks suffer from two critical limitations: exclusion of the speech modality and severe imbalance in reasoning path distributions, leading to biased evaluation. To address these issues, we introduce CMR-SPB, the first comprehensive benchmark covering the text, image, and speech modalities with balanced sampling of multi-hop reasoning paths, and propose ECV (Extract, Connect, Verify), a prompting framework that explicitly decouples and calibrates each reasoning step. Systematic experiments uncover consistent failures of mainstream multimodal large language models (MLLMs) on specific path types and demonstrate that path imbalance significantly distorts model rankings. ECV substantially narrows the performance gap across paths, yielding an average improvement of 12.7%, thereby improving model generalization and evaluation fairness. This work establishes a new paradigm for robust and equitable assessment of multimodal reasoning capabilities.

📝 Abstract
Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark -- Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) -- designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.
Problem

Research questions and friction points this paper is trying to address.

Addresses biased reasoning path distributions in multimodal evaluation
Introduces benchmark for tri-modal text-image-speech reasoning
Proposes technique to mitigate performance gaps across paths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tri-modal benchmark integrating text, image, and speech
Balanced reasoning paths to eliminate evaluation bias
ECV prompting technique for improved cross-modal reasoning
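The abstract describes ECV (Extract, Connect, Verify) only at a high level: decouple per-modality fact extraction, chain the facts into an explicit reasoning path, then check each hop before answering. A minimal sketch of such a three-stage pipeline might look as follows; the `ask` callable stands in for any MLLM API call, and the prompt wording is an illustrative assumption, not the paper's published template.

```python
from typing import Callable, Dict

def ecv_answer(question: str, modal_inputs: Dict[str, str],
               ask: Callable[[str], str]) -> str:
    """Hypothetical ECV-style pipeline: Extract -> Connect -> Verify."""
    # 1. Extract: pull question-relevant facts from each modality separately,
    #    so no single prompt has to fuse text, image, and speech at once.
    facts = {
        name: ask(f"Extract facts relevant to '{question}' "
                  f"from this {name} input:\n{content}")
        for name, content in modal_inputs.items()
    }
    # 2. Connect: chain the per-modality facts into an explicit reasoning path.
    fact_list = "\n".join(f"[{m}] {f}" for m, f in facts.items())
    chain = ask("Connect these facts into a step-by-step reasoning path.\n"
                f"{fact_list}\nQuestion: {question}")
    # 3. Verify: check each hop against the extracted facts before answering.
    return ask("Verify each step against the facts, revise any unsupported "
               f"hop, then give the final answer.\n"
               f"Reasoning: {chain}\nQuestion: {question}")
```

Decoupling the stages this way makes each hop of the reasoning path explicit and auditable, which is plausibly why the paper reports it narrowing the performance gap across path types.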
Seunghee Kim, Hanyang University
Ingyu Bang, Hanyang University
Seokgyu Jang, Hanyang University
Changhyeon Kim, Samsung Research (visual navigation · SLAM · visual-LiDAR fusion)
Sanghwan Bae, NAVER Cloud
Jihun Choi, Sony AI
Richeng Xuan, Beijing Academy of Artificial Intelligence
Taeuk Kim, Assistant Professor, Hanyang University (Natural Language Processing · Large Language Models · Machine Learning)