From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant limitations in jointly reasoning over multiple images and interleaved text, while existing benchmarks lack explicit modeling of interleaved image-text structures and fine-grained single-image–text relationships. To address this, we propose MIR—a novel benchmark enabling progressive multi-image interleaved reasoning. MIR introduces four key innovations: (1) a staged curriculum learning strategy, (2) synthetic construction of multi-image interleaved data, (3) instance-level annotation of reasoning steps, and (4) fine-grained region-level image–text alignment. Extensive experiments demonstrate that MIR consistently enhances cross-image textual association and logical reasoning capabilities across diverse MLLMs. Notably, models fine-tuned on MIR achieve substantial and consistent performance gains not only on MIR itself but also on general multimodal benchmarks (e.g., MMBench, SEED-Bench). This work establishes a new paradigm for complex-scenario multimodal cognition modeling and provides a scalable, principled evaluation framework for interleaved multimodal reasoning.

📝 Abstract
Multi-image Interleaved Reasoning aims to improve Multi-modal Large Language Models' (MLLMs') ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. Current multi-image benchmarks overlook interleaved textual contexts and neglect the distinct relationships between individual images and their associated texts, yet enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce MIR, a novel benchmark requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs' ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an "easy to hard" approach, progressively guiding models from simple to complex scenarios and thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models' reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advances in MLLMs' capability to handle complex inter-modal tasks. Our code and dataset are available at https://github.com/Shelly-coder239/MIRBench.
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLMs' joint comprehension of multiple images with interleaved textual contexts
Addressing the gap in benchmarks that overlook interleaved text-image relationships
Improving cross-modal correlation capture and complex scene understanding capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MIR benchmark for multi-image interleaved reasoning
Uses stage-wise curriculum learning from easy to hard
Proposes reasoning steps to enhance cross-modal comprehension
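The stage-wise "easy to hard" curriculum above can be sketched as a simple staging procedure. This is a hypothetical illustration, not the paper's implementation: the difficulty proxy (image count plus annotated reasoning-step count) and the number of stages are assumptions.

```python
# Hypothetical sketch of an "easy to hard" stage-wise curriculum.
# The difficulty proxy and stage count are assumptions, not the paper's method.
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    images: int           # number of interleaved images in the instance
    reasoning_steps: int  # annotated reasoning-step count

def difficulty(inst: Instance) -> int:
    # Assumed proxy: more images and more reasoning steps -> harder instance.
    return inst.images + inst.reasoning_steps

def curriculum_stages(data: List[Instance], n_stages: int = 3) -> List[List[Instance]]:
    """Split the training set into progressively harder stages."""
    ordered = sorted(data, key=difficulty)
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

Training would then fine-tune on stage k before advancing to stage k+1, so the model sees simple single-association cases before complex cross-image chains.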
Hang Du
Beijing University of Posts and Telecommunications
Jiayang Zhang
AI Research Engineer, The University of Sheffield
Healthcare AI · AI for biomedicine · Multimodal AI
Guoshun Nan
Professor, Beijing University of Posts and Telecommunications
Multimodal Learning · Video LLM · 6G Security · Semantic Communications
Wendi Deng
Beijing University of Posts and Telecommunications
Zhenyan Chen
Beijing University of Posts and Telecommunications
Chenyang Zhang
Beijing University of Posts and Telecommunications
Wang Xiao
Beijing University of Posts and Telecommunications
Shan Huang
Beijing University of Posts and Telecommunications
Yuqi Pan
Beijing University of Posts and Telecommunications
Tao Qi
Tsinghua University
AI Security · Responsible AI
Sicong Leng
Nanyang Technological University
Multi-modal Learning