More Images, More Problems? A Controlled Analysis of VLM Failure Modes

📅 2026-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large vision-language models struggle to aggregate cross-image information and to track multiple concepts in multi-image inputs, which limits their reasoning capabilities. To address this, the authors introduce MIMIC, the first controllable evaluation benchmark designed specifically for multi-image understanding, which systematically reveals the failure modes of existing models. They further propose a procedural multi-image data synthesis approach, an attention-mask mechanism tailored to multi-image inputs, and a multi-image instruction-tuning strategy. Experimental results demonstrate that the proposed methods significantly enhance the model's ability to integrate cross-image information, outperforming state-of-the-art approaches across multiple benchmarks.

📝 Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. Experiments show that these remedies substantially improve cross-image aggregation, while also enhancing performance on existing multi-image benchmarks and outperforming the prior state of the art across tasks. Data and code will be made available at https://github.com/anurag-198/MIMIC.
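The abstract does not spell out the composition procedure, but the core idea of procedurally turning single-image annotations into multi-image training examples can be sketched as follows. This is a minimal, hypothetical illustration: the function name, data layout, and counting-question template are assumptions, not taken from the paper.

```python
import random

def compose_multi_image_example(annotated_images, rng=None):
    """Compose single-image annotations into one multi-image training
    example. Hypothetical sketch: the question is answerable only by
    aggregating object counts across all sampled images."""
    rng = rng or random.Random(0)
    k = rng.choice([2, 3])                  # number of images per example
    sample = rng.sample(annotated_images, k)
    # Pick a target object that appears in at least one sampled image.
    target = rng.choice([o for img in sample for o in img["objects"]])
    count = sum(img["objects"].count(target) for img in sample)
    return {
        "images": [img["path"] for img in sample],
        "question": f"Across these {k} images, how many instances of "
                    f"'{target}' appear in total?",
        "answer": str(count),
    }

# Toy bank of single-image annotations (paths and labels are illustrative).
bank = [
    {"path": "a.jpg", "objects": ["dog", "ball"]},
    {"path": "b.jpg", "objects": ["dog"]},
    {"path": "c.jpg", "objects": ["cat", "ball", "dog"]},
]
example = compose_multi_image_example(bank)
```

By construction, the ground-truth answer requires aggregating evidence from every sampled image, which is exactly the cross-image skill the diagnostics found lacking.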
Problem

Research questions and friction points this paper is trying to address.

Large vision-language models
Multi-image understanding
Failure modes
Cross-image reasoning
Visual language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-image reasoning
Vision-language models
Attention masking
Procedural data generation
Failure mode analysis
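To make the attention-masking idea concrete: one plausible mask for multi-image inputs restricts image tokens to their own image and the text, while text tokens attend everywhere. The rule below is an illustrative assumption, not the paper's actual layer-wise scheme, which is not specified on this page.

```python
def multi_image_attention_mask(segment_ids, text_segment=0):
    """Boolean attention mask for a mixed text/multi-image token sequence.
    Illustrative rule (an assumption, not the paper's exact scheme):
    image tokens attend only within their own image and to text tokens,
    while text tokens may attend to every token."""
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            same_image = segment_ids[q] == segment_ids[k]
            involves_text = text_segment in (segment_ids[q], segment_ids[k])
            mask[q][k] = same_image or involves_text
    return mask

# 0 = text tokens, 1 = image-1 tokens, 2 = image-2 tokens
mask = multi_image_attention_mask([0, 0, 1, 1, 2, 2])
```

Under this rule, cross-image interaction is routed through the text tokens, which keeps per-image representations separate while still allowing aggregation.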
👥 Authors
Anurag Das
MPI for Informatics, Saarland Informatics Campus
Adrian Bulat
Samsung AI, Cambridge
Alberto Baldrati
PhD student, University of Florence, University of Pisa
Ioannis Maniadis Metaxas
Samsung AI, Cambridge
B. Schiele
MPI for Informatics, Saarland Informatics Campus
Georgios Tzimiropoulos
Samsung AI, Cambridge
Brais Martínez
Samsung AI, Cambridge