🤖 AI Summary
This work addresses the limited generalization capability of LLaVA-NeXT-Interleave in multi-image understanding tasks. We propose Dense Channel Integration (DCI), a lightweight, plug-and-play cross-modal feature-fusion connector that enhances semantic coherence and structured change modeling via dense channel-wise interaction, without modifying the backbone architecture or requiring task-specific retraining. Comprehensive evaluation across 22 benchmark datasets shows that the baseline model already achieves state-of-the-art (SOTA) performance on vision-intensive tasks such as VISION, NLVR2, and Fashion200K. With DCI integration, significant improvements are observed on MIT-States_PropertyCoherence and SlideVQA, providing the first empirical validation of plug-in connectors for multi-image reasoning, document understanding, and interactive multimodal communication. Our approach establishes an efficient, scalable adaptation paradigm for multi-image vision-language models, enabling rapid deployment across diverse multimodal downstream tasks while preserving architectural modularity and computational efficiency.
📝 Abstract
This paper addresses two main objectives. First, we demonstrate the impressive performance of LLaVA-NeXT-Interleave on 22 datasets spanning three task categories: Multi-Image Reasoning; Document and Knowledge-Based Understanding; and Interactive Multi-Modal Communication. Second, we add the Dense Channel Integration (DCI) connector to LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks such as VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding, such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for interleaved multi-image tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.