🤖 AI Summary
This work addresses the limited generalization capability of LLaVA-NeXT-Interleave in multi-image understanding tasks. We propose Dense Channel Integration (DCI), a lightweight, plug-and-play cross-modal feature-fusion connector that enhances semantic coherence and structured change modeling via dense channel-wise interaction, without modifying the backbone architecture or requiring task-specific retraining. Comprehensive evaluation across 22 benchmark datasets shows that the baseline model already achieves state-of-the-art (SOTA) performance on vision-intensive tasks such as VISION, NLVR2, and Fashion200K. With DCI integration, significant improvements are observed on MIT-States_PropertyCoherence and SlideVQA, providing the first empirical validation of plug-in connectors for multi-image reasoning, document understanding, and interactive multimodal communication. Our approach establishes an efficient, scalable adaptation paradigm for multi-image vision-language models, enabling rapid deployment across diverse multimodal downstream tasks while preserving architectural modularity and computational efficiency.
📝 Abstract
This paper addresses two main objectives. First, we demonstrate the impressive performance of LLaVA-NeXT-Interleave on 22 datasets spanning three task categories: Multi-Image Reasoning; Document and Knowledge-Based Understanding; and Interactive Multi-Modal Communication. Second, we add the Dense Channel Integration (DCI) connector to LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks such as VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding, such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for interleaved multi-image tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.