AI Summary
Vision-language models (VLMs) excel on single-image tasks but degrade markedly in multi-image reasoning, where salient information must be disentangled from complex cross-image visual features. To address this, we propose the Focus-Centric Visual Chain (FCVC), a novel paradigm that enhances model perception, comprehension, and logical reasoning over multi-image inputs, together with Focus-Centric Data Synthesis, a scalable bottom-up approach used to construct VISC-150K: a large-scale multi-image reasoning dataset with elaborate reasoning paths in the FCVC format. Evaluated across seven mainstream multi-image benchmarks, fine-tuning on VISC-150K yields average accuracy improvements of 3.16% on Qwen-VL and 2.24% on LLaVA-OneVision, without compromising general single-image capabilities. This work establishes a scalable framework for multi-image reasoning and advances vision-language understanding beyond isolated image processing.
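Concretely, one way to picture a Focus-Centric Visual Chain record is as an ordered sequence of focus steps, each grounding a single reasoning step in one of the input images. The sketch below is a minimal, hypothetical illustration; the field names and structure are assumptions for exposition, not the paper's actual schema.

```python
# Hypothetical sketch of a VISC-150K-style training record: the answer is
# reached through an ordered chain of "focus" steps, each tying one piece
# of reasoning to evidence in a specific input image.
from dataclasses import dataclass, field

@dataclass
class FocusStep:
    image_index: int   # which input image this step attends to
    focus: str         # the salient visual evidence extracted from that image
    inference: str     # the reasoning step drawn from that evidence

@dataclass
class VisualChainExample:
    images: list[str]  # the multi-image input (paths or URLs)
    question: str
    chain: list[FocusStep] = field(default_factory=list)
    answer: str = ""

example = VisualChainExample(
    images=["img_0.jpg", "img_1.jpg"],
    question="Which scene was photographed later in the day?",
    chain=[
        FocusStep(0, "long shadows, low sun angle", "img_0 was taken near sunset"),
        FocusStep(1, "overhead sun, short shadows", "img_1 was taken around noon"),
    ],
    answer="The first image.",
)
```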
Abstract
Vision-language models (VLMs) achieve remarkable success on single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Using this approach, we construct VISC-150K, a large-scale dataset of reasoning data in the form of Focus-Centric Visual Chains, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.
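For the data-construction side, the following is a minimal sketch of the bottom-up synthesis idea, reusing FocusStep and VisualChainExample from the sketch above: ground one observation per image, compose a cross-image question from them, and keep only candidates that pass verification. The three helper functions are hypothetical stubs standing in for model-backed components; the paper's actual pipeline is not specified here.

```python
def describe_salient(image: str) -> str:
    # Stand-in for a captioning model that extracts one salient observation.
    return f"salient content of {image}"

def compose_question(steps: list[FocusStep]) -> tuple[str, str]:
    # Stand-in for a generator that composes a cross-image question whose
    # answer requires combining every per-image observation.
    return "Which image matches the description?", "image 0"

def verify_chain(candidate: VisualChainExample) -> bool:
    # Stand-in for an automatic consistency/quality filter.
    return len(candidate.chain) == len(candidate.images)

def synthesize_example(images: list[str]) -> VisualChainExample | None:
    # Bottom-up: first ground one focus step in each individual image...
    steps = [FocusStep(i, describe_salient(img), inference="")
             for i, img in enumerate(images)]
    # ...then compose a question that can only be answered by chaining them.
    question, answer = compose_question(steps)
    candidate = VisualChainExample(list(images), question, steps, answer)
    # Keep only candidates that pass verification, so synthesis scales
    # without sacrificing quality.
    return candidate if verify_chain(candidate) else None

print(synthesize_example(["img_0.jpg", "img_1.jpg"]))
```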