🤖 AI Summary
Large vision-language models (VLMs) built on cascade architectures suffer from insufficient vision–language alignment, leading to weak image-text matching discrimination and frequent object hallucination. To address this, we propose AGE-VLM, a framework that, for the first time, integrates fine-grained spatial priors extracted by the Segment Anything Model (SAM) into a lightweight VLM. It employs an interleaved cross-attention mechanism to enhance visual grounding, coupled with visual feature distillation and attention-guided alignment. The method markedly improves small models' focus on salient image regions, matching or outperforming state-of-the-art efficient VLMs on multiple vision-centric benchmarks: hallucination rates drop noticeably and image-text matching accuracy improves substantially. Our core contribution is the controllable injection of external spatial-perception knowledge into the multimodal attention process, establishing a new paradigm for robust, efficient VLM alignment.
📝 Abstract
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs) to integrate visual and textual information. This paper presents a comprehensive analysis of attention patterns in efficient VLMs, revealing that concatenation-based architectures frequently fail to distinguish between semantically matching and non-matching image-text pairs, a key factor behind object hallucination in these models. To address this, we introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers, instilling vision capabilities in pretrained small language models. This equips the VLM with the ability to "look" at the correct image regions by leveraging spatial knowledge distilled from the Segment Anything Model (SAM), significantly reducing hallucination. We validate our approach on several vision-centric benchmarks, where our method matches or outperforms prior work on efficient VLMs. Our findings provide valuable insights for future research aimed at achieving enhanced visual and linguistic understanding in VLMs.
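To make the mechanism concrete, the following is a minimal sketch, under our own assumptions, of one interleaved cross-attention step of the kind the abstract describes: language-token hidden states act as queries over spatial visual features (e.g., features distilled from SAM), and the attended result is added back residually. All names, dimensions, and the single-head formulation are illustrative and are not taken from the paper's actual implementation.

```python
# Hypothetical single-head cross-attention sketch: text tokens attend over
# spatial visual features. Not the paper's implementation; shapes and names
# are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, vis_h, Wq, Wk, Wv):
    """text_h: (T, d) language hidden states; vis_h: (P, d) visual features."""
    q = text_h @ Wq                            # (T, d) queries from text tokens
    k = vis_h @ Wk                             # (P, d) keys from visual patches
    v = vis_h @ Wv                             # (P, d) values from visual patches
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, P) scaled attention logits
    attn = softmax(scores, axis=-1)            # each text token attends over patches
    return text_h + attn @ v                   # residual add, as in standard blocks

rng = np.random.default_rng(0)
d, T, P = 16, 4, 9                             # hidden size, text tokens, visual patches
out = cross_attention(
    rng.normal(size=(T, d)), rng.normal(size=(P, d)),
    *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)),
)
print(out.shape)  # (4, 16)
```

In a full model, layers like this would be interleaved between the language model's self-attention blocks, so visual grounding is injected repeatedly rather than only at the input, which is the design the abstract contrasts with concatenation-based architectures.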