Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of multimodal large language models to rely less and less on visual input during long-form text generation, which often leads to semantic inconsistency and hallucination. To mitigate this issue, the authors propose Vision Inference Former (VIF), a lightweight, end-to-end trainable module that continuously injects original visual representations into the decoding stage, thereby establishing a direct pathway from visual inputs to the output space. Conventional connectors treat visual and textual tokens equally; VIF instead dynamically reinforces visual guidance without significantly increasing computational overhead. The method shows consistent performance gains across 14 diverse benchmarks spanning general reasoning, OCR, table understanding, and hallucination evaluation, indicating its effectiveness and broad compatibility with mainstream architectures.

📝 Abstract

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) as generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.
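
The abstract does not spell out how VIF is wired internally, so the following is only a minimal sketch of the general idea of injecting visual features at every decoding step. Everything in it is an assumption made for illustration: the class name `VisualInjectionSketch`, the single-head cross-attention, the fixed scalar `gate`, and the choice to add the visual term to the logits are hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector (length n) by an n x m matrix, giving length m."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Random Gaussian weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class VisualInjectionSketch:
    """Hypothetical illustration only; NOT the authors' VIF module.

    The current decoder hidden state attends over the original visual
    features, and the attended context is mapped straight to vocabulary
    logits and added to the language model's own logits.
    """

    def __init__(self, d_model, d_vis, vocab_size, gate=0.1, seed=0):
        rng = random.Random(seed)
        self.w_q = rand_matrix(rng, d_model, d_model)      # query projection
        self.w_k = rand_matrix(rng, d_vis, d_model)        # key projection
        self.w_v = rand_matrix(rng, d_vis, d_model)        # value projection
        self.w_out = rand_matrix(rng, d_model, vocab_size) # context -> logits
        self.gate = gate  # assumed fixed scalar; a real module would learn this

    def visual_logits(self, hidden, vis_feats):
        """Cross-attend from one hidden state to all visual feature vectors."""
        q = matvec(hidden, self.w_q)
        keys = [matvec(f, self.w_k) for f in vis_feats]
        vals = [matvec(f, self.w_v) for f in vis_feats]
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        ctx = [sum(w * v[j] for w, v in zip(attn, vals)) for j in range(len(q))]
        return matvec(ctx, self.w_out)

    def step(self, base_logits, hidden, vis_feats):
        """Return the base logits plus the gated visual contribution."""
        vis = self.visual_logits(hidden, vis_feats)
        return [b + self.gate * v for b, v in zip(base_logits, vis)]
```

In a decoding loop under these assumptions, `step` would be called once per generated token with the same `vis_feats` each time, so the visual contribution does not depend on how far the visual tokens have receded in the context window.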

Problem

Research questions and friction points this paper is trying to address.

multimodal large language models

visual consistency

vision-language alignment

visual semantics

hallucination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Inference Former

multimodal large language models

visual consistency