🤖 AI Summary
Problem: Mainstream multimodal large language models (MLLMs) rely on a vision projector to align the visual and textual modalities. Existing approaches, however, treat visual embeddings merely as contextual cues and apply autoregressive supervision only on the text side. The visual embeddings themselves are never optimized directly, which limits the accuracy of semantic alignment.
Method: We propose BASIC, an intrinsic visual embedding supervision paradigm that, for the first time, uses the refined visual embeddings from the LLM's shallow layers to guide vision projector training via backpropagation. It establishes two alignment objectives: semantic directional alignment, which reduces the angular distance between initial and supervisory embeddings, and logit distribution matching, which minimizes the KL divergence between their logit distributions.
Contribution/Results: Our method requires no additional annotations or auxiliary models. It achieves significant performance gains across multiple mainstream multimodal benchmarks, empirically validating that explicit supervision of visual embeddings substantially enhances cross-modal semantic consistency.
📝 Abstract
Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and apply auto-regressive supervision only to textual outputs. They introduce no equivalent direct visual supervision, which prevents finer alignment of the visual embeddings. In this paper, based on our analysis of how visual embeddings are refined in the LLM's shallow layers, we propose BASIC, a method that uses the refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance operates from two perspectives: (i) optimizing embedding directions by reducing the angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing the disparity between the logit distributions of the two sets of visual embeddings. Without additional supervisory models or manual annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of the direct visual supervision we introduce.
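The two supervision signals described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how logits are computed or which direction the divergence takes, so the sketch assumes logits are obtained by projecting each visual embedding onto a text vocabulary embedding matrix (`vocab_emb`, hypothetical), uses `1 - cosine similarity` as a stand-in for the angle between embeddings, and takes the supervisory distribution as the KL reference. All function and variable names are illustrative.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def direction_loss(initial, target, eps=1e-8):
    """Mean (1 - cosine similarity) between paired embeddings.

    Assumption: a cosine-based proxy for "reducing angles"; the
    paper's exact angular objective may differ.
    """
    dot = (initial * target).sum(axis=-1)
    norms = np.linalg.norm(initial, axis=-1) * np.linalg.norm(target, axis=-1)
    return float(np.mean(1.0 - dot / (norms + eps)))


def logit_kl_loss(initial, target, vocab_emb):
    """Mean KL(p_target || p_initial) over visual tokens.

    Assumption: logits come from a dot product with a vocabulary
    embedding matrix of shape (vocab_size, dim).
    """
    log_p = log_softmax(target @ vocab_emb.T)   # supervisory distribution
    log_q = log_softmax(initial @ vocab_emb.T)  # projector distribution
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())


# Toy data: 4 visual tokens, embedding dim 16, vocabulary of 32 entries.
rng = np.random.default_rng(0)
initial = rng.normal(size=(4, 16))    # projector outputs (trainable side)
target = rng.normal(size=(4, 16))     # shallow-layer embeddings (held fixed)
vocab_emb = rng.normal(size=(32, 16))

total = direction_loss(initial, target) + logit_kl_loss(initial, target, vocab_emb)
```

In an actual training loop the supervisory embeddings would be treated as constants (no gradient flows into them), so only the projector is updated, and the two terms would typically be weighted and added to the usual auto-regressive text loss. Those weights are not given in the abstract.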