🤖 AI Summary
This work identifies and formally characterizes a previously overlooked issue in multimodal large language models: the degradation of visual representations in intermediate layers due to over-optimization toward text generation objectives, which compromises both global semantics and local structural fidelity. To address this, the authors propose a prediction regularization method that introduces feature reconstruction loss at intermediate layers, compelling the model to preserve initial visual features and thereby maintain robust internal visual capabilities. Notably, this approach requires no architectural modifications and seamlessly integrates into standard pretraining and fine-tuning pipelines. Extensive experiments demonstrate significant performance gains across diverse vision-language tasks, underscoring the critical role of well-preserved internal visual representations in effective multimodal understanding.
📝 Abstract
While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost that their language-driven training imposes on internal visual competence remains unclear. In this paper, we conduct a detailed diagnostic analysis and unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representations in the middle layers of the LLM degrade in both global semantics and patch-level structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective: the model compromises visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and we propose Predictive Regularization (PRe), which forces degraded intermediate features to predict the initial visual features, thereby preserving the inherent visual attributes of the MLLM's internal representations. Extensive experiments confirm that mitigating this visual degradation effectively boosts vision-language performance, underscoring the critical importance of fostering robust internal visual representations within MLLMs for comprehensive multimodal understanding.
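The core mechanism described above, an auxiliary loss that forces intermediate-layer visual tokens to reconstruct the initial visual features, can be sketched as follows. This is a minimal illustration, not the paper's exact design: the module name `PredictiveRegularizer`, the two-layer prediction head, the MSE reconstruction objective, and the loss weight `lambda_pre` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveRegularizer(nn.Module):
    """Illustrative sketch of Predictive Regularization (PRe).

    A small head maps intermediate-layer visual token states back to the
    initial visual features; the reconstruction error serves as an
    auxiliary loss added to the usual language-modeling objective.
    (Head architecture and loss choice are assumptions, not the paper's.)
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Lightweight two-layer prediction head (assumed design).
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, intermediate_visual: torch.Tensor,
                initial_visual: torch.Tensor) -> torch.Tensor:
        # Predict the pre-degradation features from the intermediate ones.
        # The target is detached so gradients flow only through the head
        # and the intermediate representations being regularized.
        predicted = self.head(intermediate_visual)
        return F.mse_loss(predicted, initial_visual.detach())

# Toy usage: a batch of 2 images, 8 visual tokens, 16-dim hidden states.
pre = PredictiveRegularizer(hidden_dim=16)
intermediate = torch.randn(2, 8, 16, requires_grad=True)
initial = torch.randn(2, 8, 16)
lambda_pre = 0.1  # assumed weight; total loss = LM loss + lambda_pre * PRe loss
reg_loss = lambda_pre * pre(intermediate, initial)
reg_loss.backward()  # gradients reach the intermediate features
```

Because the regularizer is just an extra loss term on existing hidden states, it requires no architectural changes to the MLLM itself, matching the paper's claim that PRe drops into standard pretraining and fine-tuning pipelines.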