Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from modality imbalance: visual representations are underutilized in deep layers, which degrades visual understanding and exacerbates hallucination. To address this, we propose LaVer, a training framework that, for the first time, applies masked image modeling (MIM) and latent-variable-driven visual reconstruction directly within the joint latent semantic space of the LLM, injecting explicit, end-to-end visual supervision into the LLM's hidden states. LaVer jointly optimizes vision-language representations through cross-modal latent alignment and a decoder-side visual reconstruction loss. Evaluated on multiple dense visual-understanding benchmarks, LaVer significantly outperforms state-of-the-art methods, mitigating hallucination while enhancing visual discriminability, perceptual accuracy, and model robustness.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers. To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that helps MLLMs learn more discriminative visual representations via masked image modeling in the joint latent semantic space of the LLM. Our method provides direct visual activation to MLLMs, which then exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks demonstrate the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code for LaVer is available at https://github.com/Fir-lat/LaVer.
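The paper's actual loss and architecture are not given here, but the general masked-reconstruction pattern the abstract describes (mask a subset of visual tokens in the latent space, reconstruct them with a decoder, and penalize reconstruction error only on the masked positions) can be sketched in plain NumPy. Everything below is illustrative: the function name, the zero-vector mask token, the toy linear decoder, and the MSE objective are assumptions standing in for LaVer's trained components.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_latent_reconstruction_loss(visual_hidden, mask_ratio=0.5, rng=rng):
    """Toy MIM-style objective in a latent space.

    visual_hidden: (n_tokens, d) array of visual token hidden states.
    Randomly masks a fraction of tokens, "reconstructs" all tokens with
    a fixed linear decoder (a stand-in for a trained decoder head), and
    returns the mean squared error on the masked positions only.
    """
    n_tokens, d = visual_hidden.shape
    n_mask = max(1, int(n_tokens * mask_ratio))
    mask_idx = rng.choice(n_tokens, size=n_mask, replace=False)

    # Replace masked tokens with a learnable mask embedding
    # (here simply a zero vector for illustration).
    masked = visual_hidden.copy()
    masked[mask_idx] = np.zeros(d)

    # Toy decoder: a random linear map; in practice this would be the
    # trained decoder operating on the LLM's joint latent space.
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    recon = masked @ W

    # Supervise only the masked positions against the original latents.
    return float(np.mean((recon[mask_idx] - visual_hidden[mask_idx]) ** 2))
```

The key design point the abstract emphasizes is that this supervision targets the LLM's own hidden space rather than raw pixels, so the gradient signal directly shapes the visual representations that deep layers otherwise let collapse.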
Problem

Research questions and friction points this paper is trying to address.

Addresses modality imbalance in MLLMs
Enhances visual representation via masked modeling
Improves dense visual task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Visual Reconstruction training framework
Masked image modeling in joint latent space
Direct visual activation for enhanced attention