Visual Representation Alignment for Multimodal Large Language Models

📅 2025-09-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited performance on vision-intensive tasks—such as object counting and spatial reasoning—largely because coarse-grained, text-only supervision provides little direct guidance for the visual pathway and leads the model to discard fine-grained visual details. To address this, the paper proposes VIRAL, an explicit visual representation alignment strategy that regularizes the MLLM's internal visual representations toward those of pretrained vision foundation models (VFMs). This representation-space regularization helps the model retain fine-grained visual information from its input vision encoder while also drawing complementary visual knowledge from the VFMs. Evaluated on widely used multimodal benchmarks, VIRAL achieves consistent performance gains across diverse vision-intensive tasks, and ablation studies confirm both the efficacy of the alignment mechanism and the soundness of the overall framework design. The core contribution lies in directly injecting the representational capacity of vision foundation models into MLLM training—thereby bridging the gap between language-centric supervision and vision-intensive reasoning.

📝 Abstract
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
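The regularization the abstract describes can be sketched as a penalty that projects the MLLM's visual-token hidden states into the VFM feature space and pulls them toward frozen VFM patch features. The function name, the linear projection, and the (1 − cosine similarity) loss form below are illustrative assumptions for this sketch, not the paper's exact objective:

```python
import numpy as np

def visual_alignment_loss(mllm_hidden, vfm_feats, proj_w):
    """Sketch of a representation-alignment regularizer (assumed form):
    project MLLM visual-token hidden states into the VFM feature space,
    then penalize 1 - cosine similarity against frozen VFM patch features.

    mllm_hidden: (N, D_llm) hidden states at the visual-token positions
    vfm_feats:   (N, D_vfm) frozen patch features from a pretrained VFM
    proj_w:      (D_llm, D_vfm) learnable projection (hypothetical name)
    """
    pred = mllm_hidden @ proj_w                                  # (N, D_vfm)
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt = vfm_feats / np.linalg.norm(vfm_feats, axis=-1, keepdims=True)
    cos = (pred * tgt).sum(axis=-1)                              # per-token similarity
    return float(1.0 - cos.mean())                               # in [0, 2]

# Toy usage with random tensors standing in for real features.
rng = np.random.default_rng(0)
N, D_llm, D_vfm = 16, 64, 32
loss = visual_alignment_loss(rng.standard_normal((N, D_llm)),
                             rng.standard_normal((N, D_vfm)),
                             rng.standard_normal((D_llm, D_vfm)))
```

In training, a term like this would be added to the usual language-modeling objective with a weighting coefficient, so the text supervision and the representation alignment are optimized jointly.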
Problem

Research questions and friction points this paper is trying to address.

Addresses vision-centric task limitations in MLLMs
Aligns visual representations with foundation models
Retains fine-grained visual details for reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns visual representations with pre-trained models
Retains critical visual details from input encoder
Complements additional visual knowledge for reasoning