OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 2
✨ Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) rely solely on textual supervision during training, leaving their visual perception and spatial reasoning weaker than embodied AI applications require. To address this, the paper proposes latent visual representation distillation: high-fidelity visual representations from frozen target encoders are distilled directly into intermediate transformer layers of the LLM, jointly optimizing visual embedding prediction and language modeling without architectural modifications at inference. Key contributions include: (1) applying knowledge distillation to the LLM's latent layers to enhance visual understanding; (2) empirically demonstrating a positive correlation between latent-layer visual representation quality and downstream task performance; and (3) outperforming both single- and multi-encoder baselines, with average gains of up to 2.5% across benchmarks and a notable 8.7% improvement on the Depth task in CV-Bench.

πŸ“ Abstract
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., solely natural language supervision is sub-optimal for the MLLM's visual understanding ability. To that end, we propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations. Firstly, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction. Secondly, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization. Thirdly, we demonstrate that our OLA-VLM outperforms the single and multi-encoder baselines, proving our approach's superiority over explicitly feeding the corresponding features to the LLM. Particularly, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM.
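The coupled pretraining objective described in the abstract can be sketched as a weighted sum of the standard next-token loss and an embedding-prediction loss on an intermediate layer. The minimal NumPy sketch below uses illustrative shapes, a smooth-L1 embedding distance, a single projection matrix as a stand-in for a learned head, and a weight `w_embed`; none of these specifics are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_l1(pred, target, beta=1.0):
    """Huber-style distance between predicted and target embeddings."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def cross_entropy(logits, labels):
    """Standard next-text-token cross-entropy over the vocabulary."""
    flat = logits.reshape(-1, logits.shape[-1])
    flat = flat - flat.max(axis=-1, keepdims=True)
    logp = flat - np.log(np.exp(flat).sum(axis=-1, keepdims=True))
    return -logp[np.arange(flat.shape[0]), labels.reshape(-1)].mean()

def coupled_loss(hidden, W_proj, target_emb, logits, labels, w_embed=0.5):
    """Coupled objective: language modeling plus embedding prediction,
    where an intermediate-layer hidden state is projected into the
    frozen vision encoder's target embedding space."""
    embed_loss = smooth_l1(hidden @ W_proj, target_emb)
    return cross_entropy(logits, labels) + w_embed * embed_loss

# Illustrative shapes: batch=2, 8 positions, d_model=32, d_target=16, vocab=50.
hidden = rng.standard_normal((2, 8, 32))      # intermediate LLM layer states
W_proj = rng.standard_normal((32, 16)) * 0.1  # stand-in for a learned head
target = rng.standard_normal((2, 8, 16))      # frozen vision-encoder embeddings
logits = rng.standard_normal((2, 8, 50))
labels = rng.integers(0, 50, size=(2, 8))
loss = coupled_loss(hidden, W_proj, target, logits, labels)
print(loss)  # scalar combining both objectives
```

Because the distillation term only adds auxiliary heads during training, inference can drop them and keep the base MLLM architecture unchanged.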
Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs prioritize language over visual perception signals
Current approaches undermine spatial reasoning for embodied AI tasks
Need to optimize both visual perception and language comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Infuses visual perception knowledge into LLM hidden representations
Optimizes predictive visual embedding and next token prediction
Improves visual representation quality through embedding distillation
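The correlation between hidden-representation quality and downstream performance can be illustrated with a simple linear probe: fit a linear map from each layer's hidden states to the frozen encoder's embeddings and compare residuals across layers. This is a hypothetical sketch on random data, not the paper's actual probing protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_probe_error(hidden, target):
    """Fit a least-squares linear map from a layer's hidden states to the
    frozen encoder's target embeddings and return the relative residual.
    Lower values suggest the layer encodes the target visual
    representation more faithfully."""
    H = hidden.reshape(-1, hidden.shape[-1])   # [N, d_model]
    T = target.reshape(-1, target.shape[-1])   # [N, d_target]
    W, *_ = np.linalg.lstsq(H, T, rcond=None)  # [d_model, d_target]
    return np.linalg.norm(H @ W - T) / np.linalg.norm(T)

# Toy data: four "layers" of hidden states probed against one target space.
layers = [rng.standard_normal((4, 16, 32)) for _ in range(4)]
target = rng.standard_normal((4, 16, 16))
scores = [linear_probe_error(h, target) for h in layers]
print(scores)  # one relative error per layer
```

Tracking such per-layer scores before and after distillation is one way to check whether the embedding optimization actually improved the latent visual representations.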