LangBridge: Interpreting Image as a Combination of Language Embeddings

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key challenges in large vision-language models (LVLMs): the opacity of vision-language alignment mechanisms and the poor reusability of adapters across large language models (LLMs). The authors propose LangBridge, an interpretable alignment paradigm grounded in the vocabulary embedding subspace. Unlike conventional shallow MLP adapters, which require two-stage training and are tightly coupled to a specific LLM, LangBridge explicitly maps visual features to linear combinations of LLM vocabulary embeddings, enabling plug-and-play transfer across heterogeneous LLMs without additional pretraining. Experiments show that an adapter trained solely on Qwen2-0.5B transfers zero-shot to diverse LLMs, including LLaMA3-8B and Qwen2.5-14B, with near-lossless performance. LangBridge thus improves both the interpretability and the generalizability of vision-language alignment, offering a model-agnostic, parameter-efficient solution for cross-LLM visual grounding.

📝 Abstract
Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for vision-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever the LLM backbone is switched. To address these limitations, we first investigate the working principles of MLP adapters and discover that they progressively learn to project visual embeddings into subspaces spanned by the corresponding text embeddings. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocabulary embeddings, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at https://LangBridge.github.io/
Problem

Research questions and friction points this paper is trying to address.

Understanding the MLP adapter's role in vision-language alignment
Eliminating the need to retrain the adapter when switching LLM backbones
Achieving interpretable and transferable vision-language alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows that MLP adapters progressively project visual embeddings into subspaces spanned by text embeddings
Maps visual tokens to linear combinations of LLM vocabulary embeddings
Enables pretraining-free adapter transfer across different LLMs
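The core mechanism above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the adapter produces mixture weights over a shared vocabulary, and the resulting weighted sum of a frozen LLM's vocabulary embedding table yields tokens in that LLM's embedding space. All dimensions, names (`bridge`, `W`, `E_a`, `E_b`), and the single-linear-layer adapter are hypothetical simplifications; swapping the embedding table stands in for reusing the adapter with a different LLM backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions (not taken from the paper)
d_vis, vocab = 32, 100        # visual feature dim, shared vocabulary size
d_llm_a, d_llm_b = 64, 128    # hidden sizes of two different LLM backbones

# Adapter parameters: visual feature -> weights over the vocabulary.
# In this sketch these are the only trained parameters.
W = rng.normal(size=(d_vis, vocab)) * 0.1

# Each LLM contributes its own frozen vocabulary embedding table.
E_a = rng.normal(size=(vocab, d_llm_a))
E_b = rng.normal(size=(vocab, d_llm_b))

def bridge(visual_feats, embed_table):
    """Map visual features into an LLM's embedding space as a convex
    combination of that LLM's vocabulary embeddings."""
    alpha = softmax(visual_feats @ W)   # (n_tokens, vocab), rows sum to 1
    return alpha @ embed_table          # (n_tokens, d_llm)

v = rng.normal(size=(5, d_vis))         # 5 visual tokens
tok_a = bridge(v, E_a)                  # tokens in LLM A's embedding space
tok_b = bridge(v, E_b)                  # same adapter reused for LLM B
print(tok_a.shape, tok_b.shape)
```

Because the learned mapping targets vocabulary indices rather than a specific hidden space, the same `W` can in principle be paired with any LLM's embedding table, which is the intuition behind the pretraining-free transfer claimed above.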