Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

📅 2025-02-04

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

Large Vision-Language Models (LVLMs) enforce homogeneous processing of visual and textual embeddings, overlooking their intrinsic heterogeneity—vision inputs are high-dimensional, structured, and context-rich, whereas text inputs are discrete and sequential. Method: We propose Decomposed Attention (D-Attn), the first framework to diagonalize visual self-attention and correct positional bias in text–vision cross-attention, coupled with an α-weighted fusion strategy that harmonizes multimodal representations without fine-tuning the LLM’s weights. Our method comprises decomposed causal self-attention, linear-complexity visual token computation (O(|V|)), bias-free positional encoding, and lightweight information fusion. Contribution/Results: On multi-image understanding benchmarks, D-Attn achieves over 2.1× faster inference while improving accuracy, effectively alleviating the computational bottleneck of visual attention.

Technology Category

Application Category

📝 Abstract

Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM). However, these inputs are inherently different: visual inputs are multi-dimensional and contextually rich, often pre-encoded by models like CLIP, while textual inputs lack this structure. In this paper, we propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently by decomposing the 1-D causal self-attention in LVLMs. After the attention decomposition, D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $mathcal{O}(|V|^2)$ to $mathcal{O}(|V|)$ for $|V|$ visual embeddings without compromising performance. Moreover, D-Attn debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, we introduce an $alpha$-weighting strategy to merge visual and textual information, maximally preserving the pre-trained LLM's capabilities with minimal modifications. Extensive experiments and rigorous analyses validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs. Code, data, and models will be publicly available.

Problem

Research questions and friction points this paper is trying to address.

Differentiates visual and textual embeddings processing

Reduces computational complexity in vision-language models

Enhances visual understanding by debiasing positional encodings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed Attention reduces computation

D-Attn debiases positional encodings

α-weighting merges visual and textual

🔎 Similar Papers

Better Language Models Exhibit Higher Visual Alignment