🤖 AI Summary
Large Vision-Language Models (LVLMs) enforce homogeneous processing of visual and textual embeddings, overlooking their intrinsic heterogeneity—vision inputs are high-dimensional, structured, and context-rich, whereas text inputs are discrete and sequential. Method: We propose Decomposed Attention (D-Attn), the first framework to diagonalize visual self-attention and correct positional bias in text–vision cross-attention, coupled with an α-weighted fusion strategy that harmonizes multimodal representations with minimal modification to the LLM's weights. Our method comprises decomposed causal self-attention, linear-complexity visual token computation (O(|V|)), bias-free positional encoding, and lightweight information fusion. Contribution/Results: On multiple image-understanding benchmarks, D-Attn achieves over 2.1× faster inference while improving accuracy, effectively alleviating the computational bottleneck of visual attention.
📝 Abstract
Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM). However, these inputs are inherently different: visual inputs are multi-dimensional and contextually rich, often pre-encoded by models like CLIP, while textual inputs lack this structure. In this paper, we propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently by decomposing the 1-D causal self-attention in LVLMs. After the attention decomposition, D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual embeddings without compromising performance. Moreover, D-Attn debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, we introduce an $\alpha$-weighting strategy to merge visual and textual information, maximally preserving the pre-trained LLM's capabilities with minimal modifications. Extensive experiments and rigorous analyses validate the effectiveness of D-Attn, demonstrating consistent improvements on multiple image benchmarks while significantly reducing computational costs. Code, data, and models will be publicly available.
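To make the decomposition concrete, the following is a minimal NumPy sketch, not the authors' implementation: a text query attends to text tokens (causal self-attention) and to visual tokens (cross-attention, with no positional encoding applied to visual keys, mimicking the debiased cross-attention), and the two outputs are merged with a hypothetical scalar $\alpha$; diagonalized visual-to-visual attention reduces to the identity on value vectors, which is the source of the $\mathcal{O}(|V|)$ cost. The exact fusion form and helper names here are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diagonal_visual_attention(V_v):
    """Diagonalized visual-to-visual self-attention (sketch): each visual
    token attends only to itself, so the softmax over its single self-score
    is 1 and the output is just its value vector -- O(|V|), not O(|V|^2)."""
    return V_v

def d_attn_text_token(q_t, K_t, V_t, K_v, V_v, alpha):
    """One text query under the decomposition (hypothetical fusion form):
    textual self-attention and textual-to-visual cross-attention are computed
    separately, then merged with a scalar alpha weight."""
    d = q_t.shape[-1]
    # textual-to-textual attention (causality handled by passing only past keys)
    o_tt = softmax(q_t @ K_t.T / np.sqrt(d)) @ V_t
    # textual-to-visual cross-attention; no positional encoding is applied
    # to the visual keys, mimicking the debiased cross-attention
    o_tv = softmax(q_t @ K_v.T / np.sqrt(d)) @ V_v
    # alpha-weighted merge of visual and textual information
    return alpha * o_tv + (1.0 - alpha) * o_tt
```

By linearity of the merge, the output interpolates between a pure-text and a pure-vision attention output as $\alpha$ sweeps from 0 to 1.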