Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the visual fading problem in multimodal large language models (MLLMs) under long-context scenarios, where growing distances between visual and textual tokens attenuate cross-modal attention. To mitigate this, the authors propose DIPE, which they present as the first mechanism to decouple intra-modal from inter-modal position encoding: intra-modal interactions keep their natural relative positions to preserve local structure, while inter-modal interactions receive an anchor-aware proximity that removes the distance-based penalty. Built on a multimodal adaptation of RoPE, DIPE enables distance-invariant cross-modal positional modeling, substantially alleviating visual fading in long contexts and sustaining robust visual grounding without compromising performance on standard short-context benchmarks.
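Neither the summary nor the abstract pins down an implementation, but the stated rule — natural relative positions within a modality, a fixed anchored offset across modalities — can be sketched as a per-pair relative-position table. The function name `dipe_relative_positions`, the default `anchor_offset = 1`, and the per-pair formulation itself are illustrative assumptions, not the authors' code (see their repository for the actual method):

```python
import torch

def dipe_relative_positions(positions: torch.Tensor,
                            is_visual: torch.Tensor,
                            anchor_offset: int = 1) -> torch.Tensor:
    """Hypothetical per-pair relative positions under a DIPE-style scheme.

    positions : (N,) integer position ids in sequence order.
    is_visual : (N,) bool mask, True for visual tokens.
    Returns an (N, N) matrix rel[m, n] = effective relative position used
    when query token m attends to key token n.
    """
    # Intra-modal pairs keep the natural relative position, so local
    # structure (word order, image layout) is encoded as in plain RoPE.
    rel = positions[:, None] - positions[None, :]

    # Inter-modal pairs (text query on visual key, or vice versa) are
    # pinned to a fixed anchored offset, so the rotation angle -- and
    # hence the attention score -- no longer decays with context length.
    inter = is_visual[:, None] ^ is_visual[None, :]
    return torch.where(inter, torch.full_like(rel, anchor_offset), rel)
```

With an image's tokens followed by thousands of text tokens, every text↔image entry of `rel` stays at the anchored offset no matter how long the text grows, while intra-modal entries keep their true distances.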

📝 Abstract
Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
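As a quick numeric illustration of the distance-based penalty the abstract describes, the sketch below uses plain 1-D RoPE rather than the paper's Multimodal RoPE, and an arbitrary anchored offset of 1; both are simplifying assumptions. It shows that the rotated query–key score varies with token separation under vanilla RoPE, whereas rotating the query at a fixed anchored position gives the same cross-modal score at any distance:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: float, base: float = 10000.0) -> torch.Tensor:
    """Standard 1-D RoPE: rotate pairs of dims of x by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * inv_freq                      # (d/2,)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
q, k = torch.randn(64), torch.randn(64)

# Vanilla RoPE: the score between a text query and a visual key fixed at
# position 0 changes with their separation -- the inductive bias the paper
# blames for visual fading.
for dist in (8, 512, 8192):
    score = rope_rotate(q, float(dist)) @ rope_rotate(k, 0.0)
    print(f"distance {dist:5d}: score = {score.item():+.4f}")

# Anchored inter-modal encoding: the query is always rotated as if it sat
# one step from the key, so the cross-modal score is distance-invariant.
print(f"anchored: score = {(rope_rotate(q, 1.0) @ rope_rotate(k, 0.0)).item():+.4f}")
```

The anchored score equals what vanilla RoPE would give at distance 1 regardless of where the text token actually sits, which is one way to read the abstract's "anchored perceptual proximity" for inter-modal interactions.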
Problem

Research questions and friction points this paper is trying to address.

Visual Fading
Multimodal Large Language Models
Long-Context
Inter-Modal Attention
Position Encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distance Invariant Position Encoding
Multimodal Large Language Models
Visual Fading
Inter-Modal Attention
Position Encoding