Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
In LVLMs, cross-modal RoPE variants erroneously enforce relative positional dependencies between text and image tokens: image tokens representing the same content receive distinct positional biases depending on their spatial locations, which undermines modality alignment. To address this, we propose Circle-RoPE, a geometrically decoupled RoPE variant that maps image token indices onto a circular trajectory orthogonal to the linear text path, forming a cone-like positional encoding structure in which all image tokens are equidistant from each text token, thereby disentangling cross-modal positional dependencies. Our contributions are: (1) Per-Token Distance (PTD), a metric for quantifying the positional independence of modalities; (2) Circle-RoPE, a geometry-aware, decoupled RoPE variant; and (3) a staggered layer strategy that interleaves different RoPE variants across layers. Circle-RoPE preserves intra-image spatial information while substantially suppressing spurious positional bias, yielding consistent performance gains across multiple LVLM benchmarks. Code is publicly available.

📝 Abstract
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD), a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at [https://github.com/lose4578/CircleRoPE](https://github.com/lose4578/CircleRoPE).
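The cone-like geometry described in the abstract can be illustrated with a small numeric sketch. This is an illustration of the geometric idea only, not the paper's implementation: text token indices lie along a line, and all image tokens share one anchor index on that line but are spread around a circle in an orthogonal plane, so any text token sees every image token at the same distance.

```python
import numpy as np

def text_position(p):
    # A text token with index p sits on the linear positional axis.
    return np.array([p, 0.0, 0.0])

def image_positions(p0, n_tokens, radius=1.0):
    # Image tokens share a single anchor index p0 on the text axis and are
    # placed on a circle of the given radius in the plane orthogonal to it
    # (the base of the cone). Intra-image structure is kept as angular order.
    angles = 2 * np.pi * np.arange(n_tokens) / n_tokens
    return np.stack([
        np.full(n_tokens, p0),
        radius * np.cos(angles),
        radius * np.sin(angles),
    ], axis=1)

# A text token at index p is at distance sqrt((p - p0)^2 + radius^2) from
# every image token, regardless of where the image token sits on the circle.
img = image_positions(p0=5.0, n_tokens=8, radius=2.0)
txt = text_position(2.0)
dists = np.linalg.norm(img - txt, axis=1)
print(np.allclose(dists, dists[0]))  # True: all image tokens are equidistant
```

Because the distance depends only on the gap along the text axis and the circle's radius, no image token is artificially "closer" to a given text token, which is the decoupling property the paper targets.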
Problem

Research questions and friction points this paper is trying to address.

Reduces cross-modal positional biases in vision-language models
Ensures equal text-image token distances to avoid spurious alignments
Preserves intra-image spatial information while minimizing artificial biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Per-Token Distance metric for cross-modal independence
Circle-RoPE with orthogonal circular image token mapping
Staggered layer strategy combining RoPE variants
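The staggered layer strategy can be sketched as a simple per-layer schedule. The variant names and the alternating schedule below are hypothetical placeholders, assuming only what the abstract states: different RoPE variants are applied at different layers so their strengths complement each other.

```python
def select_rope_variant(layer_idx, variants=("circle_rope", "standard_rope")):
    # Hypothetical staggered schedule: cycle through the available RoPE
    # variants by layer index, so adjacent layers use different variants.
    return variants[layer_idx % len(variants)]

schedule = [select_rope_variant(i) for i in range(4)]
print(schedule)
# ['circle_rope', 'standard_rope', 'circle_rope', 'standard_rope']
```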
Chengcheng Wang
Huawei Noah's Ark Lab
Jianyuan Guo
City University of Hong Kong (CityU)
Hongguang Li
Shanghai Jiao Tong University
Yuchuan Tian
State Key Lab of General AI, School of Intelligence Science and Technology, Peking University
Ying Nie
Huawei Noah's Ark Lab
Chang Xu
University of Sydney
Kai Han
Huawei Noah's Ark Lab