IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of large vision-language models (LVLMs) in high-resolution image reasoning, where existing visual token pruning methods often degrade performance by neglecting spatial structure. The authors propose a training-free, prompt-aware pruning strategy built on the insight, revealed here for the first time, that LVLMs implicitly construct a visual coordinate system via Rotary Position Embedding (RoPE). Leveraging this insight, they define Implicit Visual Coordinate (IVC) tokens that are critical for spatial reasoning, identify them from the mathematical properties of RoPE's rotation matrices, and additionally select semantically relevant foreground tokens through semantic seed discovery followed by value-vector similarity refinement. Evaluated across four prominent LVLMs and twenty benchmarks, the approach achieves approximately 50% visual token compression while preserving at least 99% of original performance, and even improving it in some cases.
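The IVC criterion in the summary can be sketched numerically. Under standard RoPE, position $m$ rotates each two-dimensional channel pair $i$ by angle $\theta_i = m \cdot \text{base}^{-2i/d}$, and a position is "coordinate-like" when many of these rotations are close to the identity ($\theta \approx 0 \bmod 2\pi$) or to a $90^\circ$ turn ($\theta \approx \pi/2 \bmod 2\pi$). The following is a minimal illustrative sketch, not the paper's exact procedure; the function names, tolerance, and the aggregation into a per-position fraction are assumptions:

```python
import numpy as np

def rope_angles(position, dim=128, base=10000.0):
    # Per-pair RoPE rotation angles: theta_i = position * base^(-2i/dim)
    # for each 2-D channel pair i = 0 .. dim/2 - 1.
    i = np.arange(dim // 2)
    return position * base ** (-2.0 * i / dim)

def special_rotation_fraction(position, tol=0.05, dim=128, base=10000.0):
    # Fraction of channel pairs whose rotation matrix is close to the
    # identity (theta near 0 mod 2*pi) or to a 90-degree rotation
    # (theta near pi/2 mod 2*pi). "tol" is a hypothetical tolerance;
    # the paper's actual selection rule may differ.
    theta = rope_angles(position, dim, base) % (2 * np.pi)
    near_identity = np.minimum(theta, 2 * np.pi - theta) < tol
    near_quarter = np.abs(theta - np.pi / 2) < tol
    return float((near_identity | near_quarter).mean())
```

At position 0 every pair rotates by angle zero, so the fraction is 1.0; at generic positions only a subset of the low-frequency pairs stays near a special rotation, which is the kind of position-dependent structure the IVC analysis exploits.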

📝 Abstract
Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into \emph{how LVLMs process spatial reasoning}. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as \textbf{implicit visual coordinates} (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose \textbf{IVC-Prune}, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate the identity matrix or the $90^\circ$ rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50\% while maintaining $\geq$ 99\% of the original performance and even achieving improvements on several benchmarks. Source code is available at https://github.com/FireRedTeam/IVC-Prune.
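The abstract's two-stage foreground selection can likewise be sketched: seed tokens discovered by semantic relevance are expanded with tokens whose value vectors are cosine-similar to the seed set. This is an illustrative sketch under assumed names and a hand-picked threshold, not the released implementation:

```python
import numpy as np

def expand_foreground(value_vecs, seed_idx, sim_thresh=0.6):
    # Stage 1 is assumed done: seed_idx holds semantically discovered
    # seed tokens. Stage 2 (contextual refinement, sketched here): add
    # any token whose L2-normalized value vector has cosine similarity
    # above sim_thresh (a hypothetical threshold) with the mean seed
    # value vector.
    v = value_vecs / np.linalg.norm(value_vecs, axis=1, keepdims=True)
    seed_mean = v[seed_idx].mean(axis=0)
    seed_mean /= np.linalg.norm(seed_mean)
    sims = v @ seed_mean  # cosine similarity of every token to the seed mean
    keep = set(seed_idx) | set(np.where(sims > sim_thresh)[0].tolist())
    return sorted(keep)
```

On a toy set of four 2-D value vectors where token 1 points almost the same way as seed token 0, the refinement pulls token 1 into the foreground set while leaving the orthogonal and opposite tokens pruned.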
Problem

Research questions and friction points this paper is trying to address.

vision-language models
inference cost
visual token pruning
spatial reasoning
high-resolution inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Visual Coordinates
Rotary Position Embeddings
Vision Token Pruning
Spatial Reasoning
Training-Free Pruning