IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of large vision-language models (LVLMs) in high-resolution image reasoning, where existing visual token pruning methods often degrade performance by neglecting spatial structure. The authors propose a training-free, prompt-aware pruning strategy built on the insight, revealed here for the first time, that LVLMs implicitly construct a visual coordinate system via Rotary Position Embedding (RoPE). Leveraging this insight, they define Implicit Visual Coordinate (IVC) tokens that are critical for spatial reasoning, identify them from the mathematical properties of RoPE's rotation matrices, and additionally select semantically relevant foreground tokens through semantic seed discovery followed by value-vector similarity refinement. Evaluated across four prominent LVLMs and twenty benchmarks, the approach achieves approximately 50% visual token compression while preserving at least 99% of original performance, and even improving it in some cases.
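The IVC criterion in the summary can be sketched numerically. Under standard RoPE, position $m$ rotates each two-dimensional channel pair $i$ by angle $\theta_i = m \cdot \text{base}^{-2i/d}$, and a position is "coordinate-like" when many of these rotations are close to the identity ($\theta \approx 0 \bmod 2\pi$) or to a $90^\circ$ turn ($\theta \approx \pi/2 \bmod 2\pi$). The following is a minimal illustrative sketch, not the paper's exact procedure; the function names, tolerance, and the aggregation into a per-position fraction are assumptions:

```python
import numpy as np

def rope_angles(position, dim=128, base=10000.0):
    # Per-pair RoPE rotation angles: theta_i = position * base^(-2i/dim)
    # for each 2-D channel pair i = 0 .. dim/2 - 1.
    i = np.arange(dim // 2)
    return position * base ** (-2.0 * i / dim)

def special_rotation_fraction(position, tol=0.05, dim=128, base=10000.0):
    # Fraction of channel pairs whose rotation matrix is close to the
    # identity (theta near 0 mod 2*pi) or to a 90-degree rotation
    # (theta near pi/2 mod 2*pi). "tol" is a hypothetical tolerance;
    # the paper's actual selection rule may differ.
    theta = rope_angles(position, dim, base) % (2 * np.pi)
    near_identity = np.minimum(theta, 2 * np.pi - theta) < tol
    near_quarter = np.abs(theta - np.pi / 2) < tol
    return float((near_identity | near_quarter).mean())
```

At position 0 every pair rotates by angle zero, so the fraction is 1.0; at generic positions only a subset of the low-frequency pairs stays near a special rotation, which is the kind of position-dependent structure the IVC analysis exploits.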

📝 Abstract
Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into \emph{how LVLMs process spatial reasoning}. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as \textbf{implicit visual coordinates} (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose \textbf{IVC-Prune}, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate the identity matrix or the $90^\circ$ rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50\% while maintaining $\geq$ 99\% of the original performance and even achieving improvements on several benchmarks. Source code is available at https://github.com/FireRedTeam/IVC-Prune.
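The abstract's two-stage foreground selection can likewise be sketched: seed tokens discovered by semantic relevance are expanded with tokens whose value vectors are cosine-similar to the seed set. This is an illustrative sketch under assumed names and a hand-picked threshold, not the released implementation:

```python
import numpy as np

def expand_foreground(value_vecs, seed_idx, sim_thresh=0.6):
    # Stage 1 is assumed done: seed_idx holds semantically discovered
    # seed tokens. Stage 2 (contextual refinement, sketched here): add
    # any token whose L2-normalized value vector has cosine similarity
    # above sim_thresh (a hypothetical threshold) with the mean seed
    # value vector.
    v = value_vecs / np.linalg.norm(value_vecs, axis=1, keepdims=True)
    seed_mean = v[seed_idx].mean(axis=0)
    seed_mean /= np.linalg.norm(seed_mean)
    sims = v @ seed_mean  # cosine similarity of every token to the seed mean
    keep = set(seed_idx) | set(np.where(sims > sim_thresh)[0].tolist())
    return sorted(keep)
```

On a toy set of four 2-D value vectors where token 1 points almost the same way as seed token 0, the refinement pulls token 1 into the foreground set while leaving the orthogonal and opposite tokens pruned.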
Problem

Research questions and friction points this paper is trying to address.

vision-language models
inference cost
visual token pruning
spatial reasoning
high-resolution inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Visual Coordinates
Rotary Position Embeddings
Vision Token Pruning
Spatial Reasoning
Training-Free Pruning