🤖 AI Summary
This work addresses the high storage and computational costs that vision-language models incur in fine-grained document retrieval, which stem from excessively large index vectors. While existing training-free pruning methods suffer significant performance degradation at high compression ratios, we propose Structural Anchor Pruning (SAP), a training-free approach that achieves efficient compression by identifying critical image patches in the intermediate layers of the visual encoder. We further introduce the Oracle Score Retention (OSR) protocol to evaluate how much each layer's information contributes to retrieval performance. Our analysis reveals that semantic structural anchors consistently reside in intermediate layers rather than the final layer, challenging the conventional paradigm of pruning exclusively on the last layer. On the ViDoRe benchmark, SAP maintains high retrieval fidelity even at compression rates exceeding 90%, offering a highly scalable solution for vision-augmented RAG systems.
📄 Abstract
Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index-vector storage overheads. Training-free pruning solutions (e.g., EOS-attention-based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression regimes (>80%). Prior work (e.g., Light-ColPali) attributes this to visual token importance being inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free method that identifies key visual patches in the middle layers of the visual encoder to achieve high compression with minimal performance loss. We also introduce the Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, whereas such structural signals dissipate in the final layer on which traditional pruning solutions focus.
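The core idea described above, scoring image patches from an intermediate encoder layer and indexing only the top-scoring "anchor" patches, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `structural_anchor_prune`, the use of column-summed attention as the anchor score, and the head-averaging step are all assumptions for illustration.

```python
import numpy as np

def structural_anchor_prune(patch_embeddings, mid_layer_attn, keep_ratio=0.1):
    """Hypothetical sketch of SAP-style pruning: score each patch by the
    attention it *receives* in a middle encoder layer, then keep only the
    top-scoring patches as the document's index vectors.

    patch_embeddings : (num_patches, dim) final-layer patch vectors to index
    mid_layer_attn   : (num_heads, num_patches, num_patches) attention weights
                       taken from an intermediate layer (row = query patch,
                       column = key patch)
    keep_ratio       : fraction of patches to retain (0.1 => >90% compression)
    """
    # Average over heads, then sum each column: total attention received
    # by every patch from all other patches in that layer.
    received = mid_layer_attn.mean(axis=0).sum(axis=0)   # shape (num_patches,)
    k = max(1, int(len(received) * keep_ratio))
    anchors = np.argsort(received)[-k:]                  # top-k anchor indices
    return patch_embeddings[np.sort(anchors)]            # keep original order

# Toy example: 100 patches, 64-dim embeddings, 8 attention heads.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
attn = rng.random((8, 100, 100))
pruned = structural_anchor_prune(emb, attn, keep_ratio=0.1)
print(pruned.shape)  # (10, 64): a >90% smaller index for this document
```

Because the scoring uses only quantities already computed during a forward pass, a selection rule of this shape requires no training or model adaptation, which is what makes the approach drop-in for an existing retrieval index.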