🤖 AI Summary
This work addresses the high storage and computational costs that vision-language models incur in fine-grained document retrieval, which stem from excessively large index vectors. While existing training-free pruning methods suffer significant performance degradation at high compression ratios, we propose Structural Anchor Pruning (SAP), a training-free approach that achieves efficient compression by identifying critical image patches in the intermediate layers of the visual encoder. We further introduce the Oracle Score Retention (OSR) protocol to evaluate how much each layer's information contributes to retrieval performance. Our analysis reveals that semantic structural anchors consistently reside in intermediate layers rather than the final layer, challenging the conventional paradigm of pruning exclusively on the last layer. On the ViDoRe benchmark, SAP maintains high retrieval fidelity even at compression rates exceeding 90%, offering a highly scalable solution for vision-augmented RAG systems.
📄 Abstract
Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index-vector storage overheads. Training-free pruning solutions (e.g., EOS-attention-based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression regimes (>80%). Prior work (e.g., Light-ColPali) attributes this to visual token importance being inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free method that identifies key visual patches in the middle layers of the visual encoder to achieve high compression with minimal performance loss. We also introduce the Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, whereas such structural signals dissipate in the final layer on which traditional pruning solutions focus.
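The core idea described above, scoring image patches from an intermediate encoder layer and indexing only the top-scoring "anchor" patches, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function name `structural_anchor_prune`, the use of column-summed attention as the anchor score, and the head-averaging step are all assumptions for illustration.

```python
import numpy as np

def structural_anchor_prune(patch_embeddings, mid_layer_attn, keep_ratio=0.1):
    """Hypothetical sketch of SAP-style pruning: score each patch by the
    attention it *receives* in a middle encoder layer, then keep only the
    top-scoring patches as the document's index vectors.

    patch_embeddings : (num_patches, dim) final-layer patch vectors to index
    mid_layer_attn   : (num_heads, num_patches, num_patches) attention weights
                       taken from an intermediate layer (row = query patch,
                       column = key patch)
    keep_ratio       : fraction of patches to retain (0.1 => >90% compression)
    """
    # Average over heads, then sum each column: total attention received
    # by every patch from all other patches in that layer.
    received = mid_layer_attn.mean(axis=0).sum(axis=0)   # shape (num_patches,)
    k = max(1, int(len(received) * keep_ratio))
    anchors = np.argsort(received)[-k:]                  # top-k anchor indices
    return patch_embeddings[np.sort(anchors)]            # keep original order

# Toy example: 100 patches, 64-dim embeddings, 8 attention heads.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
attn = rng.random((8, 100, 100))
pruned = structural_anchor_prune(emb, attn, keep_ratio=0.1)
print(pruned.shape)  # (10, 64): a >90% smaller index for this document
```

Because the scoring uses only quantities already computed during a forward pass, a selection rule of this shape requires no training or model adaptation, which is what makes the approach drop-in for an existing retrieval index.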