Multi-Vector Index Compression in Any Modality

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scalability challenge in multimodal late-interaction retrieval: each document is represented by multiple vectors, and the number of vectors grows linearly with input length, which hinders application to rich media such as images and videos. To reduce storage and computation, the authors study query-agnostic compression of document representations under a constant vector budget. The core contribution is Attention-Guided Clustering (AGC), which uses attention to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. AGC is evaluated alongside three other approaches: sequence resizing, memory tokens, and hierarchical pooling. Across BEIR, ViDoRe, MSR-VTT, and MultiVENT 2.0, it consistently outperforms the other parameterized methods (sequence resizing and memory tokens), offers more flexible index sizes than non-parametric hierarchical clustering, and is competitive with or better than a full, uncompressed index.
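To make the clustering idea concrete, here is a minimal sketch of attention-guided compression. It is not the authors' implementation. The summary states only that attention selects salient regions as centroids and weights the aggregation. Everything else is an assumption made for illustration: the hypothetical `compress` function, the per-token `saliency` input, top-k centroid selection, nearest-centroid assignment by cosine similarity, and the final normalization.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def compress(token_vecs, saliency, budget):
    """Reduce a document's token vectors to at most `budget` vectors.

    Assumption: `saliency` holds one non-negative attention-derived score
    per token; how such scores are obtained is outside this sketch.
    """
    k = min(budget, len(token_vecs))
    unit = [normalize(v) for v in token_vecs]

    # 1. Assumed selection rule: the k most salient tokens become centroids.
    order = sorted(range(len(unit)), key=lambda i: saliency[i], reverse=True)
    centroid_ids = order[:k]

    # 2. Assumed assignment rule: each token joins its most similar centroid.
    clusters = {c: [] for c in centroid_ids}
    for i, vec in enumerate(unit):
        nearest = max(centroid_ids, key=lambda c: dot(vec, unit[c]))
        clusters[nearest].append(i)

    # 3. Saliency-weighted mean of each cluster, renormalized to unit length.
    compressed = []
    for c in centroid_ids:
        members = clusters[c]
        total = sum(saliency[i] for i in members)
        if total > 0:
            dim = len(unit[c])
            pooled = [
                sum(saliency[i] * unit[i][d] for i in members) / total
                for d in range(dim)
            ]
        else:
            # Empty or zero-weight cluster: fall back to the centroid token.
            pooled = unit[c]
        compressed.append(normalize(pooled))
    return compressed
```

The output length depends only on `budget`, not on the number of input tokens, which is what a constant vector budget requires.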

📝 Abstract

We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
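The cost argument in the abstract can be illustrated with a small sketch of late-interaction scoring. It assumes the commonly used MaxSim formulation, in which each query vector is matched with its most similar document vector and the per-query maxima are summed. The abstract does not spell out the scoring function, so treat this as an assumption. The helper names `maxsim_score` and `index_bytes`, and the 2-bytes-per-value storage figure, are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query vector, take the best
    dot product against any document vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def index_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=2):
    """Rough index size; grows linearly with vectors stored per document.

    Assumption: dense storage with a fixed number of bytes per value.
    """
    return num_docs * vectors_per_doc * dim * bytes_per_value


# Toy example: two query vectors scored against two document vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, doc)

# Illustrative numbers only: 1024 vectors per document versus a budget of 32.
full = index_bytes(num_docs=1000, vectors_per_doc=1024, dim=128)
budgeted = index_bytes(num_docs=1000, vectors_per_doc=32, dim=128)
```

Both scoring work and storage scale with the number of document vectors, so capping that number at a constant budget bounds both, regardless of how long the original image, video, or text input was.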

Problem

Research questions and friction points this paper is trying to address.

multi-vector retrieval

late interaction

index compression

modality

vector budget

Innovation

Methods, ideas, or system contributions that make the work stand out.

late interaction

multi-vector retrieval

index compression