🤖 AI Summary
This work addresses the challenge of large index sizes in multi-vector retrieval models, which stem from their long embedding sequences and hinder practical deployment. The study presents the first systematic evaluation of training-free token compression strategies that directly reduce the sequence dimensionality of multi-vector embeddings, lowering both memory overhead and query latency. Comparing token merging against token pruning, the authors demonstrate that merging achieves a superior trade-off: it substantially shrinks index size while more effectively preserving retrieval performance. These findings establish token merging as a practical and effective way to make multi-vector retrieval efficient with little loss in accuracy.
📝 Abstract
While multi-vector retrieval models outperform single-vector models of comparable size in retrieval quality, their practicality is limited by substantially larger index sizes, driven by the additional sequence-length dimension in their document embeddings. Because document embedding size dictates both memory overhead and query latency, compression is essential for deployment. In this work, we present an evaluation of training-free methods targeting the token sequence length, a dimension unique to multi-vector retrieval. Our findings suggest that token merging is strictly superior to token pruning for reducing index size while maintaining retrieval effectiveness.
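The pruning-versus-merging trade-off can be illustrated with a minimal NumPy sketch. The norm-based pruning score and the greedy nearest-neighbour merge rule below are illustrative stand-ins under common conventions, not the paper's specific algorithms:

```python
import numpy as np

def prune_tokens(emb, k):
    """Token pruning: keep the k token vectors with the largest L2 norm
    and discard the rest (a common training-free salience heuristic;
    the paper's exact scoring rule may differ)."""
    norms = np.linalg.norm(emb, axis=1)
    keep = np.sort(np.argsort(norms)[-k:])  # preserve original token order
    return emb[keep]

def merge_tokens(emb, k):
    """Token merging: repeatedly average the two most cosine-similar
    token vectors until only k remain, so all tokens contribute to the
    compressed index (an illustrative merge rule, not the paper's)."""
    vecs = [v.astype(float) for v in emb]
    while len(vecs) > k:
        m = np.stack(vecs)
        unit = m / np.linalg.norm(m, axis=1, keepdims=True)
        sim = unit @ unit.T
        np.fill_diagonal(sim, -np.inf)       # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (vecs[i] + vecs[j]) / 2.0   # average the closest pair
        vecs = [v for t, v in enumerate(vecs) if t not in (i, j)]
        vecs.append(merged)
    return np.stack(vecs)
```

Both functions shrink a document's token matrix from `n` vectors to `k`, cutting index size by the same factor; the difference is that pruning discards the information in dropped tokens outright, while merging folds it into the surviving vectors, which is the intuition behind merging's better retention of retrieval quality.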