🤖 AI Summary
To address the high latency and low efficiency of multi-vector retrieval methods such as ColBERT and XTR, this paper proposes WARP, an end-to-end retrieval engine for XTR-based ColBERT retrievers. The approach introduces three key innovations: (1) WARP_SELECT, a dynamic similarity imputation mechanism; (2) implicit decompression, which avoids full vector reconstruction; and (3) a two-stage reduction process that makes scoring more efficient. These are combined with optimized C++ kernels and a specialized inference runtime. Experiments show that WARP reduces end-to-end latency by 41× relative to the XTR reference implementation and is 3× faster than the ColBERTv2 PLAID engine, while preserving retrieval quality (e.g., maintaining identical Recall@100). The result is a faster multi-vector retrieval engine that remains compatible with XTR without sacrificing retrieval quality.
📝 Abstract
We study the efficiency of multi-vector retrieval methods like ColBERT and its recent variant XTR. We introduce WARP, a retrieval engine that drastically improves the efficiency of XTR-based ColBERT retrievers through three key innovations: (1) WARP$_\text{SELECT}$ for dynamic similarity imputation, (2) implicit decompression to bypass costly vector reconstruction during retrieval, and (3) a two-stage reduction process for efficient scoring. Combined with highly optimized C++ kernels and specialized inference runtimes, WARP reduces end-to-end query latency by 41x relative to XTR's reference implementation. It also achieves a 3x speedup over the official ColBERTv2 PLAID engine, while preserving retrieval quality.
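The abstract names implicit decompression and a two-stage reduction without spelling them out, so the following is only a minimal sketch of the general idea, not WARP's actual implementation. It assumes a ColBERTv2/PLAID-style residual compression in which each token vector is stored as a centroid plus per-dimension quantized residual codes, and it assumes that a query token with no retrieved match for a document receives an imputed "missing" score. All function names, parameters, and numbers are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def score_explicit(query, centroid, codes, bucket_weights) -> float:
    """Baseline: rebuild the full token vector, then take the dot product."""
    reconstructed = [c + bucket_weights[k] for c, k in zip(centroid, codes)]
    return dot(query, reconstructed)


def score_implicit(query, centroid_score, codes, bucket_weights) -> float:
    """Same score without materializing the vector.

    Uses q . (c + r) = q . c + sum_i q_i * w[code_i]; the q . c term
    can be computed once per centroid and shared by all its tokens.
    """
    return centroid_score + sum(q * bucket_weights[k] for q, k in zip(query, codes))


def two_stage_reduce(
    hits: Iterable[Tuple[str, int, float]],
    missing: List[float],
) -> Dict[str, float]:
    """hits holds (doc_id, query_token_index, score) triples.

    Stage 1: keep the best score per (document, query token).
    Stage 2: sum over query tokens, filling gaps with missing[i]
    (the assumed imputed similarity for query token i).
    """
    best: Dict[Tuple[str, int], float] = {}
    for doc, qi, s in hits:
        key = (doc, qi)
        if key not in best or s > best[key]:
            best[key] = s
    docs = {doc for doc, _ in best}
    return {
        doc: sum(best.get((doc, qi), missing[qi]) for qi in range(len(missing)))
        for doc in docs
    }


# Toy data (hypothetical): a 4-dimensional query token and one compressed token.
query = [1.0, 2.0, -1.0, 0.5]
centroid = [0.5, 0.25, 1.0, -2.0]
codes = [0, 3, 1, 2]                        # one bucket index per dimension
bucket_weights = [-0.5, -0.25, 0.25, 0.5]   # residual value for each bucket

explicit = score_explicit(query, centroid, codes, bucket_weights)
implicit = score_implicit(query, dot(query, centroid), codes, bucket_weights)

doc_scores = two_stage_reduce(
    hits=[("d1", 0, 0.5), ("d1", 0, 0.75), ("d1", 1, 0.25), ("d2", 0, 0.5)],
    missing=[0.125, 0.125],
)
```

In this sketch both scoring paths return the same value; the implicit path simply skips building the reconstructed vector, which is the kind of redundancy the abstract says WARP bypasses. The reduction step shows why imputation matters: document `d2` has no hit for the second query token, so its total depends on the assumed missing-similarity estimate.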