🤖 AI Summary
This work addresses the high indexing overhead of late-interaction retrieval models such as ColBERT, which store a dense embedding for every document token. To mitigate this, the authors propose a token pruning method grounded in geometric estimation via Voronoi cells. For the first time, Voronoi regions in the high-dimensional embedding space are used to quantify token importance by measuring each token's region of influence, yielding a formal and interpretable pruning criterion. Experiments across multiple retrieval benchmarks show that the proposed strategy substantially reduces index size while maintaining, and in some cases improving, retrieval effectiveness. The approach also serves as an interpretable analytical tool for understanding token-level contributions in late-interaction architectures.
📝 Abstract
Late-interaction models like ColBERT offer competitive performance across various retrieval tasks, but require storing a dense embedding for each document token, leading to substantial index storage overhead. Prior work addresses this by pruning low-importance token embeddings based on statistical and empirical measures, but these methods often either lack formal grounding or prove ineffective. To address these shortcomings, we introduce a framework grounded in hyperspace geometry and cast token pruning as a Voronoi cell estimation problem in the embedding space. By interpreting each token's influence as the measure of its Voronoi region, our approach enables principled pruning that retains retrieval quality while reducing index size. Our experiments demonstrate that this approach serves not only as a competitive pruning strategy but also as a valuable tool for improving and interpreting token-level behavior within dense retrieval systems.
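To make the core idea concrete, here is a minimal sketch of Voronoi-measure-based token pruning. The paper's actual estimator is not specified in this summary, so the sketch makes two labeled assumptions: token embeddings live on the unit hypersphere (as in ColBERT's cosine-similarity setup), and each token's Voronoi measure is approximated by Monte Carlo sampling — drawing random directions on the sphere and counting how many fall nearest to each token. The function names `voronoi_token_importance` and `prune_tokens` are illustrative, not from the paper.

```python
import numpy as np

def voronoi_token_importance(token_embs, n_samples=10000, seed=0):
    """Monte Carlo estimate of each token's Voronoi cell measure on the
    unit hypersphere (an illustrative stand-in for the paper's estimator):
    sample random directions, assign each to the nearest token embedding
    (max dot product), and count assignments per token."""
    rng = np.random.default_rng(seed)
    d = token_embs.shape[1]
    # Normalize token embeddings; late-interaction models like ColBERT
    # typically score with cosine/dot similarity over unit vectors.
    embs = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    # Gaussian samples normalized to the sphere are uniform on it.
    samples = rng.standard_normal((n_samples, d))
    samples /= np.linalg.norm(samples, axis=1, keepdims=True)
    nearest = (samples @ embs.T).argmax(axis=1)
    counts = np.bincount(nearest, minlength=embs.shape[0])
    # Fraction of the sphere "owned" by each token = its influence.
    return counts / n_samples

def prune_tokens(token_embs, keep_ratio=0.5, n_samples=10000, seed=0):
    """Keep the tokens with the largest estimated Voronoi measure,
    discarding low-influence embeddings to shrink the index."""
    importance = voronoi_token_importance(token_embs, n_samples, seed)
    k = max(1, int(len(importance) * keep_ratio))
    keep = np.argsort(importance)[::-1][:k]
    return np.sort(keep)
```

Under this reading, a token whose embedding sits close to a near-duplicate owns a small Voronoi cell and contributes little discriminative signal, so it is a natural pruning candidate; tokens with large cells dominate more of the similarity space and are kept.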