AI Summary
To address the challenge of efficiently serving large-scale late-interaction retrieval models (e.g., ColBERT) under tight budget and limited GPU memory constraints, this paper proposes a lightweight serving framework tailored for high-concurrency deployment. Our method introduces two key innovations: (1) a memory-mapped indexing mechanism that enables on-demand loading of ultra-large vector indices, eliminating the need for full in-memory residency; and (2) a multi-stage hybrid scoring architecture integrating coarse filtering, fine-grained re-ranking, and late-interaction modeling to jointly optimize efficiency and accuracy. Experiments demonstrate that our approach reduces index memory footprint by 90%, substantially lowers query latency, and supports over one thousand concurrent requests on commodity CPU servers, marking the first practical, cost-effective, and high-throughput industrial deployment of late-interaction models.
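The memory-mapped indexing idea can be illustrated with a small sketch: instead of loading every token vector into RAM, the index file is mapped into the address space and the OS pages in only the vectors a query actually touches. The file name, embedding dimension, and per-passage offset table below are illustrative assumptions, not the paper's actual on-disk format.

```python
import numpy as np

DIM = 128          # embedding dimension (assumption; ColBERT commonly uses 128)
N_VECTORS = 1000   # total token vectors in this toy index

# Build a toy index file on disk (in a real system this is the prebuilt index).
rng = np.random.default_rng(0)
rng.standard_normal((N_VECTORS, DIM), dtype=np.float32).tofile("index.bin")

# Memory-map the index: vectors are paged in on demand by the OS,
# so resident memory stays far below the full index size.
index = np.memmap("index.bin", dtype=np.float32, mode="r",
                  shape=(N_VECTORS, DIM))

# Hypothetical offset table: passage i owns rows offsets[i]:offsets[i+1].
offsets = np.arange(0, N_VECTORS + 1, 10)

def passage_vectors(pid: int) -> np.ndarray:
    """Fetch only one passage's token vectors; touches just a few pages."""
    return np.asarray(index[offsets[pid]:offsets[pid + 1]])

print(passage_vectors(3).shape)  # (10, 128)
```

Because the mapping is read-only and shared, many worker processes on one server can serve from the same physical pages, which is what makes cheap high-concurrency deployment feasible.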
Abstract
We study serving retrieval models, specifically late-interaction models like ColBERT, to many concurrent users under a small budget, where the index may not fit in memory. We present ColBERT-serve, a novel serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting deployment on cheap servers, and that incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT's query latency and supporting many concurrent queries in parallel.
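The multi-stage architecture can be sketched as a funnel: a cheap first stage prunes the corpus, and the expensive late-interaction (MaxSim) score is computed only for the survivors. This is a minimal illustration, not the paper's system: the coarse stage here is a mean-pooled dot product standing in for a real first-stage retriever (e.g., BM25 or ANN search), and the candidate counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16  # toy embedding dimension (assumption)

# Toy corpus: each passage is a (num_tokens, DIM) matrix of token embeddings.
passages = [rng.standard_normal((int(rng.integers(5, 12)), DIM),
                                dtype=np.float32) for _ in range(100)]
query = rng.standard_normal((8, DIM), dtype=np.float32)

def maxsim(q: np.ndarray, p: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token take the max
    similarity over passage tokens, then sum over query tokens."""
    return float((q @ p.T).max(axis=1).sum())

def coarse_score(q: np.ndarray, p: np.ndarray) -> float:
    """Cheap stand-in for the coarse stage: dot product of mean embeddings."""
    return float(q.mean(axis=0) @ p.mean(axis=0))

# Stage 1: coarse filtering keeps the top 20 candidates.
coarse = sorted(range(len(passages)),
                key=lambda i: coarse_score(query, passages[i]),
                reverse=True)[:20]

# Stage 2: exact MaxSim re-ranks only the survivors.
final = sorted(coarse, key=lambda i: maxsim(query, passages[i]), reverse=True)
print(final[:5])
```

The efficiency win comes from the funnel shape: MaxSim runs on 20 passages instead of 100, and in a real deployment the ratio between corpus size and re-rank depth is far larger.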