ColBERT-Serve: Efficient Multi-stage Memory-Mapped Scoring

📅 2025-04-21
🏛️ European Conference on Information Retrieval
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the challenge of efficiently serving large-scale late-interaction retrieval models (e.g., ColBERT) under a tight budget and limited RAM, this paper proposes a lightweight serving framework tailored for high-concurrency deployment. The method introduces two key innovations: (1) a memory-mapped indexing mechanism that loads parts of a very large vector index on demand, so the full index never has to reside in memory; and (2) a multi-stage hybrid scoring architecture that combines coarse filtering, fine-grained re-ranking, and late-interaction modeling to balance efficiency and accuracy. Experiments demonstrate that the approach reduces the index memory footprint by 90%, substantially lowers query latency, and supports over one thousand concurrent requests on commodity CPU servers, which the paper presents as the first practical, cost-effective, high-throughput industrial deployment of late-interaction models.
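
As a rough illustration of the memory-mapping idea in (1), the sketch below maps a toy on-disk vector file and reads a single vector without loading the whole file into RAM. It is not ColBERT-serve's actual index format: the flat float32 layout, `DIM`, `write_index`, and `read_vector` are all assumptions made for this example, and only Python's standard `mmap` and `struct` modules are used.

```python
import mmap
import os
import struct
import tempfile

DIM = 4         # hypothetical embedding dimension, kept tiny for the example
FLOAT_SIZE = 4  # bytes per float32


def write_index(path, vectors):
    """Write a flat little-endian float32 file, one DIM-sized vector after another."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack(f"<{DIM}f", *vec))


def read_vector(mm, i):
    """Decode vector i directly from the mapping; the OS pages in only what is touched."""
    return struct.unpack_from(f"<{DIM}f", mm, i * DIM * FLOAT_SIZE)


# Build a toy index on disk (assumed layout, for illustration only).
vectors = [[float(i)] * DIM for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "toy_index.bin")
write_index(path, vectors)

# Map the file read-only instead of reading it into a Python object.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

vec_42 = read_vector(mm, 42)
print(vec_42)
mm.close()
```

Because the mapping is backed by the file, resident memory grows with the pages a query actually touches rather than with the size of the index, which is the property the summary describes as on-demand loading.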

📝 Abstract
We study serving retrieval models, specifically late interaction models like ColBERT, to many concurrent users at once and under a small budget, in which the index may not fit in memory. We present ColBERT-serve, a novel serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting its deployment on cheap servers, and incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT's query latency and supporting many concurrent queries in parallel.
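
A minimal sketch of what a multi-stage pipeline with hybrid scoring could look like is given below. The abstract does not specify the stages or the combination rule, so everything here is an assumption for illustration: a cheap first-stage score selects candidates, a late-interaction (MaxSim) score re-ranks them, and the two are combined by linear interpolation with a hypothetical weight `alpha`. The names `maxsim` and `multi_stage_search` are likewise invented for the example.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query token takes its best-matching
    document token (by dot product), and the per-token maxima are summed."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def multi_stage_search(query_vecs, first_stage_scores, doc_token_vecs,
                       k_candidates=2, alpha=0.5):
    """Stage 1 keeps the top-k documents by first-stage score; stage 2
    re-ranks them by an interpolated (hybrid) score. alpha is hypothetical."""
    candidates = sorted(first_stage_scores, key=first_stage_scores.get,
                        reverse=True)[:k_candidates]
    results = []
    for doc_id in candidates:
        late = maxsim(query_vecs, doc_token_vecs[doc_id])
        hybrid = alpha * first_stage_scores[doc_id] + (1 - alpha) * late
        results.append((doc_id, hybrid))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy data: two query token vectors and three documents.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_token_vecs = {
    "d1": [[1.0, 0.0], [0.0, 1.0]],
    "d2": [[0.5, 0.5]],
    "d3": [[1.0, 0.0]],
}
first_stage_scores = {"d1": 1.0, "d2": 1.5, "d3": 0.5}

ranked = multi_stage_search(query_vecs, first_stage_scores, doc_token_vecs)
print(ranked)
```

In this toy run the first stage prefers `d2`, but the late-interaction score lifts `d1` to the top after interpolation, while `d3` is never scored by the expensive stage at all. Restricting the costly scoring to a small candidate set is one plausible way such a design reduces latency.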
Problem

Research questions and friction points this paper is trying to address.

Efficient serving of retrieval models under memory constraints
Reducing RAM usage for ColBERT index deployment
Improving query latency and concurrency in retrieval systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-mapping strategy reduces RAM usage
Multi-stage architecture with hybrid scoring
Supports many concurrent queries efficiently