AI Summary
To address the challenge of efficiently serving large-scale late-interaction retrieval models (e.g., ColBERT) under tight budget and limited GPU memory constraints, this paper proposes a lightweight serving framework tailored for high-concurrency deployment. Our method introduces two key innovations: (1) a memory-mapped indexing mechanism that enables on-demand loading of ultra-large vector indices, eliminating the need for full in-memory residency; and (2) a multi-stage hybrid scoring architecture integrating coarse filtering, fine-grained re-ranking, and late-interaction modeling to jointly optimize efficiency and accuracy. Experiments demonstrate that our approach reduces index memory footprint by 90%, substantially lowers query latency, and supports over one thousand concurrent requests on commodity CPU servers, marking the first practical, cost-effective, and high-throughput industrial deployment of late-interaction models.
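The memory-mapped indexing idea can be illustrated with a small sketch: instead of loading every token vector into RAM, the index file is mapped into the address space and the OS pages in only the vectors a query actually touches. The file name, embedding dimension, and per-passage offset table below are illustrative assumptions, not the paper's actual on-disk format.

```python
import numpy as np

DIM = 128          # embedding dimension (assumption; ColBERT commonly uses 128)
N_VECTORS = 1000   # total token vectors in this toy index

# Build a toy index file on disk (in a real system this is the prebuilt index).
rng = np.random.default_rng(0)
rng.standard_normal((N_VECTORS, DIM), dtype=np.float32).tofile("index.bin")

# Memory-map the index: vectors are paged in on demand by the OS,
# so resident memory stays far below the full index size.
index = np.memmap("index.bin", dtype=np.float32, mode="r",
                  shape=(N_VECTORS, DIM))

# Hypothetical offset table: passage i owns rows offsets[i]:offsets[i+1].
offsets = np.arange(0, N_VECTORS + 1, 10)

def passage_vectors(pid: int) -> np.ndarray:
    """Fetch only one passage's token vectors; touches just a few pages."""
    return np.asarray(index[offsets[pid]:offsets[pid + 1]])

print(passage_vectors(3).shape)  # (10, 128)
```

Because the mapping is read-only and shared, many worker processes on one server can serve from the same physical pages, which is what makes cheap high-concurrency deployment feasible.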
Abstract
We study serving retrieval models, specifically late-interaction models like ColBERT, to many concurrent users under a small budget, where the index may not fit in memory. We present ColBERT-serve, a novel serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting deployment on cheap servers, and that incorporates a multi-stage architecture with hybrid scoring, reducing ColBERT's query latency and supporting many concurrent queries in parallel.
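The multi-stage architecture can be sketched as a funnel: a cheap first stage prunes the corpus, and the expensive late-interaction (MaxSim) score is computed only for the survivors. This is a minimal illustration, not the paper's system: the coarse stage here is a mean-pooled dot product standing in for a real first-stage retriever (e.g., BM25 or ANN search), and the candidate counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16  # toy embedding dimension (assumption)

# Toy corpus: each passage is a (num_tokens, DIM) matrix of token embeddings.
passages = [rng.standard_normal((int(rng.integers(5, 12)), DIM),
                                dtype=np.float32) for _ in range(100)]
query = rng.standard_normal((8, DIM), dtype=np.float32)

def maxsim(q: np.ndarray, p: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token take the max
    similarity over passage tokens, then sum over query tokens."""
    return float((q @ p.T).max(axis=1).sum())

def coarse_score(q: np.ndarray, p: np.ndarray) -> float:
    """Cheap stand-in for the coarse stage: dot product of mean embeddings."""
    return float(q.mean(axis=0) @ p.mean(axis=0))

# Stage 1: coarse filtering keeps the top 20 candidates.
coarse = sorted(range(len(passages)),
                key=lambda i: coarse_score(query, passages[i]),
                reverse=True)[:20]

# Stage 2: exact MaxSim re-ranks only the survivors.
final = sorted(coarse, key=lambda i: maxsim(query, passages[i]), reverse=True)
print(final[:5])
```

The efficiency win comes from the funnel shape: MaxSim runs on 20 passages instead of 100, and in a real deployment the ratio between corpus size and re-rank depth is far larger.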