Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets

📅 2025-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper systematically investigates the scalability of learned sparse text embeddings—based on SPLADE—in ultra-large-scale retrieval (MSMARCO v2, 138 million passages). Addressing the efficiency-effectiveness trade-off of existing approximate search methods (the recently proposed Seismic and graph-based indexes adapted from dense retrieval), it provides an empirical analysis of index construction cost, query latency, and recall at this scale. The contributions are threefold: (1) identifying critical bottlenecks in industrial-scale sparse retrieval, including memory pressure in graph indexes and accuracy degradation in approximate methods as collections grow; (2) outlining approximate nearest neighbor (ANN) optimizations tailored to the structural properties of sparse embeddings; and (3) establishing an end-to-end evaluation benchmark for sparse embeddings at the 138-million-document scale, yielding empirically grounded insights and design principles for efficient, production-ready sparse retrieval systems.

📝 Abstract
Learned sparse text embeddings have gained popularity due to their effectiveness in top-k retrieval and inherent interpretability. Their distributional idiosyncrasies, however, have long hindered their use in real-world retrieval systems. That changed with the recent development of approximate algorithms that leverage the distributional properties of sparse embeddings to speed up retrieval. Nonetheless, in much of the existing literature, evaluation has been limited to datasets with only a few million documents, such as MSMARCO. It remains unclear how these systems behave on much larger datasets and what challenges lurk at larger scales. To bridge that gap, we investigate the behavior of state-of-the-art retrieval algorithms on massive datasets. We compare and contrast the recently proposed Seismic and graph-based solutions adapted from dense retrieval. We extensively evaluate SPLADE embeddings of 138M passages from MSMARCO v2 and report indexing time and other efficiency and effectiveness metrics.
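To make the setting concrete, below is a minimal, illustrative sketch of exact top-k retrieval over learned sparse embeddings using a term-at-a-time inverted index. This is the exact baseline that approximate methods like those studied in the paper trade accuracy against for speed; the function names and toy data are assumptions for illustration, not the paper's implementation.

```python
# Exact top-k retrieval over sparse vectors (e.g., SPLADE-style
# {term_id: weight} dicts) with a term-at-a-time inverted index.
import heapq
from collections import defaultdict

def build_inverted_index(docs):
    """docs: list of sparse vectors, each a {term_id: weight} dict.
    Returns a postings map: term_id -> [(doc_id, weight), ...]."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def top_k(index, query, k):
    """Exact top-k by sparse dot product: only postings for the
    query's nonzero terms are traversed and accumulated."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    # Highest-scoring documents first.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

At 138M documents, postings for frequent terms grow very long, so exact traversal like this becomes the latency bottleneck; that is precisely the regime where approximate methods prune, cluster, or summarize postings to keep query time bounded.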
Problem

Research questions and friction points this paper aims to address.

Sparse Text Embeddings
Efficiency and Effectiveness
Seismic Method and Graph Approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

Seismic Method
Sparse Embeddings
Large-Scale Retrieval