Semi-Parametric Retrieval via Binary Bag-of-Tokens Index

📅 2024-05-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
To address overlooked challenges in information retrieval, including low indexing efficiency, high indexing cost, and stale indexes, this paper proposes SiDR (SemI-parametric Disentangled Retrieval), a bi-encoder framework that decouples the retrieval index from neural parameters. SiDR supports two indexing modes: (i) embedding-based indexing, as in existing neural retrievers, and (ii) a non-parametric binary bag-of-tokens (tokenization) index, which achieves BM25-like indexing complexity with significantly better effectiveness. It further introduces a late parametric re-ranking mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines. Evaluated across 16 retrieval benchmarks, SiDR outperforms both neural and term-based baselines under the same indexing workload. With an embedding-based index, it exceeds conventional neural retrievers at similar training complexity; with a tokenization-based index, it drastically reduces indexing cost and time and consistently outperforms BM25 on all in-domain datasets.

📝 Abstract
Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) When using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) When using a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the complexity of traditional term-based retrieval, while consistently outperforming BM25 on all in-domain datasets; (iii) Additionally, we introduce a late parametric mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.
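The tokenization index described above can be illustrated with a minimal sketch. It assumes that each passage is stored as a binary vector over the vocabulary (1 if a token occurs, 0 otherwise) and that a query is scored by an inner product between a weighted query vector and those binary rows. The whitespace tokenizer, the function names, and the hand-written `query_weights` (standing in for the output of a neural query encoder) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_vocab(passages):
    """Map each distinct token to a column id (whitespace tokenizer as a stand-in)."""
    vocab = {}
    for text in passages:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def binary_token_index(passages, vocab):
    """One row per passage; an entry is 1 if the token occurs, else 0. No model is run."""
    index = np.zeros((len(passages), len(vocab)), dtype=np.uint8)
    for row, text in enumerate(passages):
        for tok in text.lower().split():
            index[row, vocab[tok]] = 1
    return index

def search(query_weights, index, vocab, k=2):
    """Score each passage by the inner product of the weighted query vector and its binary row."""
    q = np.zeros(len(vocab), dtype=np.float32)
    for tok, weight in query_weights.items():
        if tok in vocab:  # tokens outside the vocabulary contribute nothing
            q[vocab[tok]] = weight
    scores = index.astype(np.float32) @ q
    order = np.argsort(-scores, kind="stable")[:k]
    return [(int(i), float(scores[i])) for i in order]

passages = [
    "bm25 is a term based retrieval function",
    "dense retrievers encode passages into embeddings",
    "a binary token index stores which tokens occur",
]
vocab = build_vocab(passages)
index = binary_token_index(passages, vocab)

# Hypothetical weights, standing in for a neural query encoder's output.
query_weights = {"token": 1.5, "index": 1.0, "retrieval": 0.4}
results = search(query_weights, index, vocab)
print(results)
```

Building the index only requires tokenizing each passage, which is why its cost is comparable to term-based indexing; in this sketch, any learned component would sit on the query side only.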
Problem

Research questions and friction points this paper is trying to address.

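Neural retrieval indexes are tied to model parameters, which makes indexing costly and slow to refresh
Indexing efficiency, cost-effectiveness, and freshness are increasingly critical yet often overlooked
Term-based retrieval such as BM25 indexes cheaply but is less effective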
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples retrieval index from neural parameters
Supports non-parametric tokenization index for search
Introduces late parametric mechanism for efficiency
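The abstract does not spell out how the late parametric mechanism works. One plausible reading, sketched below purely as an assumption, is a two-stage search: a cheap token-level first stage selects `m` candidates without any precomputed embeddings, and only those candidates are embedded at query time and re-ranked. The `toy_embed` function is a hypothetical stand-in for a neural encoder, and every name here is illustrative.

```python
import numpy as np

def toy_embed(text, dim=26):
    """Stand-in for a neural encoder: a normalized letter-frequency vector."""
    vec = np.zeros(dim, dtype=np.float64)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def first_stage(query, passages, m):
    """Cheap candidate search: count query tokens present in each passage's token set."""
    q_tokens = set(query.lower().split())
    overlap = [len(q_tokens & set(p.lower().split())) for p in passages]
    order = sorted(range(len(passages)), key=lambda i: -overlap[i])
    return order[:m]

def late_parametric_search(query, passages, m=2, k=1):
    """Embed only the m first-stage candidates at query time, then re-rank them."""
    candidates = first_stage(query, passages, m)
    q_vec = toy_embed(query)
    scored = [(i, float(toy_embed(passages[i]) @ q_vec)) for i in candidates]
    scored.sort(key=lambda pair: -pair[1])
    return candidates, scored[:k]

passages = [
    "bm25 is a term based retrieval function",
    "dense retrievers encode passages into embeddings",
    "a binary token index stores which tokens occur",
]
candidates, top = late_parametric_search("binary token index", passages, m=2, k=1)
print(candidates, top)
```

Under this reading, no passage embedding has to exist before the first query arrives, so index preparation costs no more than tokenization, and the encoder runs on `m` passages per query instead of the whole corpus.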