🤖 AI Summary
This work investigates how scaling large language models (LLMs) affects sparse versus dense retrieval under a fixed computational budget. Using Llama-3 decoder-only variants (1B, 3B, 8B), the authors systematically evaluate contrastive learning (CL) and knowledge distillation (KD) fine-tuning objectives on the MSMARCO, TREC Deep Learning, and BEIR benchmarks. Three key findings emerge: (1) sparse retrieval consistently outperforms dense retrieval and is more robust to noisy supervision; (2) CL scales well with model size, whereas KD yields negligible gains from scaling; and (3) jointly training the 8B sparse model with CL and KD achieves state-of-the-art results across all benchmarks. Together, these results characterize the scaling behavior of LLM-based retrievers and suggest that sparse representations combined with CL-driven optimization are key to building scalable, compute-efficient neural retrievers.
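To make the joint objective concrete, here is a minimal sketch (not the paper's actual code) pairing an InfoNCE-style contrastive loss over in-batch negatives with a KL-divergence distillation term toward a cross-encoder teacher's scores; the function name, `temperature`, and the `kd_weight` trade-off are all hypothetical illustration choices.

```python
import torch
import torch.nn.functional as F

def joint_cl_kd_loss(q_emb, d_emb, teacher_scores, temperature=1.0, kd_weight=1.0):
    """Hypothetical joint CL+KD objective.

    q_emb:          (B, dim) query representations
    d_emb:          (B, dim) positive-passage representations
    teacher_scores: (B, B) relevance scores from a cross-encoder teacher
    """
    # Student similarity matrix: every query scored against every passage in the batch.
    student_scores = q_emb @ d_emb.T / temperature  # (B, B)

    # Contrastive loss (CL): the diagonal entries are the positive pairs.
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    cl_loss = F.cross_entropy(student_scores, labels)

    # Knowledge distillation (KD): match the teacher's score distribution.
    kd_loss = F.kl_div(
        F.log_softmax(student_scores, dim=-1),
        F.softmax(teacher_scores, dim=-1),
        reduction="batchmean",
    )
    return cl_loss + kd_weight * kd_loss
```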
📝 Abstract
Scaling large language models (LLMs) has shown great potential for improving retrieval model performance; however, previous studies have mainly focused on dense retrieval trained with contrastive loss (CL), neglecting the scaling behavior of other retrieval paradigms and optimization techniques, such as sparse retrieval and knowledge distillation (KD). In this work, we conduct a systematic comparative study of how different retrieval paradigms (sparse vs. dense) and fine-tuning objectives (CL vs. KD vs. their combination) affect retrieval performance across model scales. Using MSMARCO passages as the training dataset, decoder-only LLMs (Llama-3 series: 1B, 3B, 8B), and a fixed compute budget, we evaluate various training configurations on both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks. Our key findings reveal that: (1) Scaling behaviors emerge clearly only with CL, where larger models achieve significant performance gains, whereas KD-trained models show minimal improvement, performing similarly across the 1B, 3B, and 8B scales. (2) Sparse retrieval models consistently outperform dense retrieval across both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks, and they demonstrate greater robustness to imperfect supervision signals. (3) We successfully scale sparse retrieval models with the combination of CL and KD losses at the 8B scale, achieving state-of-the-art (SOTA) results on all evaluation sets.
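For readers unfamiliar with the two paradigms, the sketch below shows how dense and sparse representations are commonly derived from a decoder-only LLM. The last-token pooling and SPLADE-style log-saturated max pooling are standard formulations assumed for illustration; the abstract does not confirm these exact choices.

```python
import torch
import torch.nn.functional as F

def dense_representation(hidden_states, attention_mask):
    """Dense retrieval: one vector per text, here the hidden state of the
    last non-padding token (a common choice for decoder-only encoders)."""
    last_idx = attention_mask.sum(dim=1) - 1                        # (B,)
    batch = torch.arange(hidden_states.size(0), device=hidden_states.device)
    vec = hidden_states[batch, last_idx]                            # (B, dim)
    return F.normalize(vec, dim=-1)

def sparse_representation(vocab_logits, attention_mask):
    """Sparse retrieval: a vocabulary-sized vector built by log-saturating
    token logits and max-pooling over the sequence (SPLADE-style)."""
    weights = torch.log1p(F.relu(vocab_logits))                     # (B, L, |V|)
    weights = weights * attention_mask.unsqueeze(-1)                # zero out padding
    return weights.max(dim=1).values                                # (B, |V|)
```

In this picture, dense scoring is a single inner product between two vectors, while sparse scoring is an inner product between two mostly-zero vocabulary-sized vectors, which is what lets sparse models reuse inverted-index infrastructure.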