Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the scaling behavior of retrieval capability in large language models (LLMs) with respect to pretraining compute (FLOPs). We systematically evaluate zero-shot retrieval performance across models ranging from 125M to 7B parameters on the BEIR benchmark, covering diverse model sizes and training data scales. Results reveal a stable, predictable power-law scaling relationship between retrieval effectiveness and pretraining FLOPs. Moreover, in-context learning (ICL) ability exhibits a strong positive correlation with retrieval performance (r > 0.98), indicating that intrinsic retrieval capability is inherently scalable with compute. To our knowledge, this is the first work to empirically establish the compute scalability of LLMs’ retrieval ability. Our findings provide both theoretical grounding and empirical evidence for designing efficient, fine-tuning-free retrievers grounded in pretrained LLMs.
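The power-law relationship the summary describes can be sketched as a least-squares fit in log-log space. The data points below are purely illustrative placeholders, not results from the paper:

```python
import math

# Hypothetical (pretraining FLOPs, retrieval score) pairs -- illustrative
# only, NOT data reported by the paper.
points = [(1e19, 0.18), (1e20, 0.24), (1e21, 0.32), (1e22, 0.43)]

# A power law score = a * FLOPs^b is linear in log-log space:
# log(score) = log(a) + b * log(FLOPs), so fit by ordinary least squares.
xs = [math.log10(f) for f, _ in points]
ys = [math.log10(s) for _, s in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
log_a = my - b * mx

def predict(flops: float) -> float:
    """Retrieval score predicted by the fitted power law."""
    return 10 ** (log_a + b * math.log10(flops))
```

A positive exponent `b` corresponds to the claimed predictable improvement of retrieval with additional pretraining compute.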

📝 Abstract
How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs. We also show that In-Context Learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications this has for the development of LLM-based retrievers.
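The abstract refers to "estimated FLOPs" without stating the estimator; a common heuristic (which the paper may or may not use) is training compute ≈ 6 × parameters × tokens. Under that assumption, the endpoints of the ranges named in the abstract work out as:

```python
def estimated_flops(params: float, tokens: float) -> float:
    """Common heuristic: training compute ~= 6 * parameters * tokens.
    This is an assumption; the abstract does not specify its estimator."""
    return 6.0 * params * tokens

# Endpoints of the ranges named in the abstract.
low = estimated_flops(125e6, 1e9)   # 125M params, 1B tokens -> 7.5e17 FLOPs
high = estimated_flops(7e9, 2e12)   # 7B params, 2T tokens  -> 8.4e22 FLOPs
```

The roughly five orders of magnitude between these endpoints is what makes a power-law fit over the model grid meaningful.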
Problem

Research questions and friction points this paper is trying to address.

How does zero-shot retrieval performance scale with pretraining compute (FLOPs)?
Does retrieval ability vary predictably across model sizes and training-data scales?
Is in-context learning ability correlated with retrieval performance?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First empirical demonstration that retrieval ability follows a power law in pretraining FLOPs
Systematic zero-shot BEIR benchmarking of LLMs from 125M to 7B parameters (1B to >2T training tokens)
Evidence of a strong correlation (r > 0.98) between in-context learning and retrieval scores
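The reported r > 0.98 is a Pearson correlation; a minimal sketch of the computation, using made-up ICL and retrieval scores for illustration:

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model ICL and retrieval scores -- illustrative only.
icl = [0.21, 0.34, 0.48, 0.62]
ret = [0.18, 0.24, 0.32, 0.43]
r = pearson_r(icl, ret)
```

A value of r near 1 across the model grid is what the paper takes as evidence that ICL ability and retrieval ability grow together with compute.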