🤖 AI Summary
This study investigates how the retrieval capability of large language models (LLMs) scales with pretraining compute (FLOPs). We systematically evaluate zero-shot retrieval performance on the BEIR benchmark across models ranging from 125M to 7B parameters, pretrained on datasets of varying scale. Results reveal a stable, predictable power-law relationship between retrieval effectiveness and pretraining FLOPs. Moreover, in-context learning (ICL) ability is strongly correlated with retrieval performance (r > 0.98), indicating that intrinsic retrieval capability scales with compute. To our knowledge, this is the first work to empirically establish the compute scalability of LLMs' retrieval ability. Our findings provide both theoretical grounding and empirical evidence for designing efficient, fine-tuning-free retrievers built on pretrained LLMs.
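The power-law claim means retrieval score is roughly linear in log(FLOPs) on log-log axes, so the exponent can be recovered with an ordinary least-squares fit on the logs. A minimal sketch of such a fit, using synthetic (FLOPs, score) points rather than the paper's actual measurements:

```python
import numpy as np

# Illustrative only: synthetic (pretraining FLOPs, retrieval score) pairs,
# NOT the paper's data.
flops = np.array([1e19, 1e20, 1e21, 1e22, 1e23])
scores = np.array([0.20, 0.26, 0.33, 0.41, 0.50])

# A power law  score = a * FLOPs^b  is linear in log-log space:
#   log(score) = log(a) + b * log(FLOPs)
# so a degree-1 polyfit on the logs recovers exponent b and prefactor a.
b, log_a = np.polyfit(np.log(flops), np.log(scores), 1)
a = np.exp(log_a)

# Predicted scores from the fitted law (close to the inputs here,
# since the synthetic points were chosen to be nearly power-law).
pred = a * flops ** b
```

With a fit like this, performance at a larger compute budget can be extrapolated by evaluating `a * FLOPs ** b` at the target FLOPs, which is the practical payoff of a predictable scaling law.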
📝 Abstract
How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLMs ranging from 125 million to 7 billion parameters, pretrained on datasets ranging from 1 billion to more than 2 trillion tokens. We find that zero-shot retrieval performance on BEIR tasks scales predictably with LLM size, training duration, and estimated FLOPs. We also show that in-context learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications of these findings for the development of LLM-based retrievers.
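The reported correlation between ICL and retrieval scores is a Pearson coefficient computed across models. A minimal sketch of that computation, with hypothetical per-model scores standing in for the paper's evaluations:

```python
import numpy as np

# Illustrative only: hypothetical per-model (ICL, retrieval) scores,
# NOT the paper's measurements. Each index is one model size.
icl_scores = np.array([0.12, 0.21, 0.30, 0.42, 0.55, 0.61])
retrieval_scores = np.array([0.18, 0.25, 0.34, 0.43, 0.52, 0.58])

# Pearson correlation across models; the paper reports r > 0.98
# on its actual BEIR and ICL evaluations.
r = np.corrcoef(icl_scores, retrieval_scores)[0, 1]
```

A correlation this strong suggests ICL benchmark scores could serve as a cheap proxy when selecting a base LLM for retrieval, without running the full retrieval evaluation suite.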