🤖 AI Summary
This study investigates how the retrieval capability of large language models (LLMs) scales with pretraining compute (FLOPs). We systematically evaluate zero-shot retrieval performance on the BEIR benchmark across models ranging from 125M to 7B parameters, pretrained on datasets of varying scale. Results reveal a stable, predictable power-law relationship between retrieval effectiveness and pretraining FLOPs. Moreover, in-context learning (ICL) ability is strongly correlated with retrieval performance (r > 0.98), indicating that intrinsic retrieval capability scales with compute. To our knowledge, this is the first work to empirically establish the compute scalability of LLMs' retrieval ability. Our findings provide both theoretical grounding and empirical evidence for designing efficient, fine-tuning-free retrievers built on pretrained LLMs.
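The power-law claim means retrieval score is roughly linear in log(FLOPs) on log-log axes, so the exponent can be recovered with an ordinary least-squares fit on the logs. A minimal sketch of such a fit, using synthetic (FLOPs, score) points rather than the paper's actual measurements:

```python
import numpy as np

# Illustrative only: synthetic (pretraining FLOPs, retrieval score) pairs,
# NOT the paper's data.
flops = np.array([1e19, 1e20, 1e21, 1e22, 1e23])
scores = np.array([0.20, 0.26, 0.33, 0.41, 0.50])

# A power law  score = a * FLOPs^b  is linear in log-log space:
#   log(score) = log(a) + b * log(FLOPs)
# so a degree-1 polyfit on the logs recovers exponent b and prefactor a.
b, log_a = np.polyfit(np.log(flops), np.log(scores), 1)
a = np.exp(log_a)

# Predicted scores from the fitted law (close to the inputs here,
# since the synthetic points were chosen to be nearly power-law).
pred = a * flops ** b
```

With a fit like this, performance at a larger compute budget can be extrapolated by evaluating `a * FLOPs ** b` at the target FLOPs, which is the practical payoff of a predictable scaling law.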
📝 Abstract
How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLMs ranging from 125 million to 7 billion parameters, pretrained on datasets ranging from 1 billion to more than 2 trillion tokens. We find that zero-shot retrieval performance on BEIR tasks scales predictably with LLM size, training duration, and estimated FLOPs. We also show that in-context learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications of these findings for the development of LLM-based retrievers.
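The reported correlation between ICL and retrieval scores is a Pearson coefficient computed across models. A minimal sketch of that computation, with hypothetical per-model scores standing in for the paper's evaluations:

```python
import numpy as np

# Illustrative only: hypothetical per-model (ICL, retrieval) scores,
# NOT the paper's measurements. Each index is one model size.
icl_scores = np.array([0.12, 0.21, 0.30, 0.42, 0.55, 0.61])
retrieval_scores = np.array([0.18, 0.25, 0.34, 0.43, 0.52, 0.58])

# Pearson correlation across models; the paper reports r > 0.98
# on its actual BEIR and ICL evaluations.
r = np.corrcoef(icl_scores, retrieval_scores)[0, 1]
```

A correlation this strong suggests ICL benchmark scores could serve as a cheap proxy when selecting a base LLM for retrieval, without running the full retrieval evaluation suite.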