Exploring Training and Inference Scaling Laws in Generative Retrieval

📅 2025-03-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Generative retrieval—where large language models (LLMs) autoregressively generate document identifiers—lacks a clear understanding of how model size, training data volume, and inference compute jointly scale. Method: We conduct the first systematic study of this triadic scaling relationship, introducing a continuous evaluation metric that integrates contrastive entropy and generative loss to enable robust, architecture-agnostic comparisons. Using a unified framework across LLaMA (decoder-only) and T5 (encoder-decoder), we combine n-gram analysis with large-scale ablation experiments. Contributions/Results: All three resources—model scale, data volume, and inference compute—exhibit strong positive correlations with retrieval performance. LLaMA consistently outperforms T5 across multiple configurations. Crucially, n-gram modeling adheres strictly to power-law scaling behavior, offering a novel, interpretable foundation for generative retrieval. Our findings establish principled guidelines for resource-aware model design and deployment in generative retrieval systems.

📝 Abstract
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models (LLMs) generate target documents directly from a query. As a novel paradigm, the mechanisms that underpin its performance and scalability remain largely unexplored. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance. We propose a novel evaluation metric inspired by contrastive entropy and generation loss, providing a continuous performance signal that enables robust comparisons across diverse generative retrieval methods. Our experiments show that n-gram-based methods align strongly with training and inference scaling laws. We find that increasing model size, training data scale, and inference-time compute all contribute to improved performance, highlighting the complementary roles of these factors in enhancing generative retrieval. Across these settings, LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval. Our findings underscore that model sizes, data availability, and inference computation interact to unlock the full potential of generative retrieval, offering new insights for designing and optimizing future systems.
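The abstract's central quantitative claim is that n-gram-based generative retrieval follows power-law scaling. A power law of the form L = a · N^(−b) can be fit by linear regression in log-log space; the sketch below illustrates the idea on hypothetical data (the paper's actual fitting procedure, variables, and constants are not specified here):

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by least squares in log-log space."""
    log_n = np.log(sizes)
    log_l = np.log(losses)
    # In log space the power law is a line: log L = log a - b * log N
    slope, intercept = np.polyfit(log_n, log_l, 1)
    return np.exp(intercept), -slope  # (a, b)

# Hypothetical model sizes and losses lying on L = 10 * N**-0.3
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 10.0 * sizes ** -0.3
a, b = fit_power_law(sizes, losses)
```

On noiseless synthetic data the fit recovers the generating exponent exactly; with real measurements one would report the residual of the log-log fit as a goodness-of-fit check.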
Problem

Research questions and friction points this paper is trying to address.

Investigates scaling laws in generative retrieval performance
Proposes new metric for comparing retrieval methods
Explores impact of model size and compute
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic investigation of training and inference scaling laws
Novel evaluation measure using contrastive entropy and generation loss
Larger decoder-only models outperform others in generative retrieval
Hongru Cai
National University of Singapore
Information Retrieval · Personalization · Language Agents
Yongqi Li
The Hong Kong Polytechnic University, Hong Kong SAR, China
Ruifeng Yuan
Ph.D. from The Hong Kong Polytechnic University
Natural Language Processing
Wenjie Wang
University of Science and Technology of China, Hefei, China
Zhen Zhang
Nanyang Technological University, Singapore
Wenjie Li
The Hong Kong Polytechnic University, Hong Kong SAR, China
Tat-Seng Chua
National University of Singapore, Singapore