🤖 AI Summary
Confronting the performance plateau of large language models (LLMs) caused by the exhaustion of high-quality training data, we investigate inference-time scaling, a paradigm that enhances model capabilities without retraining. We propose a dual-axis taxonomy: the *output dimension* (chain-of-thought reasoning, tree/beam search, and decoding strategies) and the *input dimension* (query expansion, reranking, retrieval-augmented generation (RAG), and multimodal/long-context enhancements). This framework unifies existing techniques along three optimization pathways (reasoning refinement, retrieval augmentation, and search-space control) that yield consistent improvements on downstream tasks. Our core contribution is the first structured, extensible taxonomy of inference-time scaling, enabling efficient, low-overhead, deployment-level optimizations that sidestep data scarcity and training bottlenecks.
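The output-dimension idea can be illustrated with self-consistency decoding: sample several reasoning chains from the same model and take a majority vote over their final answers, trading extra inference compute for accuracy. The sketch below is a minimal illustration, not any specific system's API; `sample_chains` is a hypothetical stand-in that fakes the answers an LLM sampler would return.

```python
from collections import Counter

def sample_chains(question: str, n: int) -> list[str]:
    """Stand-in for an LLM sampler: each call would draw one
    chain-of-thought and return its final answer. Here we fake
    n sampled answers to keep the sketch self-contained."""
    # In practice: [llm.sample(question) for _ in range(n)]
    return ["42", "41", "42", "42", "17", "42", "41"][:n]

def self_consistency(question: str, n: int = 7) -> str:
    """Majority vote over n independently sampled answers."""
    votes = Counter(sample_chains(question, n))
    answer, _count = votes.most_common(1)[0]
    return answer

print(self_consistency("What is 6 * 7?"))  # -> 42
```

More samples generally improve the vote's reliability, which is exactly the inference-time compute/quality trade-off the taxonomy's output axis describes.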
📝 Abstract
The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing supply of high-quality training data introduces a fundamental bottleneck, shifting research attention toward inference-time scaling. This paradigm spends additional computation at deployment time to substantially improve LLM performance on downstream tasks without costly retraining. This review systematically surveys the techniques driving this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: output-focused and input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensembling. Input-focused techniques fall primarily into few-shot prompting and RAG, with RAG as the central focus. The RAG section is further organized through a structured examination of query expansion, data, retrieval and reranking, LLM generation methods, and multi-modal RAG.
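The input-focused RAG pipeline the abstract outlines (retrieve relevant passages, then condition generation on them) can be sketched minimally. The toy lexical retriever and prompt builder below are illustrative assumptions, not the methods surveyed; real systems would use dense retrievers and a reranker before the generation step.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: rank passages by word overlap with the query.
    A real pipeline would use dense embeddings plus a reranker."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Pack retrieved passages into the LLM's input context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "inference-time scaling adds compute at deployment",
    "bananas are a good source of potassium",
]
passages = retrieve("what is inference-time scaling", corpus, k=1)
print(build_prompt("what is inference-time scaling", passages))
```

Query expansion, reranking, and multi-modal retrieval all slot into the `retrieve` stage of this loop, which is why the RAG section of the review is organized around those components.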