Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

πŸ“… 2025-05-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the excessive computational overhead in test-time scaling (TTS) caused by reliance on external process reward models (PRMs) or Best-of-N sampling, this paper proposes a lightweight self-guided tree search framework. The method eliminates external verifiers and instead guides the search with intrinsic signals from the large language model (LLM) itself: token-level confidence scores and step-level novelty. It introduces a reinforcement learning-based fine-tuning strategy that significantly improves the reliability of the model's internal confidence estimates, reaching PRM-level filtering capability. Combined with KV cache optimization, the approach lets a 1.5B-parameter model match the mathematical reasoning performance of 32B–70B models, while delivering 8× faster inference and 4–5× lower GPU memory consumption than PRM-based methods, and roughly 50% lower KV cache usage than Best-of-N.
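The summary above describes guiding a tree search with two intrinsic signals, token-level confidence and step novelty. The sketch below illustrates one plausible way such signals could be computed and combined; the mean-token-probability aggregation, the token-overlap novelty proxy, the linear blend, and the `alpha` weight are all illustrative assumptions, not the paper's exact formulation.

```python
import math

def step_confidence(token_logprobs):
    # Mean per-token probability over a candidate reasoning step
    # (illustrative stand-in for the model's intrinsic confidence signal).
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def step_novelty(step_tokens, prior_steps):
    # Fraction of the step's tokens not seen in earlier steps
    # (a hypothetical proxy for step-level novelty).
    seen = {t for s in prior_steps for t in s}
    if not step_tokens:
        return 0.0
    return sum(1 for t in step_tokens if t not in seen) / len(step_tokens)

def select_steps(candidates, prior_steps, beam_width=2, alpha=0.7):
    # Keep the top-scoring candidate steps under a linear blend of
    # confidence and novelty; no external verifier model is consulted.
    scored = [
        (alpha * step_confidence(lps) + (1 - alpha) * step_novelty(toks, prior_steps), toks)
        for toks, lps in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [toks for _, toks in scored[:beam_width]]
```

Because both signals come for free from the decoder's own token log-probabilities, this kind of scoring avoids the forward passes through a separate PRM that dominate the cost of verifier-guided search.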

πŸ“ Abstract
Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals: token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in Test-Time Scaling for LLMs
Enhancing reasoning without external Process Reward Models
Improving efficiency and accuracy in mathematical reasoning benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight tree search with intrinsic signals
Reinforcement learning for confidence reliability
Reduces GPU memory usage by up to 10x and KV cache usage by ~50%
πŸ”Ž Similar Papers
No similar papers found.