🤖 AI Summary
Large language models (LLMs) exhibit an over-optimism bias during inference—particularly in beam search—where token-level Q-value overestimation leads to error accumulation. This work establishes, for the first time, a theoretical connection between supervised fine-tuning (SFT) and offline reinforcement learning, revealing that LLMs implicitly learn token-level Q-functions and systematically overestimate them. To address this, we propose Supervised Optimism Correction (SOC): an SFT extension incorporating a Q-value auxiliary loss that enforces implicit value regularization, thereby increasing confidence in expert responses and suppressing error propagation. Evaluated on rigorous mathematical reasoning benchmarks—including GSM8K, MATH, and GAOKAO—SOC consistently improves beam search performance across multiple open-source LLMs, significantly mitigating error amplification without requiring reinforcement learning or additional inference-time computation.
📝 Abstract
In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit $Q$-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated $Q$-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction (SOC), which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, demonstrate the effectiveness of the proposed SOC with beam search across a series of open-source models.
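To make the SFT-as-offline-RL view concrete, the sketch below treats next-token logits as implicit token-level $Q$-values, so the policy is $\pi(a \mid s) = \mathrm{softmax}(Q(s, \cdot))$ and standard SFT cross-entropy trains these $Q$-values through the log of the softmax. The auxiliary term shown here (a hinge penalty on non-expert $Q$-values that exceed the expert token's $Q$-value, weighted by a hypothetical coefficient `beta`) is purely illustrative of the "suppress over-optimism toward insufficiently supervised tokens" idea, not the paper's exact regularizer:

```python
import math

def softmax(qs):
    # Numerically stable softmax over a list of logits (implicit Q-values).
    m = max(qs)
    exps = [math.exp(q - m) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]

def sft_loss(qs, expert_idx):
    # Standard SFT cross-entropy at one token position: viewing the logits
    # `qs` as implicit token-level Q-values, the policy is softmax(Q) and
    # the loss is -log pi(expert token | state).
    return -math.log(softmax(qs)[expert_idx])

def soc_loss(qs, expert_idx, beta=0.5):
    # Illustrative auxiliary term (an assumption, not the paper's exact form):
    # penalize any non-expert Q-value that exceeds the expert token's Q-value,
    # pushing down over-optimistic estimates for unsupervised continuations.
    aux = sum(max(0.0, q - qs[expert_idx])
              for i, q in enumerate(qs) if i != expert_idx)
    return sft_loss(qs, expert_idx) + beta * aux

# Toy example: the expert token (index 0) has a lower logit than a
# competing token, so the auxiliary term is active and raises the loss.
qs = [1.0, 2.0, 0.5]
print(sft_loss(qs, 0), soc_loss(qs, 0))
```

With these toy logits, the competing token's $Q$-value of 2.0 exceeds the expert's 1.0, so the auxiliary penalty adds `beta * 1.0` to the cross-entropy; minimizing the combined loss would widen the expert token's margin, which is the behavior beam search benefits from at inference time.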