🤖 AI Summary
Large language models (LLMs) exhibit an over-optimism bias during inference—particularly in beam search—where token-level Q-value overestimation leads to error accumulation. This work establishes, for the first time, a theoretical connection between supervised fine-tuning (SFT) and offline reinforcement learning, revealing that LLMs implicitly learn token-level Q-functions and systematically overestimate them. To address this, we propose Supervised Optimism Correction (SOC): an SFT extension incorporating a Q-value auxiliary loss that enforces implicit value regularization, thereby increasing confidence in expert responses and suppressing error propagation. Evaluated on rigorous mathematical reasoning benchmarks—including GSM8K, MATH, and GAOKAO—SOC consistently improves beam search performance across multiple open-source LLMs, significantly mitigating error amplification without requiring reinforcement learning or additional inference-time computation.
📝 Abstract
In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit $Q$-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated $Q$-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction (SOC), which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, demonstrate the effectiveness of the proposed SOC with beam search across a series of open-source models.
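To make the SFT-as-offline-RL view concrete, the sketch below treats next-token logits as implicit token-level $Q$-values, so the policy is $\pi(a \mid s) = \mathrm{softmax}(Q(s, \cdot))$ and standard SFT cross-entropy trains these $Q$-values through the log of the softmax. The auxiliary term shown here (a hinge penalty on non-expert $Q$-values that exceed the expert token's $Q$-value, weighted by a hypothetical coefficient `beta`) is purely illustrative of the "suppress over-optimism toward insufficiently supervised tokens" idea, not the paper's exact regularizer:

```python
import math

def softmax(qs):
    # Numerically stable softmax over a list of logits (implicit Q-values).
    m = max(qs)
    exps = [math.exp(q - m) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]

def sft_loss(qs, expert_idx):
    # Standard SFT cross-entropy at one token position: viewing the logits
    # `qs` as implicit token-level Q-values, the policy is softmax(Q) and
    # the loss is -log pi(expert token | state).
    return -math.log(softmax(qs)[expert_idx])

def soc_loss(qs, expert_idx, beta=0.5):
    # Illustrative auxiliary term (an assumption, not the paper's exact form):
    # penalize any non-expert Q-value that exceeds the expert token's Q-value,
    # pushing down over-optimistic estimates for unsupervised continuations.
    aux = sum(max(0.0, q - qs[expert_idx])
              for i, q in enumerate(qs) if i != expert_idx)
    return sft_loss(qs, expert_idx) + beta * aux

# Toy example: the expert token (index 0) has a lower logit than a
# competing token, so the auxiliary term is active and raises the loss.
qs = [1.0, 2.0, 0.5]
print(sft_loss(qs, 0), soc_loss(qs, 0))
```

With these toy logits, the competing token's $Q$-value of 2.0 exceeds the expert's 1.0, so the auxiliary penalty adds `beta * 1.0` to the cross-entropy; minimizing the combined loss would widen the expert token's margin, which is the behavior beam search benefits from at inference time.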