AI Summary
This work addresses the inefficiency of large language models in complex reasoning, often caused by "overthinking," and the limitations of existing reinforcement learning approaches, which stem from suboptimal reward design and inefficient sampling. The authors propose the VPG-EA framework, which formulates efficient reasoning as a variational inference problem. It uses a parameter-shared dual-stream architecture to learn both a posterior distribution guided by reference answers and a prior policy. Cross-view evaluation identifies pseudo-efficient reasoning paths, and variational distillation transfers efficient patterns unidirectionally from the posterior to the prior. The approach rests on an efficiency-aware evidence lower bound motivated by cognitive science principles. Experiments on DeepSeek-R1-Distill-Qwen-1.5B and 7B show that the composite efficiency metric ε³ increases by 8.73% and 12.37% over the strongest baseline, respectively.
Abstract
Although large language models rely on chain-of-thought for complex reasoning, the overthinking phenomenon severely degrades inference efficiency. Existing reinforcement learning methods compress reasoning chains by designing elaborate reward functions, which leaves high-quality samples extremely sparse in the exploration space and creates a sampling bottleneck for the prior policy. Inspired by cognitive science, we theoretically prove that a posterior distribution guided by reference answers achieves higher expected utility than the prior distribution and can therefore break through the sampling bottleneck for high-quality samples. However, the posterior distribution is unavailable during inference. To this end, we formalize efficient reasoning as a variational inference problem and introduce an efficiency-aware evidence lower bound as the theoretical foundation. Based on this, we propose the VPG-EA framework. It adopts a parameter-shared dual-stream architecture to instantiate both the posterior distribution and the prior policy; after filtering out pseudo-efficient paths via cross-view evaluation, it unidirectionally transfers the posterior's efficient patterns to the prior policy through variational distillation. Experiments on DeepSeek-R1-Distill-Qwen at the 1.5B and 7B scales demonstrate that VPG-EA improves the comprehensive efficiency metric ε³ by 8.73% and 12.37%, respectively, over the strongest baseline at each model size.
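For orientation, the textbook conditional evidence lower bound that variational formulations of this kind typically start from can be written as below. This is the standard form only, not the paper's efficiency-aware bound, whose exact terms the abstract does not give. The notation is assumed here for illustration: x is the question, y the reference answer, z the reasoning chain, q_φ the answer-guided posterior, and p_θ the prior policy.

```latex
% Textbook conditional ELBO. Notation is assumed for illustration;
% this is NOT the paper's efficiency-aware bound.
\begin{equation}
\log p_\theta(y \mid x)
\;\ge\;
\mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log p_\theta(y \mid x, z)\bigr]
- D_{\mathrm{KL}}\bigl(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\bigr)
\end{equation}
```

Under this reading, the posterior sees the reference answer while the prior does not, which is why only the prior is usable at inference time. In a parameter-shared setup such as the one described, both distributions would presumably be the same model conditioned on different inputs.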