🤖 AI Summary
Chain-of-thought (CoT) generation in large language models lacks explicit reward signals, hindering direct optimization of reasoning quality.
Method: This paper proposes a reward-free CoT optimization paradigm that treats CoT as a latent variable and constructs a differentiable, optimizable variational lower bound via Jensen’s inequality—eliminating the need for parameterized posteriors or external reward models.
Contribution/Results: The approach unifies supervised fine-tuning and online reinforcement learning within a single objective. Evaluated on mathematical reasoning tasks, it matches the performance of policy-gradient methods augmented with external rewards, demonstrating effectiveness, simplicity, and scalability for general reasoning optimization. To our knowledge, this is the first Jensen-based, reward-free CoT optimization framework, offering a novel pathway for modeling implicit reasoning processes through variational inference.
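One plausible reading of the Jensen step described above (the notation below is a reconstruction from the summary, not necessarily the paper's own): treating the chain-of-thought $z$ as a latent variable sampled from the model's own policy lets Jensen's inequality bound the answer log-likelihood directly, with no parametric posterior.

```latex
% z: chain-of-thought (latent), x: question, y: answer,
% \pi_\theta(z \mid x): the model's own CoT distribution (used as the prior).
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \bigl[\, p_\theta(y \mid x, z) \,\bigr]
  \;\ge\; \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \bigl[\, \log p_\theta(y \mid x, z) \,\bigr].
```

Because the expectation is taken under the model's own CoT distribution rather than a learned approximate posterior, the bound is sampleable with the same machinery used for ordinary generation, which is what makes it "reward-free": the answer log-likelihood $\log p_\theta(y \mid x, z)$ plays the role of the reward.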
📝 Abstract
We propose a way to optimize chain-of-thought with reinforcement learning, but without an external reward function. Our algorithm relies on viewing the chain-of-thought as a latent variable in a probabilistic inference problem. In contrast to the full evidence lower bound, we propose to apply a much simpler Jensen's lower bound, which yields tractable objectives with simple algorithmic components (e.g., without the need for a parametric approximate posterior), making it more conducive to modern large-scale training. The lower-bound approach naturally interpolates between other methods such as supervised fine-tuning and online reinforcement learning, whose practical trade-offs we illustrate. Finally, we show that on mathematical reasoning problems, optimizing with Jensen's lower bound is as effective as policy gradient with an external reward. Taken together, our results serve as a proof of concept for this new algorithmic paradigm's potential in more general applications.
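To make the "simple algorithmic components" claim concrete, here is a minimal toy sketch (my own construction, not the paper's implementation) of optimizing a Jensen-style lower bound with a score-function (REINFORCE-style) gradient, where the answer log-likelihood acts as the reward. The binary latent, its policy, and the fixed answer likelihood are all hypothetical stand-ins for the CoT, the model's CoT distribution, and the answer model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical, for illustration only):
# a binary latent "chain of thought" z with policy pi_theta(z) = softmax(theta),
# and a fixed answer likelihood p(y|z).
theta = np.array([0.3, -0.2])
log_p_y_given_z = np.log(np.array([0.9, 0.2]))  # p(y|z=0)=0.9, p(y|z=1)=0.2

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)

# Jensen's inequality: log p(y) = log E_z[p(y|z)] >= E_z[log p(y|z)].
log_p_y = np.log(pi @ np.exp(log_p_y_given_z))
jensen_bound = pi @ log_p_y_given_z
assert jensen_bound <= log_p_y

# Exact gradient of the bound via the score function:
# grad J = E_z[ grad_theta log pi(z) * log p(y|z) ].
grad_log_pi = np.eye(2) - pi  # row z holds d log pi(z) / d theta
exact_grad = pi @ (grad_log_pi * log_p_y_given_z[:, None])

# Monte Carlo (REINFORCE-style) estimate of the same gradient:
# sample latents from the policy, weight score by the answer log-likelihood.
zs = rng.choice(2, size=200_000, p=pi)
mc_grad = (grad_log_pi[zs] * log_p_y_given_z[zs, None]).mean(axis=0)
```

The estimator needs only samples from the model's own latent distribution and the answer log-likelihood, which is the sense in which the objective avoids both a learned posterior and an external reward model; in practice one would add standard variance-reduction baselines.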