Learning to chain-of-thought with Jensen's evidence lower bound

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Chain-of-thought (CoT) generation in large language models lacks explicit reward signals, hindering direct optimization of reasoning quality. Method: This paper proposes a reward-free CoT optimization paradigm that treats CoT as a latent variable and constructs a differentiable, optimizable variational lower bound via Jensen’s inequality—eliminating the need for parameterized posteriors or external reward models. Contribution/Results: The approach unifies supervised fine-tuning and online reinforcement learning within a single objective. Evaluated on mathematical reasoning tasks, it matches the performance of policy-gradient methods augmented with external rewards, demonstrating effectiveness, simplicity, and scalability for general reasoning optimization. To our knowledge, this is the first Jensen-based, reward-free CoT optimization framework, offering a novel pathway for modeling implicit reasoning processes through variational inference.

📝 Abstract
We propose a way to optimize chain-of-thought with reinforcement learning, but without an external reward function. Our algorithm relies on viewing chain-of-thought as a latent variable within a probabilistic inference problem. In contrast to the full evidence lower bound, we propose to apply a much simpler Jensen's lower bound, which yields tractable objectives with simple algorithmic components (e.g., without the need for a parametric approximate posterior), making it more conducive to modern large-scale training. The lower-bound approach naturally interpolates between other methods such as supervised fine-tuning and online reinforcement learning, whose practical trade-offs we illustrate. Finally, we show that on mathematical reasoning problems, optimizing with Jensen's lower bound is as effective as policy gradient with an external reward. Taken together, our results serve as a proof of concept for this new algorithmic paradigm's potential in more general applications.
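The core inequality behind the method can be checked with toy numbers (hypothetical values for illustration only, not from the paper): for a latent chain-of-thought z drawn from the model's own distribution pi(z|x), Jensen's inequality gives log p(y|x) = log E_z[p(y|x,z)] >= E_z[log p(y|x,z)], so the right-hand side is a tractable lower bound that needs no learned posterior.

```python
import math

# Toy setup (hypothetical numbers, illustration only):
# a prior over two candidate chains-of-thought z, and the
# likelihood p(y | x, z) of the correct answer under each.
p_z = [0.7, 0.3]            # pi(z | x): model's distribution over CoTs
p_y_given_z = [0.9, 0.2]    # p(y | x, z) for the correct answer y

# Exact marginal log-likelihood: log p(y|x) = log sum_z pi(z|x) p(y|x,z)
log_p_y = math.log(sum(pz * py for pz, py in zip(p_z, p_y_given_z)))

# Jensen's lower bound: E_{z ~ pi}[log p(y|x,z)] <= log p(y|x)
jensen = sum(pz * math.log(py) for pz, py in zip(p_z, p_y_given_z))

print(jensen, log_p_y)  # the bound sits strictly below the marginal
```

The gap between the two quantities shrinks as the policy concentrates on chains-of-thought that make the correct answer likely, which is what optimizing the bound encourages.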
Problem

Research questions and friction points this paper is trying to address.

Optimize chain-of-thought without external reward function
Apply Jensen's lower bound for tractable inference objectives
Compare effectiveness with policy gradient on mathematical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimize chain-of-thought via reinforcement learning
Use Jensen's lower bound for tractable objectives
Interpolate supervised fine-tuning and online RL
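The points above can be sketched with a minimal score-function (REINFORCE-style) estimator, assuming a toy softmax policy over two candidate chains-of-thought; all numbers and interfaces here are hypothetical, not taken from the paper. The key feature is that the "reward" is the model's own log p(y|x,z), so no external reward model appears.

```python
import math
import random

random.seed(0)

# Hypothetical toy policy: softmax over logits for two candidate CoTs z.
logits = [0.0, 0.0]
p_y_given_z = [0.9, 0.2]   # p(y | x, z): likelihood of the correct answer


def softmax(ls):
    m = max(ls)
    exps = [math.exp(l - m) for l in ls]
    total = sum(exps)
    return [e / total for e in exps]


def jensen_bound_grad(logits, num_samples=10_000):
    """Monte Carlo score-function gradient of the Jensen bound
    E_{z ~ pi}[log p(y|x,z)] with respect to the policy logits.
    The reward is log p(y|x,z): no external reward model is used."""
    probs = softmax(logits)
    grad = [0.0, 0.0]
    for _ in range(num_samples):
        z = random.choices([0, 1], weights=probs)[0]
        reward = math.log(p_y_given_z[z])
        # d/d logit_k of log pi(z) = 1[k == z] - probs[k]
        for k in range(2):
            grad[k] += reward * ((1.0 if k == z else 0.0) - probs[k])
    return [g / num_samples for g in grad]


g = jensen_bound_grad(logits)
```

Gradient ascent on this estimate raises the logit of the chain-of-thought under which the correct answer is more likely, which is how the objective connects to online RL; with a single on-policy sample and no baseline, the same surrogate reduces toward supervised-fine-tuning-like updates on sampled CoTs.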
Yunhao Tang
Member of technical staff @ Anthropic
Reinforcement Learning
Sid Wang
Meta GenAI
Rémi Munos
Meta FAIR