Coupled Variational Reinforcement Learning for Language Model General Reasoning

📅 2025-12-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing verifier-free reinforcement learning (RL) methods sample reasoning paths conditioned only on the problem input. This decouples the reasoning paths from the final answers, causing inefficient exploration and incoherence between the two. Method: The paper proposes Coupled Variational Reinforcement Learning (CoVRL), a framework that bridges variational inference and RL by coupling prior and posterior distributions through a hybrid sampling strategy. It constructs and optimizes a composite distribution that integrates the two, which keeps the reasoning process coherent with the final answer. Like other verifier-free methods, it uses the LLM's intrinsic probability of generating the reference answer as the reward signal. Contribution/Results: On mathematical and general reasoning benchmarks, CoVRL improves performance by 12.4% over the base model and by a further 2.3% over strong state-of-the-art verifier-free RL baselines.

📝 Abstract

While reinforcement learning methods have achieved impressive progress in language model reasoning, they are constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by using the intrinsic probability that the LLM generates the reference answer as the reward signal. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.
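The verifier-free reward described in the abstract can be illustrated with a small sketch. It assumes the model has already produced per-token probabilities for the reference answer, given the question and a sampled reasoning trace. The function names, the length normalization, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def answer_log_likelihood(token_probs: Sequence[float]) -> float:
    """Sum the log-probabilities of the reference-answer tokens.

    token_probs[i] is assumed to be the model's probability of the i-th
    reference-answer token, given the question, the sampled reasoning
    trace, and the preceding answer tokens.
    """
    if not token_probs:
        raise ValueError("reference answer must contain at least one token")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("token probabilities must lie in (0, 1]")
    return sum(math.log(p) for p in token_probs)


def verifier_free_reward(token_probs: Sequence[float],
                         length_normalize: bool = True) -> float:
    """Turn the answer likelihood into a scalar reward in (0, 1].

    Length normalization (a geometric mean over tokens) is an assumption
    made here so that long answers are not penalized; the paper's exact
    reward shaping may differ.
    """
    log_likelihood = answer_log_likelihood(token_probs)
    if length_normalize:
        log_likelihood /= len(token_probs)
    return math.exp(log_likelihood)


# Hypothetical probabilities for a two-token reference answer.
coherent_trace = [0.9, 0.8]    # trace that leads naturally to the answer
incoherent_trace = [0.2, 0.1]  # trace that does not support the answer
print(verifier_free_reward(coherent_trace) > verifier_free_reward(incoherent_trace))
```

No external verifier is consulted: a trace is rewarded only to the extent that it makes the reference answer more likely under the model itself.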

Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient exploration in verifier-free RL for reasoning

Improves coherence between reasoning traces and final answers

Enhances language model general reasoning via coupled variational inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Couples prior and posterior distributions via variational inference

Uses hybrid sampling for efficient reasoning trace exploration

Optimizes composite distribution to ensure thought-answer coherence
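One way to picture the hybrid sampling and composite distribution listed above is as a two-component mixture. The sketch below assumes that the prior samples a trace from the question alone, that the posterior also sees the reference answer, and that the composite density is a weighted sum of the two. The mixture form, the `posterior_fraction` weight, and the toy samplers are illustrative assumptions, not the paper's stated construction.

```python
import math
import random
from typing import Callable, Tuple

PriorSampler = Callable[[str], str]           # trace given the question
PosteriorSampler = Callable[[str, str], str]  # trace given question and answer


def hybrid_sample(question: str,
                  reference_answer: str,
                  prior_sampler: PriorSampler,
                  posterior_sampler: PosteriorSampler,
                  posterior_fraction: float,
                  rng: random.Random) -> Tuple[str, str]:
    """Draw one reasoning trace from the assumed prior/posterior mixture.

    Returns the trace and a label saying which component produced it.
    """
    if not 0.0 <= posterior_fraction <= 1.0:
        raise ValueError("posterior_fraction must lie in [0, 1]")
    if rng.random() < posterior_fraction:
        return posterior_sampler(question, reference_answer), "posterior"
    return prior_sampler(question), "prior"


def composite_log_prob(prior_log_prob: float,
                       posterior_log_prob: float,
                       posterior_fraction: float) -> float:
    """Log-density of a trace under (1 - a) * prior + a * posterior.

    Uses the log-sum-exp trick; both inputs are assumed to be finite.
    """
    terms = []
    if posterior_fraction < 1.0:
        terms.append(math.log1p(-posterior_fraction) + prior_log_prob)
    if posterior_fraction > 0.0:
        terms.append(math.log(posterior_fraction) + posterior_log_prob)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Toy stand-ins for LLM calls (hypothetical).
def toy_prior(question: str) -> str:
    return f"reason about: {question}"


def toy_posterior(question: str, answer: str) -> str:
    return f"reason about: {question} (knowing the answer is {answer})"


rng = random.Random(0)
trace, source = hybrid_sample("2 + 2 = ?", "4", toy_prior, toy_posterior, 0.5, rng)
print(source, "->", trace)
```

Under this reading, answer-aware traces steer exploration toward reasoning that supports the reference answer, while question-only traces keep training aligned with how the model is used at test time, when no answer is available.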

🔎 Similar Papers

Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Case Study