RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) methods for enhancing large language models’ (LLMs) reasoning capabilities suffer from inefficient sampling of high-quality reasoning trajectories, leading to reinforcement of suboptimal reasoning patterns. Method: We propose RAVR—a novel framework that, for the first time, incorporates explanatory reconstruction from cognitive science into LLM reasoning. RAVR employs answer-conditioned reasoning generation to substantially reduce open-ended exploration overhead; introduces a variational surrogate mechanism for end-to-end training; and integrates systematic backtracking with reasoning-behavior regularization. We theoretically prove that answer conditioning improves the expected utility of reasoning paths. Contribution/Results: Experiments demonstrate that RAVR significantly outperforms strong baselines on both general and mathematical reasoning benchmarks. It effectively mitigates reasoning hesitation, enhances conclusion integration, and promotes problem-specific strategy generation—achieving state-of-the-art performance while improving reasoning consistency and interpretability.

📝 Abstract
Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but it critically depends on a key prerequisite: the LLM must already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning paths can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that "Why is this the answer?" is often an easier question than "What is the answer?", as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction: systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on the answer provably increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
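The abstract's central claim can be illustrated with a toy numeric sketch (this is not the paper's implementation; the three paths and their probabilities are invented for illustration). Under a question-only policy, a familiar but wrong reasoning pattern dominates the sampling distribution; reweighting each path by its probability of reaching the reference answer, as in a posterior p(z | q, a) ∝ p(z | q) · p(a | z, q), shifts mass toward the correct but rarely sampled path and raises the expected utility of what gets sampled:

```python
# Toy illustration of answer-conditioned sampling (hypothetical numbers).
# Each path: (prior prob under the question-only policy,
#             prob the path yields the reference answer,
#             utility of the path)
paths = [
    (0.70, 0.05, 0.0),  # familiar but suboptimal reasoning pattern
    (0.25, 0.40, 0.5),  # partially correct reasoning
    (0.05, 0.95, 1.0),  # correct reasoning, rarely sampled
]

def expected_utility(weights, utilities):
    """Normalized expected utility under the given (unnormalized) weights."""
    total = sum(weights)
    return sum(w * u for w, u in zip(weights, utilities)) / total

prior_w = [p for p, _, _ in paths]
# Answer-conditioned posterior: p(z | q, a) ∝ p(z | q) * p(a | z, q)
post_w = [p * pa for p, pa, _ in paths]
utils = [u for _, _, u in paths]

print(expected_utility(prior_w, utils))  # 0.175: dominated by the wrong path
print(expected_utility(post_w, utils))   # ~0.534: mass shifts to correct paths
```

In this sketch, conditioning on the answer roughly triples the expected utility of a sampled path, which is the sense in which the paper argues answer-conditioning turns a hard-to-sample task into a learnable one.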
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning for tasks beyond current competence
Using answer-conditioned reasoning to improve path sampling
Transforming intractable problems into learnable reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-answer-guided variational reasoning framework
Conditioning reasoning paths on known answers
Transforms intractable problems into learnable ones
Tianqianjin Lin
Zhejiang University
Xi Zhao
Alibaba Group
Xingyao Zhang
Microsoft
Rujiao Long
Tsinghua University, Alibaba
OCRVLM
Yi Xu
Alibaba Group
Zhuoren Jiang
Zhejiang University
Information Science & Library Science · Computational Social Science · NLP · GNN · IR
Wenbo Su
Alibaba Group
Bo Zheng
Alibaba Group