ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning (RL) post-training methods rely heavily on the model’s initial capacity to generate high-quality positive samples, rendering them ineffective for early-stage training or high-difficulty reasoning tasks where valid solution paths are hard to discover. Method: We propose Self-Explanation Policy Optimization (ExPO), a self-explaining RL framework that conditions reasoning-trajectory generation on ground-truth answers—eliminating dependence on external expert demonstrations—and employs a result validator to filter effective trajectories, thereby dynamically refining the policy distribution. Contribution/Results: ExPO overcomes the initial-performance bottleneck and avoids the ineffectiveness of expert demonstrations. On challenging benchmarks such as MATH level-5, it substantially outperforms state-of-the-art demonstration-driven methods, especially when the base model’s initial reasoning capability is weak. Moreover, ExPO achieves superior exploration efficiency and faster convergence, demonstrating improved robustness and scalability in complex reasoning domains.

📝 Abstract
Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model's initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model's likelihood of predicting the correct answer. Based on these insights, we propose **Self-Explanation Policy Optimization (ExPO)**—a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.
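The core loop the abstract describes—generate trajectories conditioned on the ground-truth answer, then keep only those an outcome validator accepts—can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names (`expo_positive_samples`, `extract_answer`, `toy_policy`) and the prompt format are assumptions for the sake of the example.

```python
import random

def extract_answer(trajectory: str) -> str:
    """Pull the final answer from a trajectory ending in '... Answer: X'.
    (Hypothetical format; the paper's validator is outcome-based.)"""
    return trajectory.rsplit("Answer:", 1)[-1].strip()

def expo_positive_samples(generate_fn, problem, ground_truth, n_samples=8):
    """Sample self-explanation trajectories and filter with a result validator."""
    # Condition generation on the ground-truth answer: the model explains
    # how to reach the known-correct answer instead of solving cold.
    prompt = (f"Problem: {problem}\n"
              f"The correct answer is {ground_truth}. "
              f"Explain step by step how to reach it.")
    candidates = [generate_fn(prompt) for _ in range(n_samples)]
    # Outcome-based filtering: keep only trajectories whose final answer
    # matches the ground truth; these become the positive samples for RL.
    return [c for c in candidates if extract_answer(c) == ground_truth]

# Toy stand-in for a policy model that sometimes lands on the right answer.
def toy_policy(prompt, _rng=random.Random(0)):
    final = "4" if _rng.random() < 0.5 else "5"
    return f"Reasoning steps... Answer: {final}"

kept = expo_positive_samples(toy_policy, "2 + 2 = ?", "4", n_samples=8)
```

Because the kept trajectories are sampled from the current policy itself, they satisfy property (1) by construction; the validator enforces property (2) at the outcome level.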
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning in models with self-exploration
Overcoming initial failure in hard reasoning tasks
Generating effective positive samples for learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Explanation Policy Optimization for reasoning
Generates likely and correct answer-aligned samples
Improves learning efficiency on challenging reasoning tasks