🤖 AI Summary
This work addresses the challenges of sparse rewards in reinforcement learning for reasoning tasks, which often lead to ambiguous credit assignment and exploration collapse, as well as the susceptibility of conventional policy distillation to noisy gradients from low-quality trajectories. To overcome these limitations, the authors propose a knowledge-enhanced preference optimization framework that introduces a novel quality-gating mechanism to perform policy distillation exclusively on high-quality trajectories. Furthermore, the method integrates teacher prompting into a knowledge-augmented rejection sampling strategy to guide efficient exploration. This approach significantly improves training stability, reasoning coherence, and out-of-distribution generalization, achieving state-of-the-art performance on a single-source generalization benchmark for medical visual question answering compared to existing reinforcement learning and distillation baselines.
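The quality-gating idea in the summary can be sketched as a loss that applies the dense teacher-distillation term only to trajectories clearing a reward threshold. This is a minimal illustrative sketch, not the paper's exact objective: the `Trajectory` container, the REINFORCE-style policy term, the threshold rule, and all names here are assumptions.

```python
# Hypothetical sketch of a quality-gated on-policy distillation loss.
# The gating rule and all names are illustrative assumptions, not the
# paper's exact formulation: the dense teacher-KL term contributes
# gradient only for trajectories whose reward clears a threshold.
from dataclasses import dataclass

@dataclass
class Trajectory:
    reward: float           # trajectory-level (sparse) reward
    student_logprob: float  # sum of student token log-probs
    teacher_kl: float       # mean token-level KL(student || teacher)

def quality_gated_loss(trajs, quality_threshold=1.0, kl_weight=0.1):
    """Combine a REINFORCE-style policy term with a teacher-KL term
    that is gated on trajectory quality, so low-quality rollouts
    contribute no distillation gradient."""
    pg, kl, gated = 0.0, 0.0, 0
    baseline = sum(t.reward for t in trajs) / len(trajs)
    for t in trajs:
        # Policy-gradient surrogate with a mean-reward baseline.
        pg += -(t.reward - baseline) * t.student_logprob
        if t.reward >= quality_threshold:  # quality gate
            kl += t.teacher_kl
            gated += 1
    kl_term = kl_weight * kl / gated if gated else 0.0
    return pg / len(trajs) + kl_term
```

With the gate disabled (threshold above every reward), the loss reduces to the plain policy term, which is the uniform-distillation failure mode the paper argues against.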
📝 Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a "learning cliff." Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejection-sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.
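The knowledge-enhanced exploration strategy described in (ii) can be sketched as hint-conditioned rejection sampling: a teacher-derived hint is injected into the prompt so the student is more likely to produce reward-positive rollouts, and only those are kept for RL updates. The function name, signature, and hint-injection format below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (names and prompt format are assumptions) of
# knowledge-enhanced rejection sampling: generate rollouts from a
# hint-augmented prompt and keep only reward-positive trajectories.
def sample_with_hints(prompt, hint, generate, reward_fn,
                      n_samples=8, max_keep=4):
    """Draw up to n_samples rollouts from the hinted prompt and
    retain at most max_keep trajectories with positive reward."""
    kept = []
    for _ in range(n_samples):
        traj = generate(f"{hint}\n{prompt}")  # teacher hint prepended
        if reward_fn(traj) > 0:               # rejection-sampling filter
            kept.append(traj)
            if len(kept) == max_keep:
                break
    return kept
```

Because only reward-positive trajectories survive the filter, the RL update always receives usable learning signal even when the unhinted policy would rarely succeed, which is how this strategy is meant to counter exploration collapse.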