🤖 AI Summary
Traditional reinforcement learning relies on sparse scalar rewards, making it difficult to extract effective and interpretable learning signals from failure trajectories. This work proposes a multi-turn feedback-guided reinforcement learning framework that introduces a structured language feedback mechanism into RL training: dynamic regeneration is triggered exclusively on failed samples, fine-grained language feedback is embedded into the reasoning process, and within-turn and cross-turn learning signals are jointly optimized. The proposed approach substantially improves training efficiency, outperforming both supervised fine-tuning and conventional reinforcement learning with verifiable rewards (RLVR) on the OpenR1-Math benchmark, while also demonstrating strong out-of-domain generalization.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across multiple domains, yet outcome-only scalar rewards are often sparse and uninformative, especially on failed samples, where they merely indicate failure and provide no insight into why the reasoning fails. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework. It builds on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples, (2) two complementary learning signals for within-turn and cross-turn optimization, and (3) structured feedback injection into the model's reasoning process. Trained on a sample of OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.
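The three mechanisms above can be sketched as a single rollout loop. This is a minimal toy illustration, not the paper's implementation: `generate`, `verify`, and `make_feedback` are hypothetical stubs standing in for the policy model, the verifiable-reward checker, and the feedback generator.

```python
# Toy sketch of feedback-guided multi-turn regeneration (all stubs are
# hypothetical placeholders, not the paper's actual components).

def generate(prompt: str) -> str:
    # Stub "policy": succeeds once feedback has been injected into the prompt.
    return "correct" if "[FEEDBACK]" in prompt else "wrong"

def verify(answer: str) -> bool:
    # Outcome-only verifiable reward: pass/fail on the final answer.
    return answer == "correct"

def make_feedback(answer: str) -> str:
    # Structured verbal feedback explaining why the attempt failed.
    return f"[FEEDBACK] previous answer '{answer}' was wrong; re-check the reasoning."

def feedback_guided_rollout(prompt: str, max_turns: int = 3):
    """Collect per-turn (answer, reward) pairs; the trajectory supports
    both within-turn and cross-turn learning signals."""
    turns = []
    for _ in range(max_turns):
        answer = generate(prompt)
        reward = 1.0 if verify(answer) else 0.0
        turns.append((answer, reward))
        if reward == 1.0:
            break  # success: regeneration is triggered only on failures
        prompt += "\n" + make_feedback(answer)  # inject feedback, then retry
    return turns

turns = feedback_guided_rollout("Solve: 2 + 2 = ?")
print(turns)  # first turn fails; the second succeeds after feedback injection
```

The key design point mirrored here is that successful samples exit immediately with a standard scalar reward, so the extra cost of verbal feedback and regeneration is paid only on the failed samples where the scalar reward alone is uninformative.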