🤖 AI Summary
This work proposes a multi-turn reinforcement learning framework, RL from Text Feedback (RLTF), to address the limitations of sparse scalar rewards and the high cost of full demonstration supervision in large language model post-training. RLTF leverages natural language feedback—provided by humans or automated evaluators—as an efficient supervisory signal that lies between scalar rewards and complete demonstrations. The authors introduce two approaches, Self Distillation (RLTF-SD) and Feedback Modeling (RLTF-FM), which internalize feedback during training, enabling the model to produce higher-quality single-turn outputs at inference time without relying on external feedback. Experiments on reasoning puzzles, competition math, and creative writing tasks show that RLTF consistently outperforms strong baselines, highlighting the potential of textual feedback as a scalable signal for reinforcement learning in language models.
📝 Abstract
The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout, expressed as a binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), in which text feedback is available during training but not at inference. Models must therefore learn to internalize the feedback in order to improve their test-time single-turn performance. To this end, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations, and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis of both methods and empirically evaluate them on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
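To make the self-distillation idea concrete, here is a minimal toy sketch (not the paper's implementation) of the RLTF-SD objective: the single-turn policy is distilled toward a feedback-conditioned "second-turn" distribution by gradient descent on a KL divergence. The three-way categorical vocabulary, the specific probability values, and the learning rate are all hypothetical, chosen only to illustrate the direction of the update.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy setup (hypothetical values, for illustration only):
# - student: the single-turn policy pi(y | x), initially uniform
# - teacher: the feedback-conditioned second-turn policy pi(y | x, y1, f),
#   which the text feedback has steered toward answer 0
student_logits = [0.0, 0.0, 0.0]
teacher_probs = [0.7, 0.2, 0.1]

# Distillation: minimize KL(teacher || student). For a softmax policy,
# the gradient w.r.t. the student logits is softmax(logits) - teacher.
lr = 1.0
for _ in range(200):
    p = softmax(student_logits)
    grad = [pi - ti for pi, ti in zip(p, teacher_probs)]
    student_logits = [l - lr * g for l, g in zip(student_logits, grad)]

final = softmax(student_logits)
print([round(x, 2) for x in final])  # approaches the teacher distribution
```

In the actual method the "teacher" samples come from the model's own feedback-conditioned generations rather than a fixed distribution, so the single-turn policy absorbs the improvement the feedback would have provided, and no feedback is needed at inference time.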