Language Models Can Learn from Verbal Feedback Without Scalar Rewards

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing feedback-learning methods for LLMs rely on scalar rewards, which discards the semantic richness of natural language feedback and introduces reward-scale imbalance. This paper proposes the feedback-conditional policy (FCP), which treats natural language feedback directly as a generation condition rather than a reward signal, thereby reformulating feedback learning as a conditional language modeling task. Methodologically, FCP combines offline supervised fine-tuning on response-feedback pairs with an online stage of feedback sampling and policy self-bootstrapping, removing the need to design a scalar reward. According to the summary, experiments show that FCP improves the model's understanding of and responsiveness to complex, fine-grained feedback, enabling sustained improvement in generation quality across multi-turn interactions. The implementation is publicly available.

📝 Abstract

LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
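
The "feedback-conditional posterior" mentioned in the abstract can be written out with Bayes' rule. The notation below is an assumption made for illustration and is not taken from this page: $x$ is the prompt, $o$ the response, $c$ the verbal feedback, $\pi_{\mathrm{ref}}$ the model that produced the offline responses, $p_{\mathrm{env}}$ the (human or AI) feedback source, and $\pi_\theta$ the policy being trained.

```latex
% Posterior over responses after observing feedback c (Bayes' rule):
P(o \mid x, c) \;\propto\; \pi_{\mathrm{ref}}(o \mid x)\, p_{\mathrm{env}}(c \mid x, o)

% Maximum likelihood on offline (x, o, c) triples fits \pi_\theta to this posterior:
\max_{\theta}\;
\mathbb{E}_{x,\; o \sim \pi_{\mathrm{ref}}(\cdot \mid x),\; c \sim p_{\mathrm{env}}(\cdot \mid x, o)}
\bigl[\log \pi_\theta(o \mid x, c)\bigr]
```

Under these assumptions no scalar reward appears anywhere: the feedback enters only as part of the conditioning context of an ordinary log-likelihood objective.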

Problem

Research questions and friction points this paper is trying to address.

Learning from verbal feedback without scalar reward compression

Treating feedback as a conditioning signal for language models

Reframing feedback learning as a conditional generation problem
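
To make "feedback as a conditioning signal" concrete, here is a minimal sketch of how one offline response-feedback pair might be turned into a conditional language modeling example. The `[FEEDBACK]`/`[PROMPT]`/`[RESPONSE]` markers, the ordering of feedback and prompt, the whitespace tokenizer, and all names are hypothetical choices for illustration, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeedbackExample:
    prompt: str
    response: str
    feedback: str  # free-form verbal critique of the response


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption); a real setup would use the model's tokenizer.
    return text.split()


def build_conditional_example(ex: FeedbackExample) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) for one feedback-conditioned sequence.

    The feedback and prompt form the conditioning context; the loss mask is 1
    only on response tokens, so the likelihood being maximized is that of the
    response given (prompt, feedback).
    """
    context = whitespace_tokenize(
        f"[FEEDBACK] {ex.feedback} [PROMPT] {ex.prompt} [RESPONSE]"
    )
    target = whitespace_tokenize(ex.response)
    tokens = context + target
    loss_mask = [0] * len(context) + [1] * len(target)
    return tokens, loss_mask


example = FeedbackExample(
    prompt="What is 2+2?",
    response="The answer is 4.",
    feedback="Correct and concise.",
)
tokens, loss_mask = build_conditional_example(example)
```

Because the critique stays in the context as text, a negative critique is still usable training data: it teaches the model what a response matching that critique looks like, rather than being collapsed into a low number.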

Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats verbal feedback as a conditioning signal

Trains a feedback-conditional policy by maximum likelihood on response-feedback pairs

Online bootstrapping refines policy with fresh feedback
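
The online bootstrapping item above can be sketched as a data-collection round: the policy generates under a positive condition, a feedback source critiques each response, and the resulting triples become new training data. This is a rough sketch under stated assumptions; `bootstrap_round`, `toy_generate`, `toy_feedback`, and the condition string are hypothetical, and the real generator and feedback source would be an LLM and a human or AI critic.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (prompt, response, fresh feedback)


def bootstrap_round(
    prompts: List[str],
    generate: Callable[[str, str], str],
    get_feedback: Callable[[str, str], str],
    positive_condition: str,
) -> List[Triple]:
    """Collect one round of fresh (prompt, response, feedback) triples."""
    triples = []
    for prompt in prompts:
        # Generate while conditioning on the desired (positive) feedback.
        response = generate(prompt, positive_condition)
        # Ask the feedback source what it actually thinks of the response.
        feedback = get_feedback(prompt, response)
        triples.append((prompt, response, feedback))
    return triples


def toy_generate(prompt: str, condition: str) -> str:
    # Hypothetical stand-in for sampling from the current policy.
    return f"Answer to {prompt}"


def toy_feedback(prompt: str, response: str) -> str:
    # Hypothetical stand-in for a human or AI critic.
    if len(response.split()) < 5:
        return "Too short, add reasoning."
    return "Clear and complete."


triples = bootstrap_round(
    ["q1", "q2"], toy_generate, toy_feedback, "Correct and clear."
)
```

In this sketch each collected triple would then be fed back into the same maximum likelihood training used offline, with the fresh feedback as the condition; that update step is omitted here because it depends on the training framework.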
