RL with Learnable Textual Feedback: A Bilevel Approach

📅 2026-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two problems: the low sample efficiency of sparse-reward reinforcement learning, and the underuse of textual feedback as a direct driver of policy improvement. It formulates learnable natural language feedback as a Stackelberg bilevel optimization problem and proposes the Bilevel Natural Language Actor-Critic (Bi-NAC) algorithm, which jointly trains feedback generation and policy optimization: a critic learns to produce textual feedback that raises the actor's reward, and an actor learns to exploit that feedback. On MATH-500, MBPP, and GPQA, Bi-NAC outperforms fixed-critic and standard RL baselines; the 2B model reaches 46.6% on MATH-500 versus 41.4% for the 3B GRPO baseline, and the 6B model reaches 49.3% on GPQA versus 43.6% for the 7B GRPO baseline.
📝 Abstract
Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning remains sample-inefficient when terminal rewards are sparse. This has motivated a growing line of work on RL with textual feedback, where a critic model generates natural language feedback to guide a reasoning model (the actor), augmenting scalar rewards with richer learning signals. However, existing methods typically treat feedback as fixed or auxiliary, which misses a key property: feedback should not merely be correct, but should improve the policy (actor model) when provided in context. This motivates a paradigm of learnable textual feedback for RL. Yet the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across MATH-500, MBPP, and GPQA, Bi-NAC improves sample and parameter efficiency over RL and fixed-critic baselines: our 2B model outperforms the 3B GRPO baseline, achieving 46.6% versus 41.4% on MATH-500, while our 6B model surpasses the 7B GRPO baseline, achieving 49.3% versus 43.6% on GPQA.
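For readers unfamiliar with the term, the generic shape of a Stackelberg bilevel program can be sketched as follows. This is only the textbook template, not the paper's actual objective, which the abstract does not state. The symbols are assumptions introduced here for illustration: \phi for the critic (leader) parameters, \theta for the actor (follower) parameters, F for the leader's objective (for example, the actor's expected verifiable reward when given the critic's feedback in context), and G for the actor's own RL objective under that feedback. Assigning the critic to the leader role is likewise an assumption.

```latex
% Generic Stackelberg bilevel template (illustrative; all symbols are assumed,
% not taken from the paper).
%   \phi   : critic (leader) parameters
%   \theta : actor (follower) parameters
\begin{aligned}
\max_{\phi} \quad & F\bigl(\phi,\ \theta^{*}(\phi)\bigr) \\
\text{s.t.} \quad & \theta^{*}(\phi) \in \arg\max_{\theta}\ G(\theta,\ \phi)
\end{aligned}
```

The nesting carries the coupling the abstract describes: the leader's objective is evaluated at the follower's best response \theta^{*}(\phi), so feedback is judged by how well the actor performs after learning from it, not by its correctness in isolation.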
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
textual feedback
sample efficiency
bilevel optimization
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

learnable textual feedback
bilevel reinforcement learning
natural language actor-critic
sample efficiency
Stackelberg optimization