🤖 AI Summary
To address the instability of policy-gradient training, low sample efficiency, and difficulty of exploration in natural-language action spaces for LLM-based agents in sparse-reward, long-horizon tasks, this paper proposes the Natural Language Actor-Critic (NLAC) framework. Methodologically, NLAC introduces a generative large language model as a *natural language critic*, which produces structured textual feedback—rather than scalar rewards—to provide fine-grained, interpretable optimization signals to the actor. Critically, it integrates off-policy reinforcement learning to enable stable, policy-gradient-free policy updates. The key contributions include: (i) the first formalization of a natural language critic for LLM agents; (ii) a novel off-policy training paradigm that decouples value estimation from action generation; and (iii) empirical validation demonstrating superior performance over baselines in reasoning, web navigation, and tool-use tasks—achieving higher data efficiency, training stability, and cross-task generalization.
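The loop described above can be sketched in miniature: the actor proposes an action, the critic returns a textual critique instead of a scalar value, the critique directly guides a revision, and the transition is stored for off-policy updates. This is a hedged illustration, not the paper's implementation; `actor_llm` and `critic_llm` are hypothetical stand-ins for trained language models, and the critique strings are placeholders.

```python
from dataclasses import dataclass
from collections import deque
from typing import Optional

@dataclass
class Transition:
    state: str
    action: str
    critique: str  # natural-language feedback in place of a scalar value

def actor_llm(state: str, critique: Optional[str] = None) -> str:
    """Stand-in actor: proposes an action, optionally conditioned on a critique."""
    base = f"answer({state})"
    return f"revised_{base}" if critique else base

def critic_llm(state: str, action: str) -> str:
    """Stand-in natural-language critic: explains *why* an action is suboptimal."""
    if action.startswith("revised_"):
        return "OK: action addresses the task."
    return "Suboptimal: the action ignores part of the task; be more specific."

def nlac_step(state: str, replay_buffer: deque) -> str:
    """One interaction step: act, critique, revise, store off-policy data."""
    action = actor_llm(state)
    critique = critic_llm(state, action)
    # The textual critique guides revision directly, so the actor need not
    # rely on random exploration over the open-ended action space.
    if not critique.startswith("OK"):
        action = actor_llm(state, critique)
    # Transitions land in a replay buffer, supporting off-policy updates
    # that decouple critique (value) estimation from action generation.
    replay_buffer.append(Transition(state, action, critique))
    return action

buffer: deque = deque(maxlen=1000)
final_action = nlac_step("task_1", buffer)
print(final_action)  # → revised_answer(task_1)
```

The key structural point the sketch captures is that the critic's output is consumed by the actor as conditioning text, not reduced to a reward number before use.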
📝 Abstract
Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural-language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; in particular, in tasks with large, open-ended action spaces, natural-language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason about how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web-browsing, and dialogue-based tool-use tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.