Asymmetric Actor-Critic for Multi-turn LLM Agents

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of ensuring reliable behavior of large language model (LLM) agents in one-shot multi-turn interactions. The authors propose an asymmetric actor-critic framework that employs a fixed, closed-source LLM as the actor to generate responses, while a lightweight, open-source small model serves as the critic to monitor and intervene in the dialogue trajectory in real time, without requiring actor retries or fine-tuning. This approach introduces, for the first time, an asymmetric generation-verification mechanism into multi-turn LLM agent systems, enabling effective supervision of non-trainable closed-source models. The method also features a data generation pipeline that supports critic fine-tuning without modifying the actor. Experiments on τ-bench and UserBench demonstrate that the framework significantly outperforms strong baselines, with the lightweight critic achieving supervision performance comparable to or even exceeding that of larger closed-source models, and further gains in task success rates after fine-tuning.
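The runtime supervision loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all class and method names (`Actor`, `Critic`, `respond`, `review`, `run_dialogue`) are hypothetical, the actor stands in for a fixed closed-source LLM, and the critic's check is a toy rule in place of a small open-source model's judgment.

```python
# Hypothetical sketch of the asymmetric actor-critic loop: a fixed actor
# generates each turn, a lightweight critic reviews it, and interventions
# are injected into the SAME trajectory (no retries, no actor fine-tuning).
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    turns: list = field(default_factory=list)  # (role, text) pairs


class Actor:
    """Placeholder for a fixed, non-trainable proprietary LLM."""

    def respond(self, trajectory):
        # A real actor would condition on the full trajectory,
        # including any critic guidance previously injected.
        return f"actor reply #{len(trajectory.turns)}"


class Critic:
    """Placeholder for a small open-source verifier model."""

    def review(self, trajectory, action):
        # Toy rule: flag empty actions; a trained critic would judge
        # whether the action keeps the task on track.
        if not action:
            return False, "response was empty; ask a clarifying question"
        return True, None


def run_dialogue(actor, critic, user_turns):
    """One-shot loop: the critic monitors every actor action and, when it
    objects, steers the actor's next turn by adding guidance in-context."""
    traj = Trajectory()
    for user_msg in user_turns:
        traj.turns.append(("user", user_msg))
        action = actor.respond(traj)
        ok, guidance = critic.review(traj, action)
        if not ok:
            # Intervention within the same interaction trajectory,
            # rather than retrying or re-ranking actor outputs.
            traj.turns.append(("critic", guidance))
        traj.turns.append(("actor", action))
    return traj
```

The key design point the sketch mirrors is the generation-verification asymmetry: generation stays with the large model, while the cheaper review step runs on every turn.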
πŸ“ Abstract
Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on $τ$-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.
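The abstract's data generation pipeline for critic fine-tuning can be pictured as follows. This is a speculative sketch under stated assumptions: `label_action` and `build_critic_dataset` are invented names, and the labeling rule is a placeholder for whatever judge or environment signal the paper actually uses. The only property it mirrors from the abstract is that training examples for the critic are produced without touching the actor.

```python
# Hypothetical sketch: turn rolled-out dialogues into supervised examples
# (context, actor action) -> verdict, for fine-tuning the small critic.
# The actor is never modified; only its outputs are labeled.

def label_action(context, action):
    # Placeholder judge; the paper's pipeline would derive this signal
    # from a stronger model or from task/environment feedback.
    return "ok" if action.strip() else "intervene"


def build_critic_dataset(dialogues):
    """dialogues: list of dialogues, each a list of (role, text) turns.
    Returns one labeled example per actor action."""
    examples = []
    for turns in dialogues:
        context = []
        for role, text in turns:
            if role == "actor":
                examples.append({
                    "context": list(context),  # everything before the action
                    "action": text,
                    "label": label_action(context, text),
                })
            context.append((role, text))
    return examples
```

A critic fine-tuned on such triples learns to map a dialogue prefix plus a candidate actor action to an approve/intervene decision, which is exactly the interface the runtime loop needs.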
Problem

Research questions and friction points this paper is trying to address.

multi-turn interactions
reliable behavior
one-shot settings
conversational agents
proprietary LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

asymmetric actor-critic
multi-turn LLM agents
runtime supervision
critic fine-tuning
one-shot reliability