$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

📅 2025-06-09

🤖 AI Summary

Problem: Existing benchmarks for dialogue agents assume single-sided control, where only the AI invokes tools, and overlook real-world collaborative scenarios in which the user must also act on a shared environment (e.g., telecom technical support). Method: The paper proposes $\tau^2$-bench, a benchmark that supports dual control. It models agent-user collaboration as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), includes a compositional task generator that builds verifiable tasks programmatically, and introduces a user simulator whose behavior is constrained by tools and observable state. It also supports fine-grained analysis that separates reasoning errors from communication and coordination failures. Contribution/Results: Experiments show significant performance drops when agents move from a no-user setting to dual control, pointing to user guidance as a bottleneck. $\tau^2$-bench offers a controlled testbed for evaluating agents that must both reason effectively and guide user actions.

📝 Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce $\tau^2$-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, $\tau^2$-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
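To make the dual-control idea concrete, here is a minimal Python sketch of a shared world in which the agent and the user each hold a different tool set. It is an illustration only, not the benchmark's actual API: the class `SharedWorld`, the tool names, and the `call` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SharedWorld:
    """Toy shared state (hypothetical fields, not from the paper)."""
    airplane_mode: bool = True   # device side: only the user can change it
    line_suspended: bool = True  # account side: only the agent can change it


def resume_line(world: SharedWorld) -> str:
    """Agent-side tool: lift the suspension on the account."""
    world.line_suspended = False
    return "line resumed"


def toggle_airplane_mode(world: SharedWorld) -> str:
    """User-side tool: flip airplane mode on the device."""
    world.airplane_mode = not world.airplane_mode
    return "airplane mode on" if world.airplane_mode else "airplane mode off"


def check_status_bar(world: SharedWorld) -> str:
    """User-side tool: a partial observation of the shared state."""
    if world.airplane_mode:
        return "airplane icon"
    return "no service" if world.line_suspended else "signal bars"


# Each role may only call its own tools; both act on the same world.
TOOLS = {
    "agent": {"resume_line": resume_line},
    "user": {
        "toggle_airplane_mode": toggle_airplane_mode,
        "check_status_bar": check_status_bar,
    },
}


def call(role: str, tool: str, world: SharedWorld) -> str:
    """Dispatch a tool call, rejecting tools outside the caller's role."""
    if tool not in TOOLS[role]:
        raise PermissionError(f"{role} cannot call {tool}")
    return TOOLS[role][tool](world)


def has_service(world: SharedWorld) -> bool:
    """Goal condition: needs one agent action and one user action."""
    return not world.airplane_mode and not world.line_suspended
```

In this toy setup the goal cannot be reached by the agent alone: it has to fix the account itself and also talk the user through the device-side step, which is the coordination burden the abstract describes.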

Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents in dual-control conversational environments

Addressing gaps in single-control AI benchmarks

Testing agent coordination and user guidance capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-control environment for agent-user interaction

Compositional task generator for diverse tasks

Reliable user simulator with constrained behavior
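As a rough illustration of what composing tasks from atomic components could look like, the Python sketch below builds tasks as combinations of atomic issues, each with a known fix, so that success can be checked programmatically. The issue names, fix names, and helper functions are assumptions made for this example and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical atomic issues: each has one fix and one party able to apply it.
ATOMIC_ISSUES = {
    "airplane_mode_on": {"fix": "toggle_airplane_mode", "fixer": "user"},
    "line_suspended": {"fix": "resume_line", "fixer": "agent"},
    "mobile_data_off": {"fix": "toggle_mobile_data", "fixer": "user"},
}


def generate_tasks(max_issues: int) -> list[dict]:
    """Enumerate every combination of 1..max_issues atomic issues."""
    tasks = []
    names = sorted(ATOMIC_ISSUES)
    for k in range(1, max_issues + 1):
        for combo in combinations(names, k):
            tasks.append({
                "issues": combo,
                "required_fixes": {ATOMIC_ISSUES[n]["fix"] for n in combo},
                "num_issues": k,  # a simple handle on task complexity
            })
    return tasks


def is_solved(task: dict, applied_fixes: list[str]) -> bool:
    """A task is verified as solved once all required fixes were applied."""
    return task["required_fixes"] <= set(applied_fixes)
```

Calling `generate_tasks(2)` on the three issues above yields the three single-issue tasks plus the three two-issue combinations. Capping `max_issues` is one simple way to control complexity, and enumerating combinations gives coverage of the atomic components.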
