ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation benchmarks for task-oriented dialogue systems lack systematic support for assessing agentic behaviors such as multi-goal coordination, long-term memory, and proactive execution. To address this gap, this work proposes the ATOD benchmark together with a synthetic dialogue generation pipeline that automatically produces richly annotated conversations requiring long-horizon reasoning. It also introduces the ATOD-Eval evaluation framework, the first to formally define and quantify key dimensions of agentic behavior. The framework enables fine-grained assessment across task completion, agentic capability, and response quality, and provides reproducible offline and online evaluation protocols. Notably, the proposed memory-based evaluator achieves a superior trade-off between accuracy and efficiency, significantly outperforming existing methods.

📝 Abstract
Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.
Problem

Research questions and friction points this paper is trying to address.

Task-Oriented Dialogue
Agentic Behavior
Evaluation Benchmark
Long-Horizon Reasoning
Multi-Goal Coordination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Task-Oriented Dialogue
Evaluation Framework
Synthetic Dialogue Generation
Long-Horizon Reasoning
Memory-Based Evaluator
Yifei Zhang (Amazon)
H. Nayyeri (Amazon)
R. Khaziev (Amazon)
Emine Yilmaz (University College London) · Information Retrieval, Natural Language Processing, Machine Learning
Gokhan Tur (University of Illinois Urbana-Champaign) · Conversational AI, Language Understanding, Large Language Models
Dilek Hakkani-Tur (Amazon; University of Illinois Urbana-Champaign)
Hari Thadakamalla (Amazon)