ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation benchmarks for task-oriented dialogue systems lack systematic support for assessing agentic behaviors such as multi-goal coordination, long-term memory, and proactive execution. To address this gap, this work proposes the ATOD benchmark together with a synthetic dialogue generation pipeline that automatically produces richly annotated conversations requiring long-horizon reasoning. It also introduces the ATOD-Eval evaluation framework, the first to formally define and quantify key dimensions of agentic behavior. The framework enables fine-grained assessment across task completion, agentic capability, and response quality, and provides reproducible offline and online evaluation protocols. Notably, the proposed memory-based evaluator achieves a superior trade-off between accuracy and efficiency, significantly outperforming existing methods.

📝 Abstract
Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.
Problem

Research questions and friction points this paper is trying to address.

Task-Oriented Dialogue
Agentic Behavior
Evaluation Benchmark
Long-Horizon Reasoning
Multi-Goal Coordination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Task-Oriented Dialogue
Evaluation Framework
Synthetic Dialogue Generation
Long-Horizon Reasoning
Memory-Based Evaluator
Yifei Zhang (Amazon)
H. Nayyeri (Amazon)
R. Khaziev (Amazon)
Emine Yilmaz (University College London) · Information Retrieval, Natural Language Processing, Machine Learning
Gokhan Tur (University of Illinois Urbana-Champaign) · Conversational AI, Language Understanding, Large Language Models
Dilek Hakkani-Tur (Amazon; University of Illinois Urbana-Champaign)
Hari Thadakamalla (Amazon)