TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

📅 2025-04-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing task-oriented dialogue (TOD) evaluation methods struggle to detect intermediate interaction errors in large language model (LLM)-driven systems, and their reliance on coarse-grained dialogue-level metrics results in low correlation with human judgments. To address this, we propose TD-EVAL, a novel two-stage collaborative evaluation framework that integrates (i) turn-level, three-dimensional diagnostic assessment (coherence, knowledge consistency, and policy compliance) and (ii) dialogue-level LLM-based pairwise arena evaluation (TOD Agent Arena). Leveraging multi-dimensional turn scoring, preference comparison, and TOD-specific prompt engineering, TD-EVAL achieves significant improvements over both conventional and LLM-based baselines on MultiWOZ 2.4 and τ-Bench, boosting Kendall's τ by 19.3%. It accurately identifies intermediate errors and supports plug-and-play evaluation without system-specific fine-tuning.
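The alignment claim above is measured with Kendall's τ, a rank correlation between metric scores and human judgments. A minimal sketch of the τ-a variant (no tie correction), with hypothetical scores for four systems; the numbers are illustrative, not from the paper:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # both rankings order the pair the same way
        elif s < 0:
            discordant += 1   # the rankings disagree on this pair
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical example: metric scores vs. human ratings for four systems.
metric = [0.9, 0.7, 0.6, 0.4]
human = [5, 4, 2, 3]
print(round(kendall_tau(metric, human), 3))  # → 0.667
```

A higher τ means the metric ranks systems more like humans do; a 19.3% boost in τ therefore indicates substantially better agreement on pairwise system orderings.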

๐Ÿ“ Abstract
Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and { au}-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.
Problem

Research questions and friction points this paper is trying to address.

Current methods lack precision for evaluating task-oriented dialogue systems
Traditional metrics miss critical intermediate errors in user-agent interactions
Need for unified turn-level and dialogue-level evaluation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines turn-level precision with dialogue-level comparisons
Evaluates conversation cohesion, knowledge consistency, policy compliance
Uses TOD Agent Arena for pairwise dialogue-level quality
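The two stages listed above can be sketched as two aggregations: averaging turn-level dimension scores, and computing a win rate from pairwise arena comparisons. This is a minimal illustration, not the paper's exact scoring protocol; the 1–5 scale, the equal-weight averaging, and the half-point tie credit are all assumptions:

```python
from statistics import mean

# Hypothetical turn-level judge scores (1-5) for one system; the three
# dimension names follow the paper, the scale is an assumption.
turn_scores = [
    {"cohesion": 5, "knowledge": 4, "policy": 5},
    {"cohesion": 4, "knowledge": 5, "policy": 4},
    {"cohesion": 3, "knowledge": 4, "policy": 5},
]

def turn_level_score(turns):
    """Average each dimension over turns, then average the dimensions."""
    dims = turns[0].keys()
    per_dim = {d: mean(t[d] for t in turns) for d in dims}
    return per_dim, mean(per_dim.values())

def arena_win_rate(pairwise_results, system):
    """Fraction of pairwise comparisons won; ties count as half a win."""
    relevant = [r for r in pairwise_results if system in (r["a"], r["b"])]
    points = 0.0
    for r in relevant:
        if r["winner"] == system:
            points += 1.0
        elif r["winner"] == "tie":
            points += 0.5
    return points / len(relevant)

# Hypothetical arena outcomes between three systems.
results = [
    {"a": "sysA", "b": "sysB", "winner": "sysA"},
    {"a": "sysA", "b": "sysC", "winner": "tie"},
    {"a": "sysB", "b": "sysA", "winner": "sysA"},
]

per_dim, overall = turn_level_score(turn_scores)
rate = arena_win_rate(results, "sysA")
print(round(overall, 3), round(rate, 3))
```

Keeping the two numbers separate mirrors the framework's design: the turn-level score localizes where an intermediate error occurred, while the arena win rate captures holistic dialogue quality relative to other systems.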