clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of instruction-tuned LLM-based task-oriented dialogue systems often decouple user simulators from system architectures, yielding configuration-specific insights without cross-cutting generalisability. To address this, the authors propose clem:todd, a reproducible evaluation framework for LLM dialogue systems that standardises datasets, metrics, and computational constraints while supporting modular user simulators, pluggable system interfaces, and a unified evaluation pipeline. This enables fair, controlled comparisons across model scales, architectural designs, and prompting strategies. Applying clem:todd to re-evaluate existing task-oriented dialogue systems and three newly proposed architectures, the paper isolates how model scale, architecture choice, and prompt engineering each affect task completion, robustness, and inference efficiency.

📝 Abstract
The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from the literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-based dialogue systems in isolation limits generalisability
Lack of systematic benchmarking across user simulators and dialogue systems
Need for consistent conditions to compare architectures and configurations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible framework for evaluating dialogue systems under consistent conditions
Plug-and-play integration of user simulators and dialogue systems
Uniform datasets, evaluation metrics, and computational constraints
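The plug-and-play integration described above can be pictured as a pair of small interfaces driven by a shared evaluation loop: any user simulator and any dialogue system that implement the same contract can be benchmarked together. The sketch below is illustrative only and is not the actual clem:todd API; the names `UserSimulator`, `DialogueSystem`, and `run_dialogue` are assumptions for this example.

```python
from abc import ABC, abstractmethod

class UserSimulator(ABC):
    """Hypothetical interface: any LLM-backed user simulator plugs in here."""
    @abstractmethod
    def respond(self, system_utterance: str) -> str:
        ...

class DialogueSystem(ABC):
    """Hypothetical interface for the dialogue system under evaluation."""
    @abstractmethod
    def respond(self, user_utterance: str) -> str:
        ...

def run_dialogue(simulator: UserSimulator, system: DialogueSystem,
                 max_turns: int = 10) -> list[tuple[str, str]]:
    """Drive one simulated conversation and return the transcript.

    The same loop runs for every simulator/system pairing, which is what
    makes comparisons across configurations controlled and fair.
    """
    transcript = []
    system_msg = system.respond("")  # system opens the conversation
    for _ in range(max_turns):
        user_msg = simulator.respond(system_msg)
        transcript.append(("user", user_msg))
        if user_msg.strip().lower() == "done":  # simulator signals task end
            break
        system_msg = system.respond(user_msg)
        transcript.append(("system", system_msg))
    return transcript
```

With such a contract, swapping in a different simulator or system is a one-line change, while the datasets, metrics, and turn budget stay fixed by the harness.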