TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing TOD benchmarks oversimplify the complex service instructions found in real-world settings, such as fine-grained conditional constraints and multi-level condition-action statements, and thus fail to adequately assess large language models' (LLMs) comprehension and execution of procedural instructions. To address this gap, we introduce TOD-ProcBench, a multi-turn task-oriented dialogue benchmark explicitly designed to evaluate complex instruction following. The benchmark comprises three tasks: retrieving the most relevant instruction statement and predicting the next action, detecting instruction-violating responses, and conditionally generating instruction-following responses. Building on the high-quality ABCD dataset, we construct reliable test instances via multi-level condition-action modeling and human-verified instruction perturbation. Experiments across multilingual settings and different instruction text formats reveal substantial deficiencies in state-of-the-art LLMs' adherence to intricate constraints. This work provides a benchmark and diagnostic toolkit for advancing controllability and reliability in task-oriented dialogue systems.

📝 Abstract
In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.
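The "multi-level condition-action instruction statements" described above can be pictured as a small tree of conditions with attached actions, where Task 1 amounts to finding the deepest statement whose conditions hold in the current dialogue state. The sketch below is an illustrative reconstruction, not the paper's actual schema; the class name, fields, and the sample refund policy are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ConditionActionStatement:
    """One instruction statement: conditions that must hold, the action to
    take, and nested sub-statements for deeper condition levels."""
    conditions: list[str]
    action: str
    children: list["ConditionActionStatement"] = field(default_factory=list)

def retrieve_next_action(statements, active_conditions):
    """Walk the statement tree and return the action of the deepest
    statement whose conditions are all satisfied by the dialogue state."""
    best = None
    for stmt in statements:
        if all(c in active_conditions for c in stmt.conditions):
            best = stmt.action
            deeper = retrieve_next_action(stmt.children, active_conditions)
            if deeper is not None:
                best = deeper
    return best

# Hypothetical instruction fragment in the spirit of the benchmark:
policy = [
    ConditionActionStatement(
        conditions=["customer requests refund"],
        action="verify order number",
        children=[
            ConditionActionStatement(
                conditions=["order older than 30 days"],
                action="escalate to manager",
            )
        ],
    )
]

print(retrieve_next_action(policy, {"customer requests refund"}))
# -> verify order number
print(retrieve_next_action(policy, {"customer requests refund",
                                    "order older than 30 days"}))
# -> escalate to manager
```

An LLM solving Task 1 must perform this retrieval implicitly from the natural-language instruction document, without the benefit of an explicit tree structure.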
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs' ability to follow complex process instructions in task-oriented dialogues
Evaluating how LLMs handle intricate fine-grained constraints in multi-turn conversations
Assessing LLMs' capabilities in identifying and avoiding instruction-violating responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces multi-level condition-action instruction statements
Designs three tasks for comprehensive instruction-following evaluation
Synthesizes instruction-violating responses for robustness testing
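The violation-synthesis idea behind Task 2 can be sketched as pairing each condition with an action drawn from a different statement, so that detectors see both compliant and violating examples. This is a minimal illustration of the injection principle, not the paper's pipeline (which also manipulates the instructions themselves and applies human quality control); the function and sample pairs are hypothetical.

```python
import random

def synthesize_violations(pairs, seed=0):
    """Given (condition, compliant_action) pairs, build labeled examples:
    each original pair is labeled compliant, and a copy whose action is
    swapped with another statement's action is labeled violating."""
    rng = random.Random(seed)
    examples = []
    for i, (cond, action) in enumerate(pairs):
        examples.append({"condition": cond, "response": action,
                         "label": "compliant"})
        # Inject an inconsistency: reuse an action from a different statement.
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        examples.append({"condition": cond, "response": pairs[j][1],
                         "label": "violating"})
    return examples

pairs = [
    ("customer requests refund", "verify order number"),
    ("order older than 30 days", "escalate to manager"),
]
data = synthesize_violations(pairs)
print(len(data))  # -> 4 (one compliant and one violating example per pair)
```

A detection model is then scored on how reliably it separates the `compliant` from the `violating` responses given only the instruction text.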