🤖 AI Summary
Existing LLM evaluation frameworks predominantly focus on static, single-turn tasks and fail to capture the dynamic, interactive nature of multi-turn dialogues.
Method: We propose the first automated benchmark framework specifically designed for evaluating LLM-based dialogue agents in dynamic, multi-turn settings. It employs generative user simulation to synthesize realistic dialogue trajectories and systematically assesses agent capabilities across three dimensions: information extraction, context awareness, and adaptive interaction. The framework supports configurable user behavior modeling, dynamic context representation, context-sensitive evaluation metrics, and integrated few-shot/one-shot robustness testing.
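To make this concrete, the sketch below shows how such a simulation-driven evaluation loop could be wired up: a configurable simulated user answers the agent's questions over multiple turns, and the agent's extracted information is scored against the simulator's ground truth. All names here (`UserProfile`, `simulate_dialogue`, the agent and user-simulator interfaces) are hypothetical illustrations of the described method, not the paper's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Hypothetical configuration of a simulated user's behavior."""
    persona: str = "first-time loan applicant"
    ambiguity_rate: float = 0.3                # fraction of turns answered vaguely
    facts: dict = field(default_factory=dict)  # ground-truth slot values to elicit

def simulate_dialogue(agent, user_sim, profile, max_turns=10):
    """Roll out one synthetic multi-turn dialogue and collect extracted slots."""
    history, extracted = [], {}
    for _ in range(max_turns):
        agent_msg = agent.next_message(history)           # agent asks or clarifies
        user_msg = user_sim.respond(agent_msg, profile)   # generative user reply
        history += [("agent", agent_msg), ("user", user_msg)]
        extracted.update(agent.extract_slots(history))    # information extraction
        if set(extracted) >= set(profile.facts):          # all slots obtained
            break
    return history, extracted

def extraction_accuracy(extracted, profile):
    """Slot-level accuracy of the agent against the simulator's ground truth."""
    if not profile.facts:
        return 0.0
    correct = sum(extracted.get(k) == v for k, v in profile.facts.items())
    return correct / len(profile.facts)
```

Context awareness and adaptive interaction would be scored analogously from the dialogue history, for example by checking whether the agent reuses previously provided information instead of re-asking, and whether it reformulates its questions after an ambiguous reply.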
Contribution/Results: Validated in a loan-application scenario, our framework demonstrates that adaptive interaction strategies significantly improve information extraction accuracy when user responses are ambiguous. It is scalable, fully automated, and exhibits strong cross-domain transfer potential, establishing a rigorous, behaviorally grounded paradigm for evaluating LLM dialogue systems.
📝 Abstract
The rapid evolution of large language models (LLMs) has transformed conversational agents, enabling complex human-machine interactions. However, evaluation frameworks often focus on static, single-turn tasks, failing to capture the dynamic nature of multi-turn dialogues. This paper introduces a dynamic benchmarking framework to assess LLM-based conversational agents through interactions with synthetic users. The framework integrates generative agent simulation to evaluate performance on key dimensions: information extraction, context awareness, and adaptive engagement. By simulating various aspects of user behavior, our work provides a scalable, automated, and flexible benchmarking approach. Experimental evaluation within a loan-application use case demonstrates the framework's effectiveness under one-shot and few-shot extraction conditions. Results show that adaptive strategies improve data extraction accuracy, especially when handling ambiguous responses. Future work will extend the framework's applicability to broader domains and incorporate additional metrics (e.g., conversational coherence, user engagement). This study contributes a structured, scalable approach to evaluating LLM-based conversational agents, supporting their real-world deployment.
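As a rough illustration of the one-shot versus few-shot comparison mentioned above, an evaluation harness built on the `simulate_dialogue` and `extraction_accuracy` helpers sketched earlier might look like the following. `GenerativeUserSimulator`, `make_loan_agent`, and the `num_in_context_examples` parameter are assumed placeholders, not components named by the paper.

```python
def evaluate_condition(agent_factory, profiles, n_examples):
    """Mean extraction accuracy for one prompting condition (one-shot vs. few-shot)."""
    scores = []
    for profile in profiles:
        agent = agent_factory(num_in_context_examples=n_examples)
        user_sim = GenerativeUserSimulator(profile)   # assumed LLM-backed simulator
        _, extracted = simulate_dialogue(agent, user_sim, profile)
        scores.append(extraction_accuracy(extracted, profile))
    return sum(scores) / len(scores)

# Hypothetical usage: contrast one-shot and few-shot extraction accuracy
# acc_one_shot = evaluate_condition(make_loan_agent, test_profiles, n_examples=1)
# acc_few_shot = evaluate_condition(make_loan_agent, test_profiles, n_examples=5)
```

Because the simulated users carry known ground-truth facts, the same harness can be re-run with different user profiles (e.g., higher ambiguity rates) to probe robustness without any human annotation.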