ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

📅 2025-08-28
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing proactive dialogue evaluations are confined to narrow domains or tasks, yielding fragmented capability assessments and no standardized benchmark. To address this, the paper proposes ProactiveEval, a unified, cross-domain evaluation framework that decomposes proactivity into two core dimensions, target planning and dialogue guidance, and supports automated construction of challenging evaluation environments. Combining task-decomposed metrics with domain-adaptive data generation, the authors build 328 evaluation environments across six domains and systematically evaluate 22 LLMs. Results show that DeepSeek-R1 performs best on target planning while Claude-3.7-Sonnet leads on dialogue guidance, and an analysis of reasoning capability suggests it is an important driver of proactive behavior. ProactiveEval thus offers a standardized benchmark and empirical basis for quantifying and improving model proactivity.

📝 Abstract
Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.
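
The paper's implementation is not reproduced here, but the abstract's two-dimensional decomposition suggests a natural shape for an evaluation harness. The sketch below is a hypothetical illustration only: the `ProactiveEnv` and `ProactiveScore` structures, the `plan_target`/`guide_dialogue` model interface, and the judge-based scoring are assumptions, not ProactiveEval's actual API.

```python
# Hypothetical sketch of an evaluation environment and its scoring;
# all names and the scoring split are illustrative assumptions,
# not the authors' actual implementation.
from dataclasses import dataclass


@dataclass
class ProactiveEnv:
    """One evaluation environment: a domain, a conversation target,
    and the dialogue context the model starts from."""
    domain: str                    # e.g. one of the paper's 6 domains
    dialogue_context: list[str]    # prior turns shown to the model
    target: str                    # goal the agent should plan toward


@dataclass
class ProactiveScore:
    """Scores along the two decomposed dimensions."""
    target_planning: float         # did the model set a sensible goal?
    dialogue_guidance: float       # did its turns steer toward it?

    @property
    def overall(self) -> float:
        # Simple average; the paper may weight the dimensions differently.
        return (self.target_planning + self.dialogue_guidance) / 2


def evaluate(model, env: ProactiveEnv, judge) -> ProactiveScore:
    """Run one environment: the model proposes a target plan, then
    produces guiding turns, and a judge scores both stages."""
    plan = model.plan_target(env.dialogue_context, env.domain)
    turns = model.guide_dialogue(env.dialogue_context, plan)
    return ProactiveScore(
        target_planning=judge.score_plan(plan, env.target),
        dialogue_guidance=judge.score_guidance(turns, env.target),
    )
```

Scoring the two stages separately mirrors the framework's claim that proactivity can be measured as goal setting and conversational steering in isolation, which is what lets one model lead on target planning while another leads on dialogue guidance.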
Problem

Research questions and friction points this paper is trying to address.

Evaluating the proactive dialogue abilities of large language models
Overcoming fragmented, domain-specific evaluation practices
Providing a unified framework for assessing proactive conversation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for proactive dialogue evaluation
Decomposes proactive dialogue into target planning and dialogue guidance
Automatically generates diverse and challenging evaluation data (see the sketch below)
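
As a rough illustration of the last point, automated data generation could be driven by a generator LLM prompted per domain. Everything below (the `DOMAINS` placeholder, the prompt wording, and the `llm.complete` interface) is a hypothetical sketch, not the paper's actual pipeline.

```python
# Illustrative sketch of automated evaluation-data generation; the domain
# list, prompt wording, and generate() interface are assumptions.
import json

DOMAINS = ["assumed_domain_1", "assumed_domain_2"]  # the paper uses 6 domains

PROMPT = (
    "Domain: {domain}\n"
    "Write a challenging proactive-dialogue test case as JSON with keys "
    "'dialogue_context' (list of prior user turns) and 'target' "
    "(the goal a proactive agent should steer toward)."
)


def generate_environments(llm, n_per_domain: int = 5) -> list[dict]:
    """Ask a generator LLM for diverse test cases, one domain at a time."""
    envs = []
    for domain in DOMAINS:
        for _ in range(n_per_domain):
            raw = llm.complete(PROMPT.format(domain=domain))
            case = json.loads(raw)   # expect the JSON schema from PROMPT
            case["domain"] = domain
            envs.append(case)
    return envs
```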