🤖 AI Summary
Existing single-score evaluation metrics struggle to comprehensively assess the complex characteristics of multi-party dialogue generation, particularly in speaker modeling, content quality, and consistency. This work proposes MPCEval, the first multidimensional disentangled evaluation framework tailored for this task, which explicitly distinguishes between local (next-turn prediction) and global (full-dialogue generation) objectives and introduces reference-free automatic metrics. By integrating role modeling, semantic consistency, and dialogue structure analysis, MPCEval demonstrates effectiveness across multiple public and real-world datasets. The framework systematically reveals distinctive performance patterns of mainstream models along key dimensions—including participant balance, content progression, novelty, and role consistency—thereby underscoring the critical importance of multidimensional evaluation for accurately understanding model capabilities.
📝 Abstract
Multi-party conversation generation, which powers applications such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker–content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker–content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.
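To give a concrete sense of what a reference-free, quantitative metric for multi-party dialogue can look like, the sketch below scores participation balance as the normalized entropy of the speaker-turn distribution. This is an illustrative assumption for exposition only: the function name, formulation, and normalization are not taken from MPCEval's actual implementation.

```python
from collections import Counter
from math import log

def participation_balance(speakers):
    """Normalized entropy of the speaker-turn distribution.

    A hypothetical participation-balance score: 1.0 when all
    participants take equally many turns, approaching 0.0 when a
    single speaker dominates. Reference-free: it needs only the
    sequence of speaker labels, not any gold conversation.
    """
    counts = Counter(speakers)
    n = len(counts)
    if n < 2:
        # A monologue (or empty conversation) has no balance to measure.
        return 0.0
    total = len(speakers)
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    # Divide by the maximum possible entropy, log(n), to map into [0, 1].
    return entropy / log(n)

# A generated conversation where speaker A dominates scores lower
# than a perfectly balanced one.
dominated = ["A", "A", "A", "A", "B", "C"]
balanced = ["A", "B", "C", "A", "B", "C"]
print(participation_balance(dominated), participation_balance(balanced))
```

Such per-dimension scores are reported separately rather than averaged, which is what lets dimension-specific differences between models (and against human-authored conversations) stay visible.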