🤖 AI Summary
Existing research on spoken role-playing conversational agents (RPCAs) faces two critical bottlenecks: (1) neglect of paralinguistic features—such as intonation and prosody—in character expression, and (2) a longstanding absence of standardized, role-consistency–oriented evaluation benchmarks. To address these, we introduce VoxRole, the first comprehensive benchmark dedicated to spoken role-playing, comprising 1,228 characters from 261 films and 65.6 hours of multi-turn spoken dialogues. We propose a two-stage automated construction pipeline: first, high-precision audio-text alignment; second, large language model–driven generation of fine-grained, multidimensional character profiles that explicitly encode paralinguistic cues and long-term identity consistency. Systematic evaluation of state-of-the-art spoken dialogue models on VoxRole reveals severe limitations in sustaining long-term role consistency—thereby filling a critical gap in the evaluation landscape and providing a reproducible, extensible benchmark and methodology for future research.
📝 Abstract
Recent advances in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features such as intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments and feature thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies such as long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13,335 multi-turn dialogues, totaling 65.6 hours of speech from 1,228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.
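The two-stage pipeline described above can be sketched in miniature. This is an illustrative stand-in, not the paper's implementation: the fuzzy text-similarity alignment, the `Segment` type, and the word-count-style profile aggregation are all hypothetical simplifications of the benchmark's audio-text alignment and LLM-driven profile generation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Segment:
    """A timed subtitle/audio segment (hypothetical stand-in for aligned audio)."""
    start: float  # seconds
    end: float
    text: str

def align(script_lines, segments, threshold=0.6):
    """Stage 1 (illustrative): match each (speaker, line) from the script
    to its most similar timed segment by fuzzy text similarity."""
    aligned = []
    for speaker, line in script_lines:
        def score(seg):
            return SequenceMatcher(None, line.lower(), seg.text.lower()).ratio()
        best = max(segments, key=score)
        if score(best) >= threshold:
            aligned.append((speaker, line, best.start, best.end))
    return aligned

def build_profiles(aligned):
    """Stage 2 (illustrative): aggregate each character's aligned utterances
    into a profile. The paper uses an LLM for multi-dimensional profiling;
    here we only collect utterances and total speech duration."""
    profiles = {}
    for speaker, line, start, end in aligned:
        p = profiles.setdefault(speaker, {"utterances": [], "speech_seconds": 0.0})
        p["utterances"].append(line)
        p["speech_seconds"] += end - start
    return profiles

# Toy example with two script lines and two subtitle segments.
script = [("RICK", "Here's looking at you, kid."),
          ("ILSA", "Play it, Sam.")]
subs = [Segment(10.0, 12.5, "here's looking at you kid"),
        Segment(30.0, 31.2, "play it sam")]
profiles = build_profiles(align(script, subs))
```

In the real pipeline, Stage 1 would operate on movie audio rather than subtitle text, and Stage 2 would prompt an LLM to produce fine-grained persona dimensions (including paralinguistic cues) instead of simple aggregates.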