SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research on role-playing agents focuses predominantly on the text modality, overlooking the critical role of speech in authentic human-agent interaction and lacking systematic evaluation frameworks for Speech Role-Playing Agents (SRPAs). Method: We introduce SpeechRole-Data, a large-scale speech role-playing dataset comprising 98 character types and 112,000 spoken dialogues, and propose SpeechRole-Eval, the first multidimensional benchmark assessing vocal expressiveness, role consistency, and interactive capability. The dataset construction integrates high-fidelity text-to-speech synthesis with authentic human recordings, covering both single-turn and multi-turn dialogues as well as end-to-end and cascaded modeling paradigms. Contribution/Results: We publicly release the full dataset, codebase, and baseline models. Empirical analysis reveals substantial architectural disparities in vocal style consistency and role coherence, thereby bridging a fundamental gap in SRPA evaluation and establishing foundational resources for multimodal role-playing research.

📝 Abstract
Recently, role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. In particular, there is a lack of systematic evaluation for Speech Role-Playing Agents (SRPAs). To address this gap, we construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations. Each role demonstrates distinct vocal characteristics, including timbre and prosody, thereby enabling more sophisticated speech role-playing. Furthermore, we propose SpeechRole-Eval, a multidimensional evaluation benchmark that systematically assesses SRPA performance in key aspects such as fundamental interaction ability, speech expressiveness, and role-playing fidelity. Experimental results reveal the advantages and challenges of both cascaded and end-to-end speech role-playing agents in maintaining vocal style consistency and role coherence. We release all data, code, and baseline models to provide a solid foundation for speech-driven multimodal role-playing research and to foster further developments in this field.
Problem

Research questions and friction points this paper is trying to address.

Lack of speech modality focus in role-playing agent research
Absence of systematic evaluation for Speech Role-Playing Agents
Need for diverse vocal characteristics in speech role-playing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset with 98 diverse roles
Multidimensional evaluation benchmark for SRPAs
Cascaded and end-to-end agent analysis
Changhao Jiang
Fudan University
Jiajun Sun
Fudan University
Yifei Cao
Fudan University
Jiabao Zhuang
Fudan University
Hui Li
Fudan University
Xiaoran Fan
Fudan University
Ming Zhang
Fudan University
Junjie Ye
Fudan University
Shihan Dou
Fudan University
Zhiheng Xi
Fudan University
Jingqi Tong
Fudan University
Yilong Wu
Fudan University
Baoyu Fan
IEIT Systems Co., Ltd.
Zhen Wang
Douyin Co., Ltd.
Tao Liang
Douyin Co., Ltd.
Zhihui Fei
Douyin Co., Ltd.
Mingyang Wan
Douyin Co., Ltd.
Guojun Ma
Douyin Co., Ltd.
Tao Ji
Renmin University of China
Tao Gui
Fudan University
Qi Zhang
Fudan University
Xuanjing Huang
Fudan University