AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

📅 2025-09-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the lack of high-quality multimodal datasets and standardized evaluation protocols for Audio Role-Playing (ARP), this work introduces AudioRole, a large-scale TV drama audio-text dataset drawn from 13 series, with more than 1,000 hours of audio, over 115 main characters, and more than one million dialogues, in which semantic content and vocal characteristics are aligned. We further propose ARP-Eval, an evaluation framework for ARP that assesses both response quality and character fidelity. Building on GLM-4-Voice, we train a series of ARP-Model variants using the dataset's speaker identity annotations and contextual metadata. In experiments, ARP-Model reaches an average Acoustic Personalization score of 0.31, outperforming both the original GLM-4-Voice and MiniCPM-O-2.6, and a Content Personalization score of 0.36, which is about 38% above the untrained original model and on par with MiniCPM-O-2.6.

📝 Abstract

The creation of high-quality multimodal datasets remains fundamental for advancing role-playing capabilities in large language models (LLMs). While existing works predominantly focus on text-based persona simulation, Audio Role-Playing (ARP) presents unique challenges due to the need for synchronized alignment of semantic content and vocal characteristics. To address this gap, we propose AudioRole, a meticulously curated dataset from 13 TV series spanning 1K+ hours with 1M+ character-grounded dialogues, providing synchronized audio-text pairs annotated with speaker identities and contextual metadata. In addition, to demonstrate the effectiveness of the dataset, we introduce ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation shows that GLM-4-Voice trained on AudioRole (which we call ARP-Model) achieves an average Acoustic Personalization score of 0.31, significantly outperforming the original GLM-4-Voice and the more powerful MiniCPM-O-2.6, which specifically supports role-playing in one-shot scenarios. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38% and matching the level of MiniCPM-O-2.6. The AudioRole release comprises dialogues from over 115 main characters, 6 trained ARP-Models that role-play different characters, and evaluation protocols. Together, they provide an essential resource for advancing audio-grounded role-playing research.
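
The "about 38%" figure is a relative improvement, so the abstract's numbers imply an approximate Content Personalization score for the untrained model. The sketch below only works through that arithmetic. The function names are illustrative, and the baseline value it prints is inferred from the rounded figures in the abstract, not a score reported by the paper.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, as a fraction (0.38 means 38%)."""
    return (new - old) / old


def implied_baseline(new: float, gain: float) -> float:
    """Invert the relative-gain formula: old = new / (1 + gain)."""
    return new / (1.0 + gain)


# Figures quoted in the abstract (both are rounded there).
arp_content_score = 0.36   # ARP-Model Content Personalization
reported_gain = 0.38       # "about 38%" over the untrained original model

# Inferred, not reported: the baseline score consistent with those two numbers.
baseline = implied_baseline(arp_content_score, reported_gain)
print(f"implied baseline: {baseline:.2f}")
```

Because both inputs are rounded, the implied baseline should be read as a rough consistency check, not as a measured value.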

Problem

Research questions and friction points this paper is trying to address.

Developing multimodal datasets for audio role-playing in LLMs

Addressing semantic and vocal synchronization challenges in ARP

Creating evaluation frameworks for response quality and role fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created AudioRole dataset with synchronized audio-text pairs

Introduced ARP-Eval framework for dual-aspect evaluation

Trained GLM-4-Voice on AudioRole to produce ARP-Model, which achieves improved personalization scores
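
To make "synchronized audio-text pairs annotated with speaker identities and contextual metadata" more concrete, here is a minimal sketch of what a single record might look like. Every field name, the `DialogueTurn` class, and the `turns_for_character` helper are hypothetical and chosen for illustration only. The actual AudioRole schema is not described in this summary, and the sample values are placeholders, not dataset content.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueTurn:
    """One hypothetical audio-text pair; all field names are assumptions."""
    series: str        # TV series the clip comes from
    speaker: str       # speaker identity annotation
    text: str          # transcript of the utterance
    audio_path: str    # path to the aligned audio clip
    start_s: float     # clip start time in seconds
    end_s: float       # clip end time in seconds
    context: dict = field(default_factory=dict)  # contextual metadata

    @property
    def duration_s(self) -> float:
        """Length of the clip in seconds."""
        return self.end_s - self.start_s


def turns_for_character(turns: list[DialogueTurn], speaker: str) -> list[DialogueTurn]:
    """Select every turn spoken by one character, e.g. to train one role model."""
    return [t for t in turns if t.speaker == speaker]


# Placeholder records, not real dataset entries.
turns = [
    DialogueTurn("series_a", "character_x", "Hello.", "clips/0001.wav", 0.0, 1.5),
    DialogueTurn("series_a", "character_y", "Hi there.", "clips/0002.wav", 1.5, 3.0),
    DialogueTurn("series_a", "character_x", "How are you?", "clips/0003.wav", 3.0, 5.0),
]
character_x_turns = turns_for_character(turns, "character_x")
total_seconds = sum(t.duration_s for t in character_x_turns)
```

Grouping turns by speaker in this way reflects the idea of training separate models that each role-play one character, though the paper's actual data pipeline may differ.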

🔎 Similar Papers

BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model

2024-08-20 | arXiv.org | Citations: 6