AI Summary
This work addresses the limitations of existing role-playing research, which is predominantly confined to textual modalities and thus unable to support natural spoken interaction. To overcome this, we propose ActorMind, a novel multi-agent reasoning framework specifically designed for voice-based role-playing. ActorMind integrates four specialized agents (Eye, Ear, Brain, and Mouth) that collaboratively synthesize character profiles, situational context, and vocal emotional cues through a chain-of-thought mechanism to generate expressive, personality-rich spoken responses. We also introduce ActorMindBench, the first hierarchical benchmark for voice role-playing, comprising 7,653 audio utterances across 313 scenarios and six character archetypes. Experimental results demonstrate that ActorMind significantly enhances the authenticity and expressiveness of voice-driven role-playing interactions.
Abstract
Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities and neglects speech, which plays a predominant role in daily life; this limits genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and the spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought-style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues within the contextual spoken dialogue through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally, the Mouth Agent delivers the script infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
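To make the four-stage flow concrete, the following is a minimal sketch of how the Eye, Ear, Brain, and Mouth stages could be chained. It is not the authors' implementation: the names `SpokenTurn`, `SpokenResponse`, and `run_pipeline`, the `audio_path` field, the stub agents, and the choice to pass the scene to the Brain stage are all assumptions made for illustration. In a real system each stub would be replaced by a model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpokenTurn:
    """One turn of contextual spoken dialogue (hypothetical structure)."""
    speaker: str
    transcript: str
    audio_path: str  # assumed field: location of the utterance audio


@dataclass
class SpokenResponse:
    """Script paired with the emotional state it should be delivered in."""
    script: str
    emotion_state: str


# Each agent is modelled as a plain callable so real models can be swapped in.
EyeAgent = Callable[[str], str]                     # role description -> role profile
EarAgent = Callable[[List[SpokenTurn]], List[str]]  # dialogue -> one emotional cue per turn
BrainAgent = Callable[[str, str, List[str]], str]   # profile, scene, cues -> emotional state
MouthAgent = Callable[[str, str], SpokenResponse]   # script, emotional state -> response


def run_pipeline(
    role_description: str,
    scene: str,
    dialogue: List[SpokenTurn],
    script: str,
    eye: EyeAgent,
    ear: EarAgent,
    brain: BrainAgent,
    mouth: MouthAgent,
) -> SpokenResponse:
    """Chain the four stages in the order given in the abstract."""
    profile = eye(role_description)              # 1. read the assigned role
    cues = ear(dialogue)                         # 2. pick up emotional cues in the dialogue
    emotion_state = brain(profile, scene, cues)  # 3. describe the emotional state
    return mouth(script, emotion_state)          # 4. deliver the script with that state


# Placeholder agents (assumptions, for demonstration only).
def stub_eye(role_description: str) -> str:
    return role_description.strip()


def stub_ear(dialogue: List[SpokenTurn]) -> List[str]:
    return ["neutral" for _ in dialogue]


def stub_brain(profile: str, scene: str, cues: List[str]) -> str:
    return f"{profile} | {scene} | cues={','.join(cues)}"


def stub_mouth(script: str, emotion_state: str) -> SpokenResponse:
    return SpokenResponse(script=script, emotion_state=emotion_state)
```

Keeping the stages as independent callables mirrors the "off-the-shelf" framing: any stage can be replaced without touching the others, provided it keeps the same inputs and outputs.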