ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing role-playing research: it is confined to text and therefore cannot support natural spoken interaction. To overcome this, the authors propose ActorMind, a multi-agent reasoning framework designed for speech role-playing. ActorMind combines four specialized agents (Eye, Ear, Brain, and Mouth) that work in a chain-of-thought style: they read the role description, pick up emotional cues from the spoken dialogue context, describe an emotional state, and deliver the script with that emotion. The work also introduces ActorMindBench, a hierarchical benchmark for speech role-playing comprising 7,653 utterances, 313 scenes, and 6 roles. Experimental results demonstrate that ActorMind is effective in enhancing speech role-playing.
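To make the three benchmark levels concrete, here is a minimal Python sketch of how a role/scene/utterance hierarchy could be represented. The class names, field names, and the choice to nest scenes under roles are all assumptions made for illustration; the source gives only the level names and their sizes, not the actual ActorMindBench data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Utterance:
    # Hypothetical fields: the real benchmark schema is not given in the source.
    audio_path: str
    transcript: str


@dataclass
class Scene:
    scene_id: str
    utterances: List[Utterance] = field(default_factory=list)


@dataclass
class Role:
    name: str
    scenes: List[Scene] = field(default_factory=list)


def count_levels(roles: List[Role]) -> Tuple[int, int, int]:
    """Return (number of roles, number of scenes, number of utterances)."""
    n_scenes = sum(len(role.scenes) for role in roles)
    n_utterances = sum(
        len(scene.utterances) for role in roles for scene in role.scenes
    )
    return len(roles), n_scenes, n_utterances


# Toy data only; these are not entries from ActorMindBench.
toy = [
    Role(
        name="example_role",
        scenes=[
            Scene("s1", [Utterance("s1_u1.wav", "Hello."), Utterance("s1_u2.wav", "Hi.")]),
            Scene("s2", [Utterance("s2_u1.wav", "Goodbye.")]),
        ],
    )
]
```

Under this assumed layout, running `count_levels` over the full benchmark would be expected to report the 6 roles, 313 scenes, and 7,653 utterances stated above.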

📝 Abstract

Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and the spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought-style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues within the contextual spoken dialogue through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally, the Mouth Agent delivers the script infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
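The Eye, Ear, Brain, Mouth sequence described in the abstract can be sketched as a simple four-stage pipeline. This is only an illustration of the control flow, not the authors' implementation: every function name, the `Turn` type, and the `emotion_hint` field are hypothetical, and each stage is a trivial stand-in for what would be a model call (speech understanding in the Ear Agent, speech synthesis in the Mouth Agent).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    speaker: str
    text: str
    # Stand-in for cues a real Ear Agent would infer from the audio itself.
    emotion_hint: str


def eye_agent(role_description: str) -> str:
    """Read the assigned role description (here: only normalize whitespace)."""
    return " ".join(role_description.split())


def ear_agent(dialogue: List[Turn]) -> List[str]:
    """Collect emotional cues from the spoken context (here: read stored hints)."""
    return [f"{turn.speaker}: {turn.emotion_hint}" for turn in dialogue]


def brain_agent(role: str, cues: List[str]) -> str:
    """Describe an emotional state from the role and the collected cues."""
    return f"As '{role}', respond to cues [{'; '.join(cues)}]"


def mouth_agent(script: str, emotional_state: str) -> Dict[str, str]:
    """Pair the script with the emotional state; a real system would emit speech."""
    return {"script": script, "style": emotional_state}


def actor_pipeline(role_description: str, dialogue: List[Turn], script: str) -> Dict[str, str]:
    """Run the four stages in the order given in the abstract."""
    role = eye_agent(role_description)
    cues = ear_agent(dialogue)
    state = brain_agent(role, cues)
    return mouth_agent(script, state)


result = actor_pipeline(
    "  a weary   detective ",
    [Turn("Anna", "Where were you?", "angry")],
    "I was at the docks.",
)
```

The point of the sketch is the data flow: the script itself passes through unchanged, while the role description and the dialogue context only shape the emotional state that conditions delivery.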

Problem

Research questions and friction points this paper is trying to address.

speech role-playing

role-playing

spoken dialogue

human-machine interaction

verbal traits

Innovation

Methods, ideas, or system contributions that make the work stand out.

speech role-playing

multi-agent reasoning

chain-of-thought