🤖 AI Summary
This work addresses the challenge of maintaining stylistic consistency in multi-turn role-playing speech synthesis with large audio language models (LALMs), a problem exacerbated by the absence of objective evaluation metrics. To this end, the authors propose Mean Continuation Log-Probability (MCLP), computed via the in-context learning ability of pre-trained LALMs, as a quantitative measure of style consistency and, for the first time, employ it as a reinforcement learning reward signal that guides LALMs toward closer alignment between generated speech and character-specific instructions. Evaluated on a newly curated RP-TTS dataset with fine-grained role and scenario annotations, the proposed approach significantly outperforms existing LALM baselines on both subjective and objective metrics, demonstrating improved consistency and controllability in character voice generation.
📝 Abstract
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Crucially, we leverage the in-context learning capability of pre-trained LALMs to formulate MCLP as a continuation log-probability prediction: the metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to strengthen the style alignment between generated speech and role-play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics.
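To make the metric concrete, here is a minimal sketch of how MCLP could be computed under the continuation formulation described in the abstract. Everything in it is illustrative rather than the paper's actual implementation: the causal decoder interface `lalm` (a token sequence in, next-token logits out) and the discrete speech-token inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def mean_continuation_log_prob(lalm, gen_tokens, gt_tokens):
    """Sketch of MCLP: the mean log-probability the LALM assigns to the
    ground-truth speech tokens when the generated speech tokens are
    provided as an in-context prefix. Higher means more style-consistent.

    Assumed interface: `lalm` is a causal decoder over a discrete
    speech-token vocabulary, mapping a (1, T) LongTensor to (1, T, V)
    next-token logits; `gen_tokens` and `gt_tokens` are 1-D LongTensors.
    """
    # Concatenate: [generated speech | ground-truth speech]
    seq = torch.cat([gen_tokens, gt_tokens]).unsqueeze(0)   # (1, T)
    with torch.no_grad():
        logits = lalm(seq)                                  # (1, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    # Logits at position t predict token t+1, so the ground-truth
    # continuation is scored by positions len(gen)-1 .. len(gen)+n-2.
    start = gen_tokens.numel() - 1
    n = gt_tokens.numel()
    targets = gt_tokens.unsqueeze(0)                        # (1, n)
    token_lp = (log_probs[:, start:start + n, :]
                .gather(-1, targets.unsqueeze(-1))
                .squeeze(-1))                               # (1, n)
    return token_lp.mean().item()                           # scalar MCLP
```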
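And a correspondingly minimal sketch of MCLP as an RL reward, reusing the scorer above. REINFORCE stands in here for whichever policy-gradient algorithm the paper actually uses, and the `policy.sample` interface (returning sampled tokens plus their summed log-probability under the policy) is hypothetical.

```python
def mclp_reinforce_step(policy, optimizer, prompt_tokens, gt_tokens, scorer_lalm):
    # Hypothetical training step: sample speech tokens from the current
    # policy, score them with MCLP against the ground truth, and apply a
    # REINFORCE update. `policy.sample` is an assumed interface, not the
    # paper's actual API.
    gen_tokens, sum_log_prob = policy.sample(prompt_tokens)
    # Detached scalar reward: gradients flow only through the policy's
    # log-probability, as REINFORCE requires.
    reward = mean_continuation_log_prob(scorer_lalm, gen_tokens, gt_tokens)
    loss = -reward * sum_log_prob        # maximize expected MCLP reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```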