🤖 AI Summary
This work addresses the challenge of maintaining stylistic consistency in multi-turn role-playing speech synthesis with large audio language models (LALMs), a problem exacerbated by the absence of objective evaluation metrics. To this end, the authors propose Mean Continuation Log-Probability (MCLP), computed via the in-context learning ability of pre-trained LALMs, as a quantitative measure of style consistency and, for the first time, employ it as a reinforcement learning reward signal that guides LALMs toward closer alignment between generated speech and character-specific instructions. Evaluated on a newly curated RP-TTS dataset with fine-grained role and scenario annotations, the proposed approach significantly outperforms existing LALM baselines on both subjective and objective metrics, demonstrating improved consistency and controllability in character voice generation.
📝 Abstract
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Crucially, we leverage the in-context learning capability of pre-trained LALMs to formulate MCLP as a continuation log-probability prediction: the metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to strengthen the style alignment between generated speech and role-play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics.
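To make the metric concrete, here is a minimal sketch of how MCLP could be computed under the continuation formulation described in the abstract. Everything in it is illustrative rather than the paper's actual implementation: the causal decoder interface `lalm` (a token sequence in, next-token logits out) and the discrete speech-token inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def mean_continuation_log_prob(lalm, gen_tokens, gt_tokens):
    """Sketch of MCLP: the mean log-probability the LALM assigns to the
    ground-truth speech tokens when the generated speech tokens are
    provided as an in-context prefix. Higher means more style-consistent.

    Assumed interface: `lalm` is a causal decoder over a discrete
    speech-token vocabulary, mapping a (1, T) LongTensor to (1, T, V)
    next-token logits; `gen_tokens` and `gt_tokens` are 1-D LongTensors.
    """
    # Concatenate: [generated speech | ground-truth speech]
    seq = torch.cat([gen_tokens, gt_tokens]).unsqueeze(0)   # (1, T)
    with torch.no_grad():
        logits = lalm(seq)                                  # (1, T, V)
    log_probs = F.log_softmax(logits, dim=-1)
    # Logits at position t predict token t+1, so the ground-truth
    # continuation is scored by positions len(gen)-1 .. len(gen)+n-2.
    start = gen_tokens.numel() - 1
    n = gt_tokens.numel()
    targets = gt_tokens.unsqueeze(0)                        # (1, n)
    token_lp = (log_probs[:, start:start + n, :]
                .gather(-1, targets.unsqueeze(-1))
                .squeeze(-1))                               # (1, n)
    return token_lp.mean().item()                           # scalar MCLP
```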
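And a correspondingly minimal sketch of MCLP as an RL reward, reusing the scorer above. REINFORCE stands in here for whichever policy-gradient algorithm the paper actually uses, and the `policy.sample` interface (returning sampled tokens plus their summed log-probability under the policy) is hypothetical.

```python
def mclp_reinforce_step(policy, optimizer, prompt_tokens, gt_tokens, scorer_lalm):
    # Hypothetical training step: sample speech tokens from the current
    # policy, score them with MCLP against the ground truth, and apply a
    # REINFORCE update. `policy.sample` is an assumed interface, not the
    # paper's actual API.
    gen_tokens, sum_log_prob = policy.sample(prompt_tokens)
    # Detached scalar reward: gradients flow only through the policy's
    # log-probability, as REINFORCE requires.
    reward = mean_continuation_log_prob(scorer_lalm, gen_tokens, gt_tokens)
    loss = -reward * sum_log_prob        # maximize expected MCLP reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```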