AI Summary
This work addresses the lack of systematic evaluation of speaking-style intensity control in spoken language models within multi-turn dialogues. To bridge this gap, we introduce StyleBench, the first benchmark specifically designed for evaluating style control in conversational speech synthesis. StyleBench comprises a multi-turn dialogue dataset annotated along four stylistic dimensions (emotion, speech rate, volume, and pitch) and incorporates a user-prompt-driven mechanism for fine-grained style intensity control. Through comprehensive stylistic annotations and automated evaluation metrics, StyleBench establishes a standardized framework that reveals a significant performance gap between current spoken language models and omni language models in controllable style generation. This benchmark provides both a diagnostic tool and a foundation to guide future research in controllable and expressive spoken dialogue systems.
Abstract
Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking-style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantify and evaluate style intensity control in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating style intensity control across four dimensions: emotion, speed, volume, and pitch. Our results reveal performance gaps between leading SLMs and omni language models (OLMs), pointing to the underlying causes and promising directions for future exploration.