Audio-Aware Large Language Models as Judges for Speaking Styles

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods for spoken language models (SLMs) lack fine-grained, multidimensional assessment of prosodic and paralinguistic attributes such as emotion, loudness, speaking rate, stress, pitch, and non-verbal cues. Method: The work uses audio-aware large language models (ALLMs), specifically GPT-4o-audio and Gemini-2.5-pro, as automated evaluators, benchmarked against human raters, to assess four SLMs on voice-style instruction following and role-playing along six stylistic dimensions. Contribution/Results: Gemini-2.5-pro reaches a Krippendorff's α of 0.72 with human annotators, comparable to inter-human agreement (α = 0.75), which supports the viability of ALLMs as reliable automatic evaluators. The study also exposes limitations in current SLMs' fine-grained prosodic control and conversational naturalness, and it establishes a reproducible evaluation paradigm and open benchmark for multidimensional assessment of speaking-style fidelity.
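
The headline agreement numbers are Krippendorff's α values computed over rater-by-sample score matrices. As a rough illustration of how such a statistic is obtained, here is a minimal sketch using the open-source krippendorff Python package; the rating matrix, the 1-5 scale, and the choice of ordinal measurement are illustrative assumptions, not the paper's actual data or configuration.

```python
# pip install krippendorff numpy
import numpy as np
import krippendorff

# Rows = raters (e.g., two humans and one ALLM judge); columns = evaluated
# speech samples. The 1-5 scores below are invented for illustration;
# np.nan marks an item a rater did not score.
ratings = np.array([
    [4, 3, 5, 2, 4, np.nan],  # human rater A
    [4, 3, 4, 2, 5, 3],       # human rater B
    [5, 3, 4, 2, 4, 3],       # ALLM judge (e.g., Gemini-2.5-pro)
])

# Ordinal level treats scores as ranked categories; whether the paper uses
# ordinal, interval, or nominal alpha is an assumption here.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```

Recomputing α over different rater subsets (humans only vs. an ALLM judge plus humans) is one way a comparison like the reported α = 0.72 vs. α = 0.75 can be made.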

📝 Abstract
Audio-aware large language models (ALLMs) can understand both the textual and the non-textual information in an audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking styles of speech. We use ALLM judges to evaluate the speech generated by spoken language models (SLMs) on two tasks: voice-style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and have both humans and ALLMs judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, against human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement among human evaluators. These promising results suggest that ALLMs can serve as judges for evaluating SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling speaking style and generating natural dialogue.
Problem

Research questions and friction points this paper is trying to address.

Assessing speaking styles in speeches using ALLMs
Evaluating SLMs on style control and dialogue generation
Comparing ALLM and human judges for speech assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-aware LLMs assess speaking styles automatically (a minimal judge-call sketch follows this list)
Compare GPT-4o-audio and Gemini-2.5-pro judges against human raters
Evaluate six style dimensions in generated speech: emotion, volume, pace, emphasis, pitch, and non-verbal elements
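
To make the judging setup concrete, below is a minimal sketch of a single ALLM-judge call, assuming OpenAI's chat-completions audio-input format; the model snapshot name, rubric wording, and file path are stand-ins rather than the paper's actual prompt or pipeline.

```python
# pip install openai
import base64
from openai import OpenAI

# Stand-in rubric covering the paper's six style dimensions; the actual
# judging prompt used in the paper is not reproduced here.
RUBRIC = (
    "You are judging a generated speech sample. Rate its emotion, volume, "
    "speaking pace, word emphasis, pitch control, and non-verbal elements "
    "on a 1-5 scale, then return the six scores as JSON."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load and base64-encode the SLM's spoken response (hypothetical file name).
with open("slm_response.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # an audio-capable GPT-4o snapshot
    modalities=["text"],           # audio in, text scores out
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": RUBRIC},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(completion.choices[0].message.content)
```

A Gemini-2.5-pro judge would follow the same pattern through Google's API, with the audio attached to the request and the rubric asking for per-dimension scores.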