🤖 AI Summary
This study addresses the lack of joint, interpretable semantic and acoustic evaluation for speech-to-speech (S2S) large language models. Methodologically, it proposes the first end-to-end, multi-dimensional, interpretable evaluation framework, built on: (1) a semantic-acoustic joint modeling paradigm that integrates speech representations with textual semantics; (2) rationale-based (chain-of-reasoning) supervision to make evaluation decisions interpretable; (3) SpeechFeedback, a synthesized high-quality preference dataset that mitigates the scarcity of speech preference annotations; and (4) a two-stage training paradigm that, together with SpeechFeedback, combines multimodal feature fusion with rationale-supervised preference learning. The framework achieves 82.79% agreement with human evaluators, outperforming cascaded approaches by at least 7.42% and speech-LLM baselines by at least 26.20%, and provides a practical tool for rigorous, interpretable S2S model assessment.
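To make the "multi-dimensional, interpretable" output concrete, here is a minimal sketch of what a per-response verdict from such a judge could look like. The schema, field names, and 1-5 score scale are assumptions for illustration only, not the paper's actual output format.

```python
# Minimal sketch (hypothetical output schema, not the paper's format): a multi-aspect,
# explainable verdict that a judge model like SageLM could emit for a spoken response.
import json

verdict = {
    "semantic": {    # content-level dimension of the spoken response
        "score": 4,  # assumed 1-5 scale; the summary does not specify one
        "rationale": "The answer addresses the question directly and contains no factual errors.",
    },
    "acoustic": {    # speech-level dimension (prosody, naturalness, clarity)
        "score": 3,
        "rationale": "Delivery is intelligible, but prosody sounds flat toward the end.",
    },
    "preference": "A",  # overall pairwise verdict between two candidate responses
}

print(json.dumps(verdict, indent=2))
```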
📝 Abstract
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
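The headline metric is the agreement rate between the judge model's verdicts and human preferences. The sketch below shows how such a rate is computed in general; the names `PairwiseExample` and `agreement_rate` are hypothetical and the toy data is illustrative, not the paper's evaluation code or results.

```python
# Minimal sketch (hypothetical names, not the paper's code): computing the agreement
# rate between a judge model's pairwise verdicts and human preference annotations.
from dataclasses import dataclass

@dataclass
class PairwiseExample:
    human_choice: str  # "A" or "B": response preferred by the human annotator
    judge_choice: str  # "A" or "B": response preferred by the judge model
    rationale: str     # free-text explanation emitted alongside the verdict

def agreement_rate(examples: list) -> float:
    """Fraction of examples on which the judge's preference matches the human's."""
    if not examples:
        return 0.0
    matches = sum(ex.judge_choice == ex.human_choice for ex in examples)
    return matches / len(examples)

# Toy usage: 3 of 4 verdicts agree, giving 75% agreement.
examples = [
    PairwiseExample("A", "A", "Response A is more faithful to the question."),
    PairwiseExample("B", "B", "Response B has better prosody and no factual errors."),
    PairwiseExample("A", "B", "Response B is more fluent despite a minor omission."),
    PairwiseExample("B", "B", "Response B follows the instruction more closely."),
]
print(f"Agreement rate: {agreement_rate(examples):.2%}")
```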