SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of joint, interpretable semantic and acoustic evaluation for speech-to-speech (S2S) large language models. Methodologically, the authors propose the first end-to-end, multi-dimensional interpretable evaluation framework: (1) a semantic-acoustic joint modeling paradigm integrating speech representations and textual semantics; (2) chain-of-reasoning supervision to enhance decision interpretability; (3) synthesis of a high-quality preference dataset, SpeechFeedback, to mitigate annotation scarcity; and (4) a two-stage training strategy combining multimodal feature fusion with rule-guided reinforcement learning. The framework achieves 82.79% agreement with human evaluations, outperforming cascaded approaches by at least 7.42% and SLM baselines by at least 26.20%. It establishes a new benchmark and practical tool for rigorous, interpretable S2S model assessment.

📝 Abstract
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
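The headline metric above is an agreement rate with human evaluators. The paper does not spell out the exact computation here, but a pairwise agreement rate is typically the fraction of comparisons where the judge model's preferred response matches the human annotator's choice; a minimal sketch (function and labels are illustrative, not from the paper):

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge's preference matches the human's.

    Both arguments are equal-length sequences of preference labels,
    e.g. "A" or "B" for pairwise response comparisons.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Example: the judge agrees with humans on 4 of 5 pairwise comparisons.
judge = ["A", "B", "A", "A", "B"]
human = ["A", "B", "B", "A", "B"]
print(agreement_rate(judge, human))  # 0.8
```

Under this convention, SageLM's reported 82.79% means its preference matched the human label on roughly 83 of every 100 evaluated comparisons.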
Problem

Research questions and friction points this paper is trying to address.

Evaluating speech-to-speech LLMs with both semantic and acoustic dimensions
Enhancing explainability through rationale-based supervision in speech evaluation
Addressing speech preference data scarcity with synthetic datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly assesses semantic and acoustic dimensions
Leverages rationale-based supervision for explainability
Uses synthetic dataset and two-stage training paradigm
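The multi-aspect, explainable judgements described above can be pictured as a structured record pairing per-dimension scores with a rationale. SageLM's actual output schema is not given in this summary; the field names below are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass
class SpeechJudgement:
    """Hypothetical record for a multi-aspect, explainable speech evaluation."""
    semantic_score: float   # e.g. relevance and correctness of response content
    acoustic_score: float   # e.g. prosody, naturalness, speaker consistency
    rationale: str          # chain-of-reasoning explanation behind the scores

judgement = SpeechJudgement(
    semantic_score=4.0,
    acoustic_score=3.5,
    rationale="Answer is on-topic and factually correct; prosody is slightly flat.",
)
print(judgement.semantic_score, judgement.acoustic_score)  # 4.0 3.5
```

The point of the rationale field is what distinguishes this setup from a bare reward model: each score is accompanied by an explanation that can be inspected or used as supervision.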