🤖 AI Summary
This study addresses the lack of joint, interpretable semantic and acoustic evaluation for speech-to-speech (S2S) large language models. Methodologically, it proposes the first end-to-end, multi-dimensional, interpretable evaluation framework, built on: (1) a semantic-acoustic joint modeling paradigm that integrates speech representations with textual semantics; (2) rationale-based (chain-of-reasoning) supervision to make evaluation decisions interpretable; (3) SpeechFeedback, a synthesized high-quality preference dataset that mitigates the scarcity of speech preference annotations; and (4) a two-stage training paradigm that, together with SpeechFeedback, combines multimodal feature fusion with rationale-supervised preference learning. The framework achieves 82.79% agreement with human evaluators, outperforming cascaded approaches by at least 7.42% and speech-LLM baselines by at least 26.20%, and provides a practical tool for rigorous, interpretable S2S model assessment.
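To make the "multi-dimensional, interpretable" output concrete, here is a minimal sketch of what a per-response verdict from such a judge could look like. The schema, field names, and 1-5 score scale are assumptions for illustration only, not the paper's actual output format.

```python
# Minimal sketch (hypothetical output schema, not the paper's format): a multi-aspect,
# explainable verdict that a judge model like SageLM could emit for a spoken response.
import json

verdict = {
    "semantic": {    # content-level dimension of the spoken response
        "score": 4,  # assumed 1-5 scale; the summary does not specify one
        "rationale": "The answer addresses the question directly and contains no factual errors.",
    },
    "acoustic": {    # speech-level dimension (prosody, naturalness, clarity)
        "score": 3,
        "rationale": "Delivery is intelligible, but prosody sounds flat toward the end.",
    },
    "preference": "A",  # overall pairwise verdict between two candidate responses
}

print(json.dumps(verdict, indent=2))
```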
📝 Abstract
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
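The headline metric is the agreement rate between the judge model's verdicts and human preferences. The sketch below shows how such a rate is computed in general; the names `PairwiseExample` and `agreement_rate` are hypothetical and the toy data is illustrative, not the paper's evaluation code or results.

```python
# Minimal sketch (hypothetical names, not the paper's code): computing the agreement
# rate between a judge model's pairwise verdicts and human preference annotations.
from dataclasses import dataclass

@dataclass
class PairwiseExample:
    human_choice: str  # "A" or "B": response preferred by the human annotator
    judge_choice: str  # "A" or "B": response preferred by the judge model
    rationale: str     # free-text explanation emitted alongside the verdict

def agreement_rate(examples: list) -> float:
    """Fraction of examples on which the judge's preference matches the human's."""
    if not examples:
        return 0.0
    matches = sum(ex.judge_choice == ex.human_choice for ex in examples)
    return matches / len(examples)

# Toy usage: 3 of 4 verdicts agree, giving 75% agreement.
examples = [
    PairwiseExample("A", "A", "Response A is more faithful to the question."),
    PairwiseExample("B", "B", "Response B has better prosody and no factual errors."),
    PairwiseExample("A", "B", "Response B is more fluent despite a minor omission."),
    PairwiseExample("B", "B", "Response B follows the instruction more closely."),
]
print(f"Agreement rate: {agreement_rate(examples):.2%}")
```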