Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current automatic evaluation methods for speech-to-speech (S2S) models, which rely on costly and opaque audio language models (ALMs) that struggle to balance efficiency, interpretability, and alignment with human judgments. To overcome these challenges, the authors propose TRACE, a novel framework that generates structured textual blueprints from acoustic cues to guide a purely text-based large language model (LLM) in performing disentangled reasoning across three dimensions—content, voice quality, and paralinguistics—and then fuses these assessments into an overall score. Leveraging a newly introduced human chain-of-thought (HCoT) annotation protocol, TRACE endows LLMs with fine-grained audio evaluation capabilities for the first time. Experiments show that TRACE correlates with human ratings more strongly than both ALM-based and transcript-only LLM approaches, while substantially reducing computational cost.

📝 Abstract
Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
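The abstract's pipeline — render cheap audio signals into a textual blueprint, elicit dimension-wise judgments (C, VQ, P) from a text-only LLM judge, and fuse them into an overall rating with a deterministic policy — can be sketched as follows. This is an illustrative sketch only: the paper does not specify its blueprint fields, score scale, or fusion policy, so the cue names, the 1–5 scale, and the weighted-average fusion below are all assumptions.

```python
def build_blueprint(transcript: str, audio_cues: dict) -> str:
    """Render inexpensive audio signals into a textual blueprint that a
    text-only LLM judge can reason over. Field names are hypothetical."""
    return (
        f"Transcript: {transcript}\n"
        f"Signal-to-noise ratio (dB): {audio_cues['snr_db']}\n"
        f"Speaking rate (words/sec): {audio_cues['speaking_rate']}\n"
        f"Mean pitch (Hz): {audio_cues['pitch_hz']}\n"
    )

def fuse_scores(content: float, voice_quality: float, paralinguistics: float) -> float:
    """Deterministic fusion policy (assumed): a fixed weighted average of
    1-5 dimension scores. The actual policy in the paper may differ."""
    weights = {"C": 0.5, "VQ": 0.25, "P": 0.25}  # hypothetical weights
    overall = (weights["C"] * content
               + weights["VQ"] * voice_quality
               + weights["P"] * paralinguistics)
    return round(overall, 2)
```

In use, the blueprint string would be embedded in a judging prompt, the LLM would return one score per dimension, and `fuse_scores` would map those scores deterministically to the overall rating, keeping the final aggregation transparent and reproducible.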
Problem

Research questions and friction points this paper is trying to address.

Speech-to-Speech evaluation
Audio Language Models
Large Language Models
human-aligned evaluation
dimensional assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-to-Speech evaluation
Large Language Model
Audio Cues
Human Chain-of-Thought
Dimension-First Evaluation