SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

📅 2024-06-19
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) lack effective modeling of paralinguistic and environmental cues, such as emotion, accent, age, and background sound, in spoken dialogue understanding, and no systematic open-source benchmark or evaluation framework exists for this dimension. To address this gap, the authors introduce SD-Eval, a benchmark dataset for multidimensional evaluation of spoken dialogue understanding and generation beyond words, comprising 7,303 utterances (8.76 hours of speech) aggregated from eight public datasets across four perspectives: emotion, accent, age, and background sound. Evaluation combines objective metrics (e.g., BLEU and ROUGE), human subjective assessment, and LLM-based scoring; correlation analysis shows the LLM-based scores achieve a 0.82 Pearson correlation with human judgments, 37% higher than conventional metrics. Models conditioned on paralinguistic and environmental information consistently outperform their unconditioned counterparts in both objective and subjective measures, demonstrating the benchmark's value for advancing paralinguistic reasoning in LLMs.
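The reported alignment between LLM-based scoring and human judgment comes down to a correlation analysis over per-response scores. The sketch below is illustrative only, not the authors' evaluation code: the score values, scales, and the scipy-based implementation are assumptions.

```python
# Minimal sketch: correlating LLM-based judge scores with human ratings.
# Assumes two aligned lists of per-response scores (hypothetical values below).
from scipy.stats import pearsonr

human_scores = [4.0, 2.5, 3.0, 4.5, 1.5, 3.5]  # e.g., mean opinion scores from annotators
llm_scores   = [4.2, 2.0, 3.1, 4.4, 1.8, 3.0]  # e.g., ratings from an LLM judge

r, p_value = pearsonr(human_scores, llm_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
# A higher r for LLM-based metrics than for BLEU/ROUGE against the same human
# ratings is what the claimed better correlation with human evaluation means.
```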

📝 Abstract
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a process similar to that of SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate that LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.
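The abstract names BLEU and ROUGE as the objective metrics for generated responses. A minimal sketch of such scoring is shown below, using sacrebleu and rouge-score as stand-in libraries; the paper does not specify its implementation, and the example responses are hypothetical.

```python
# Minimal sketch of objective evaluation with BLEU and ROUGE-L (assumed libraries).
import sacrebleu
from rouge_score import rouge_scorer

# Hypothetical generated response and reference response for one test utterance.
hypotheses = ["I'm sorry you're upset, would you like to talk about it?"]
references = ["I can hear you're upset. Do you want to tell me what happened?"]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], hypotheses[0])["rougeL"].fmeasure
print(f"ROUGE-L F1: {rouge_l:.3f}")
```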
Problem

Research questions and friction points this paper is trying to address.

dialogue understanding
language models
evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

SD-Eval
Emotion and Accent Recognition
Human-like Evaluation Standard