Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional metrics in evaluating conversational automatic speech recognition (ASR) systems under challenging conditions such as overlapping speech, far-field noise, and multi-speaker scenarios, where semantic errors are poorly captured by edit-distance–based measures. To this end, the authors propose a fine-grained evaluation framework that introduces tcpSemER—a metric based on semantic embedding similarity rather than word-level edit distance—and decomposes tcpWER into overlapping and non-overlapping segments to jointly assess semantic fidelity and robustness to overlap. Through systematic comparisons across three datasets, they find that large language model–based approaches are competitive with modular pipelines in two-party conversations but degrade significantly as the number of speakers increases or overlap intensifies, whereas modular systems remain more robust under these conditions.

📝 Abstract
Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.
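The core idea behind tcpSemER — scoring hypotheses by embedding-based semantic similarity instead of Levenshtein distance — can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a toy bag-of-words vector as a stand-in for a real sentence-embedding model, and all function names here are illustrative assumptions.

```python
# Hedged sketch of the idea behind tcpSemER: score a hypothesis by
# 1 - cosine(embed(ref), embed(hyp)) instead of word-level edit distance.
# ASSUMPTION: a bag-of-words Counter stands in for a real sentence encoder;
# the actual metric would use a learned embedding model and time-constrained
# segment alignment (as in tcpWER), which this sketch omits.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (proxy for a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_error(reference: str, hypothesis: str) -> float:
    """Semantic error in [0, 1]: 0 when meaning is preserved exactly."""
    return 1.0 - cosine(embed(reference), embed(hypothesis))

ref = "please book a flight to boston"
hyp_minor = "please book a flight to boston please"  # small insertion
hyp_wrong = "please cook a fright to austin"         # meaning destroyed
print(semantic_error(ref, hyp_minor))  # small error
print(semantic_error(ref, hyp_wrong))  # much larger error
```

The contrast with plain WER is the point: both hypotheses have similar edit distances from the reference, but only the second one alters the meaning, and an embedding-based score separates the two cases.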
Problem

Research questions and friction points this paper is trying to address.

Conversational ASR
Overlapping Speech
Semantic Fidelity
Speaker Count
Robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

tcpSemER
semantic similarity
overlapping speech
conversational ASR
speaker diarization