Improving endpoint detection in end-to-end streaming ASR for conversational speech

📅 2025-05-19

🤖 AI Summary

To address high endpoint detection (EP) latency and frequent false triggers in streaming end-to-end automatic speech recognition (ASR) based on the Transducer architecture for conversational speech, this paper proposes a collaborative optimization framework. First, we introduce a novel end-of-word token to explicitly model word-level termination boundaries. Second, we design a delay-penalized loss function that jointly optimizes ASR accuracy and EP temporal precision. Third, we incorporate a lightweight auxiliary voice activity detection (VAD) network to enhance frame-level VAD robustness. These components synergistically improve EP real-time performance and reliability. Evaluated on the Switchboard dataset, our method reduces average EP latency by 23% and false trigger rate by 31% compared to a baseline delay-penalized approach. Both user-perceived latency and endpoint accuracy show significant improvement, demonstrating the effectiveness and practicality of our framework in realistic conversational scenarios.

📝 Abstract

ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which can lead to errors or delays in EP. Inaccurate EP cuts the user off mid-utterance and returns an incomplete transcript, while delayed EP increases perceived latency; both degrade the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address delayed emission, we introduce an end-of-word token at the end of each word, along with a delay penalty. We address EP delay by obtaining reliable frame-level speech activity detection from an auxiliary network. We apply the proposed methods to the Switchboard conversational speech corpus and evaluate them against a delay-penalty baseline.
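The end-of-word token mentioned in the abstract can be pictured as a change to the training targets: a boundary symbol is appended after every word before the transcript is passed to the tokenizer. The snippet below is a minimal sketch under stated assumptions, not the authors' implementation. The token string `<eow>`, the whitespace word split, and the function name `add_end_of_word_tokens` are all hypothetical; the abstract does not specify them.

```python
from typing import List

# Hypothetical symbol; the abstract does not give the actual token string.
EOW = "<eow>"


def add_end_of_word_tokens(transcript: str, eow: str = EOW) -> List[str]:
    """Return the word sequence with an end-of-word marker after every word.

    Words are split on whitespace, which is an assumption made for
    illustration; a real system would apply its own text normalization
    and subword tokenization.
    """
    tokens: List[str] = []
    for word in transcript.split():
        tokens.append(word)
        tokens.append(eow)
    return tokens


if __name__ == "__main__":
    print(add_end_of_word_tokens("yeah i know"))
```

With targets of this form, the model has an explicit symbol to emit once a word is complete, which is the kind of signal an endpointer can key on.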

Problem

Research questions and friction points this paper is trying to address.

Address delayed emission in streaming ASR outputs

Improve endpoint detection accuracy for conversational speech

Reduce perceived latency in ASR endpointing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduce end-of-word token with delay penalty

Use auxiliary network for speech activity detection

Evaluate the methods on the Switchboard conversational speech corpus
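To make the role of frame-level speech activity detection concrete, the sketch below shows one common way such posteriors can drive an endpoint decision: declare an endpoint once speech has been observed and a run of consecutive non-speech frames follows. This is an illustrative threshold-and-hangover rule, not the decision logic from the paper. The function name `detect_endpoint`, the 0.5 threshold, and the `min_silence_frames` parameter are assumptions chosen for the example.

```python
from typing import Optional, Sequence


def detect_endpoint(
    speech_probs: Sequence[float],
    threshold: float = 0.5,        # assumed speech/non-speech cut-off
    min_silence_frames: int = 3,   # assumed trailing-silence requirement
) -> Optional[int]:
    """Return the frame index at which an endpoint is declared, or None.

    An endpoint fires at the first frame that completes a run of
    `min_silence_frames` consecutive non-speech frames occurring after
    at least one speech frame has been seen.
    """
    seen_speech = False
    silence_run = 0
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            seen_speech = True
            silence_run = 0  # speech resets the silence counter
        elif seen_speech:
            silence_run += 1
            if silence_run >= min_silence_frames:
                return i
    return None


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.1, 0.9, 0.1, 0.1, 0.1]
    print(detect_endpoint(probs))
```

A larger `min_silence_frames` lowers the chance of cutting a speaker off during a short pause but adds latency, which is the trade-off between false triggers and perceived delay that the page describes.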
