Improving endpoint detection in end-to-end streaming ASR for conversational speech

📅 2025-05-19
🤖 AI Summary
To address high endpoint detection (EP) latency and frequent false triggers in streaming end-to-end automatic speech recognition (ASR) based on the Transducer architecture for conversational speech, this paper proposes a collaborative optimization framework. First, we introduce a novel end-of-word token to explicitly model word-level termination boundaries. Second, we design a delay-penalized loss function that jointly optimizes ASR accuracy and EP temporal precision. Third, we incorporate a lightweight auxiliary voice activity detection (VAD) network to enhance frame-level VAD robustness. These components synergistically improve EP real-time performance and reliability. Evaluated on the Switchboard dataset, our method reduces average EP latency by 23% and false trigger rate by 31% compared to a baseline delay-penalized approach. Both user-perceived latency and endpoint accuracy show significant improvement, demonstrating the effectiveness and practicality of our framework in realistic conversational scenarios.

📝 Abstract
ASR endpointing (EP) plays a major role in delivering a good user experience in products that support human or artificial agents in human-human and human-machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which can lead to errors or delays in EP. Inaccurate EP cuts the user off mid-utterance and returns an incomplete transcript, while delayed EP increases the perceived latency; both degrade the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining reliable frame-level speech activity detection from an auxiliary network. We apply the proposed methods to the Switchboard conversational speech corpus and evaluate them against a delay penalty method.
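The end-of-word token idea can be illustrated on the label side with a minimal sketch. It assumes word-level targets and a hypothetical token symbol `<eow>`; the abstract does not state the actual symbol or the subword tokenization the authors use, so the function name and the token string here are illustrative only.

```python
from typing import List

# Hypothetical symbol: the abstract does not name the actual token.
EOW = "<eow>"


def add_end_of_word_tokens(transcript: str, eow: str = EOW) -> List[str]:
    """Return the words of a transcript with an end-of-word marker after each one."""
    tokens: List[str] = []
    for word in transcript.split():
        tokens.append(word)
        tokens.append(eow)  # explicit word-termination label
    return tokens


if __name__ == "__main__":
    print(add_end_of_word_tokens("yeah i think so"))
    # ['yeah', '<eow>', 'i', '<eow>', 'think', '<eow>', 'so', '<eow>']
```

With targets of this shape, the model has an explicit label to emit once a word is complete, which is the signal the abstract pairs with a delay penalty to counter late emission.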
Problem

Research questions and friction points this paper is trying to address.

Address delayed emission in streaming ASR outputs
Improve endpoint detection accuracy for conversational speech
Reduce perceived latency in ASR endpointing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduce end-of-word token with delay penalty
Use auxiliary network for speech activity detection
Evaluate the methods on the Switchboard conversational speech corpus
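To make the role of frame-level speech activity detection concrete, here is a minimal sketch of a generic trailing-silence endpoint rule. This is not the paper's endpointing logic, which is not described in the abstract. It assumes that some auxiliary network supplies a per-frame speech probability; the function name, the `threshold` value, and the `min_trailing_silence` value are hypothetical.

```python
from typing import Optional, Sequence


def find_endpoint(
    speech_probs: Sequence[float],
    threshold: float = 0.5,          # assumed speech/non-speech decision threshold
    min_trailing_silence: int = 3,   # assumed number of non-speech frames required
) -> Optional[int]:
    """Return the frame index at which an endpoint is declared, or None.

    An endpoint is declared once speech has been observed and is followed
    by `min_trailing_silence` consecutive non-speech frames.
    """
    seen_speech = False
    silence_run = 0
    for t, p in enumerate(speech_probs):
        if p >= threshold:
            seen_speech = True
            silence_run = 0  # speech resumed, restart the silence count
        elif seen_speech:
            silence_run += 1
            if silence_run >= min_trailing_silence:
                return t
    return None


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.2, 0.1, 0.1]
    print(find_endpoint(probs))  # 7
```

The short pause at frame 3 does not trigger an endpoint because speech resumes at frame 4. In a rule of this kind, a larger `min_trailing_silence` reduces premature cut-offs at the cost of added latency, which is the trade-off the abstract describes.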
C. Anandh
Indian Institute of Technology Madras, India
Karthik Pandia Durai
Uniphore Software Systems, India
Jeena Prakash
Uniphore Software Systems, India
Manickavela Arumugam
Uniphore Software Systems, India
Kadri Hacioglu
Uniphore Software Systems, USA
S. Dubagunta
Uniphore Software Systems, India
Andreas Stolcke
Distinguished AI Scientist, Uniphore
Aravind Ganapathiraju
Uniphore Software Systems, India