Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition

📅 2025-01-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the inference-speed bottleneck caused by the frame-by-frame search over CTC posteriors in CTC-WFST decoding, this paper proposes a windowed decoding mechanism that exploits the spike property of CTC outputs. The authors conjecture that frames adjacent to non-blank spikes carry semantic information useful for decoding, and design Spike Window Decoding (SWD), which makes the number of frames passed to the WFST decoder scale linearly with the number of spikes rather than with the total number of frames. SWD integrates CTC output analysis, WFST-based modeling, dynamic spike-window construction, and streaming optimizations. Evaluated on AISHELL-1 and a large-scale in-house dataset, SWD is reported to achieve state-of-the-art recognition accuracy while significantly accelerating inference. The authors present it as a pioneering approach to integrating CTC output with WFST decoding.

📝 Abstract

Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models: it can implicitly fuse language models within static graphs, which ensures robust recognition and facilitates rapid error correction. However, WFST decoding requires a frame-by-frame autoregressive search over CTC posterior probabilities, which significantly hampers inference speed. In this work, we thoroughly investigate the spike property of CTC outputs and further propose the conjecture that frames adjacent to non-blank spikes carry semantic information beneficial to the model. Building on this, we propose the Spike Window Decoding algorithm, which greatly improves inference speed by making the number of frames decoded in WFST linearly related to the number of spiking frames in the CTC output, while preserving recognition performance. Our method achieves SOTA recognition accuracy with significantly accelerated decoding speed, as shown on both AISHELL-1 and large-scale in-house datasets, establishing a pioneering approach for integrating CTC output with WFST.
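The frame-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the symmetric window `radius`, the choice of index 0 as the CTC blank, and the rule "a spike is a frame whose most probable label is non-blank" are all assumptions made for illustration. The sketch only shows how keeping each spike frame plus a few neighbors ties the number of frames sent to a WFST decoder to the number of spikes.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: label index 0 is the CTC blank symbol


def spike_frames(posteriors: Sequence[Sequence[float]],
                 blank_id: int = BLANK_ID) -> List[int]:
    """Return indices of frames whose most probable label is not blank."""
    spikes = []
    for t, frame in enumerate(posteriors):
        best_label = max(range(len(frame)), key=frame.__getitem__)
        if best_label != blank_id:
            spikes.append(t)
    return spikes


def spike_window_frames(posteriors: Sequence[Sequence[float]],
                        radius: int = 1,
                        blank_id: int = BLANK_ID) -> List[int]:
    """Keep every spike frame plus `radius` neighbors on each side.

    Hypothetical selection rule: overlapping windows are merged, and
    windows are clipped at the start and end of the utterance.
    """
    num_frames = len(posteriors)
    keep = set()
    for t in spike_frames(posteriors, blank_id):
        first = max(0, t - radius)
        last = min(num_frames - 1, t + radius)
        keep.update(range(first, last + 1))
    return sorted(keep)


# Toy posteriors: 10 frames, 3 labels (blank, label 1, label 2).
example = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],  # spike on label 1
    [0.90, 0.05, 0.05],
    [0.95, 0.03, 0.02],
    [0.90, 0.05, 0.05],
    [0.20, 0.10, 0.70],  # spike on label 2
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
]

selected = spike_window_frames(example, radius=1)
print(selected)  # [1, 2, 3, 5, 6, 7]
```

In this toy case only 6 of 10 frames would be passed on for decoding. Under the assumed rule, the number of selected frames is at most (2 × radius + 1) times the number of spikes, which is one way to read the claim that decoding cost becomes linear in the spike count.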

Problem

Research questions and friction points this paper is trying to address.

WFST

CTC

Speech Recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spike Window Decoding

Speech Recognition

CTC-WFST Integration
