IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of weighted finite-state transducer (WFST) decoding for connectionist temporal classification (CTC) models, which relies on a frame-by-frame autoregressive search over CTC posterior probabilities. The study is the first to explicitly distinguish the structural roles of blank frames, which carry positional information, and non-blank frames, which convey semantic content, in CTC outputs. Building on this insight, the authors propose two efficient decoding algorithms, Insert-Only-One (IOO) and Keep-Only-One (KOO), that significantly accelerate WFST-based decoding while preserving end-to-end speech recognition accuracy. Experimental results on AISHELL-1, LibriSpeech, and a large-scale internal dataset demonstrate that the proposed approach maintains state-of-the-art recognition performance while substantially reducing decoding latency.

📝 Abstract
End-to-end automatic speech recognition has become the dominant paradigm in both academia and industry. To enhance recognition performance, the Weighted Finite-State Transducer (WFST) is widely adopted to integrate acoustic and language models through static graph composition, providing robust decoding and effective error correction. However, WFST decoding relies on a frame-by-frame autoregressive search over CTC posterior probabilities, which severely limits inference efficiency. Motivated by the goal of establishing a more principled compatibility between WFST decoding and CTC modeling, we systematically study the two fundamental components of CTC outputs, namely blank and non-blank frames, and identify a key insight: blank frames primarily encode positional information, while non-blank frames carry semantic content. Building on this observation, we introduce Keep-Only-One and Insert-Only-One, two decoding algorithms that explicitly exploit the structural roles of blank and non-blank frames to achieve significantly faster WFST-based inference without compromising recognition accuracy. Experiments on large-scale in-house, AISHELL-1, and LibriSpeech datasets demonstrate state-of-the-art recognition accuracy with substantially reduced decoding latency, enabling truly efficient and high-performance WFST decoding in modern speech recognition systems.
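The abstract does not spell out the two algorithms, but their names and the blank/non-blank insight suggest frame-reduction schemes applied to the CTC argmax sequence before WFST search. The sketch below is a hypothetical interpretation, not the paper's actual method: it assumes `keep_only_one` collapses each run of consecutive blank frames to a single blank (keeping positional separators), and `insert_only_one` drops blanks entirely except where one must be reinserted to keep repeated labels distinct under CTC collapsing. The blank index `0` and the frame representation are also assumptions.

```python
BLANK = 0  # assumed blank index in the CTC vocabulary

def keep_only_one(frames):
    """Hypothetical KOO: collapse each run of consecutive blank
    frames to a single blank, leaving non-blank frames untouched.
    The CTC-collapsed transcription is unchanged."""
    out = []
    for f in frames:
        if f == BLANK and out and out[-1] == BLANK:
            continue  # skip repeated blanks in the same run
        out.append(f)
    return out

def insert_only_one(frames):
    """Hypothetical IOO: collapse runs of identical labels, drop all
    blanks, and reinsert a single blank only between adjacent repeats
    of the same label so CTC collapsing stays unambiguous."""
    collapsed = [f for i, f in enumerate(frames)
                 if i == 0 or f != frames[i - 1]]
    out = []
    for f in collapsed:
        if f == BLANK:
            continue
        if out and out[-1] == f:
            out.append(BLANK)  # one blank separates a repeated label
        out.append(f)
    return out

# Toy argmax frame sequence (blank=0); both reductions preserve the
# CTC-collapsed output "5 7 7" while shortening the search.
frames = [0, 0, 5, 5, 0, 0, 0, 7, 0, 7, 7, 0]
print(keep_only_one(frames))    # [0, 5, 5, 0, 7, 0, 7, 7, 0]
print(insert_only_one(frames))  # [5, 7, 0, 7]
```

Either reduced sequence can then be fed to the frame-synchronous WFST search in place of the full posterior stream, which is where the claimed latency savings would come from in this reading.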
Problem

Research questions and friction points this paper is trying to address.

WFST
end-to-end ASR
CTC decoding
inference efficiency
autoregressive search
Innovation

Methods, ideas, or system contributions that make the work stand out.

WFST decoding
CTC modeling
blank/non-blank frame analysis
efficient inference
end-to-end ASR