🤖 AI Summary
To address the inaccurate alignments and peaky behavior of end-to-end ASR systems, particularly CTC-based ones, this paper introduces one-dimensional optimal transport (OT) into seq2seq speech-text alignment. It defines a pseudo-metric over sequences, the Sequence Optimal Transport Distance (SOTD), and builds on it the Optimal Temporal Transport Classification (OTTC) loss, so that alignment learning and ASR modeling are optimized jointly. The framework is differentiable and lets the model learn a single alignment end to end. Experiments on TIMIT, AMI, and LibriSpeech show considerably better alignment performance than CTC, at some cost in ASR performance. The method is therefore aimed at tasks that depend on precise temporal alignment, such as medical speech analysis and language learning tools. The authors present the work as a foundation for further research on seq2seq alignment.
📝 Abstract
Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose the Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance, though with a trade-off in ASR performance when compared to CTC. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community.
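To give some intuition for why one-dimensional optimal transport leads to a single, order-preserving alignment, the sketch below computes the monotone coupling between two ordered histograms (the north-west corner rule), which is the optimal plan for ordered points on a line under a convex ground cost. It is not the paper's SOTD or OTTC loss: the function name `monotone_coupling`, the frame weights, and the token weights are all hypothetical and chosen only for illustration.

```python
import numpy as np

def monotone_coupling(a, b):
    """Monotone transport plan between two ordered histograms.

    a: mass on each source position (e.g. audio frames), summing to 1.
    b: mass on each target position (e.g. tokens), summing to 1.
    Returns a len(a) x len(b) matrix whose rows sum to a and columns to b.
    """
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    plan = np.zeros((len(a), len(b)))
    i = j = 0
    while i < len(a) and j < len(b):
        moved = min(a[i], b[j])   # ship as much mass as both sides allow
        plan[i, j] = moved
        a[i] -= moved
        b[j] -= moved
        if a[i] <= 1e-12:         # source position exhausted: next frame
            i += 1
        else:                     # target position full: next token
            j += 1
    return plan

# Toy example (assumed values): 4 frames with equal mass, 3 tokens.
frame_weights = np.full(4, 0.25)
token_weights = np.array([0.5, 0.25, 0.25])
plan = monotone_coupling(frame_weights, token_weights)
alignment = plan.argmax(axis=1)   # token index assigned to each frame
print(plan)
print(alignment)
```

Because mass only ever moves forward in both sequences, the resulting frame-to-token assignment never goes backwards in time. In this toy setup, changing the weights changes how many frames each token covers.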