AI Summary
This work addresses the challenge of generating accurate word-level timestamps in end-to-end automatic speech recognition (ASR) and automatic speech translation (AST) without relying on external alignment modules. We propose a unified modeling paradigm centered on a learnable special token, `<|timestamp|>`, that enables alignment-free timestamp prediction. For the first time, this approach realizes joint multilingual ASR/AST timestamp modeling within the Canary architecture, eliminating dependence on conventional forced alignment. The method combines knowledge distillation from the NeMo Forced Aligner, autoregressive sequence modeling, and multilingual joint training. Evaluated on four languages, it achieves timestamp precision and recall of 80–90%, ASR word-level timing errors of 20–120 ms, and AST timing errors of roughly 200 ms, with negligible degradation in word error rate. The core innovation is pioneering end-to-end tokenized word-level timestamp modeling: alignment-free, multitask-unified, high-accuracy, low-latency temporal prediction.
Abstract
We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks, such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new `<|timestamp|>` token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages and minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 ms.
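To make the tokenized-timestamp idea concrete, here is a minimal sketch of how a decoded hypothesis containing `<|timestamp|>` tokens could be turned into per-word (start, end) pairs. The serialization format shown (each word bracketed by `<|timestamp|>` tokens carrying times in seconds) is an assumption for illustration; the actual Canary token inventory and time encoding may differ.

```python
import re

# Hypothetical output format (assumed, not the confirmed Canary format):
# each word is preceded by a start-time token and followed by an end-time
# token, e.g.
#   <|timestamp|>0.12 hello <|timestamp|>0.45 <|timestamp|>0.50 world <|timestamp|>0.92
TS = re.compile(r"<\|timestamp\|>(\d+(?:\.\d+)?)")

def parse_word_timestamps(decoded: str):
    """Split a decoded string into (word, start_sec, end_sec) triples."""
    # re.split with a capturing group yields [text0, time1, text1, time2, ...]
    parts = TS.split(decoded)
    times = parts[1::2]                      # the captured time values
    texts = [t.strip() for t in parts[2::2]]  # text between timestamp tokens
    triples = []
    for i in range(0, len(times) - 1, 2):    # pair up start/end times
        word = texts[i]
        if word:
            triples.append((word, float(times[i]), float(times[i + 1])))
    return triples

hyp = ("<|timestamp|>0.12 hello <|timestamp|>0.45 "
       "<|timestamp|>0.50 world <|timestamp|>0.92")
print(parse_word_timestamps(hyp))
# → [('hello', 0.12, 0.45), ('world', 0.5, 0.92)]
```

Because the timestamps are ordinary tokens in the autoregressive output, no separate alignment pass is needed at inference time; a post-processing step like this one is enough to recover word timings.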