Word Level Timestamp Generation for Automatic Speech Recognition and Translation

📅 2025-05-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of generating accurate word-level timestamps in end-to-end automatic speech recognition (ASR) and speech-to-text translation (AST) without relying on external alignment. The authors propose a unified modeling paradigm centered on a learnable special token, `<|timestamp|>`, enabling alignment-free timestamp prediction. For the first time, this approach realizes multilingual ASR/AST joint modeling within the Canary architecture, eliminating dependence on conventional forced alignment. The method integrates knowledge distillation from the NeMo Forced Aligner, autoregressive sequence modeling, and multilingual joint training. Evaluated on four languages, it achieves timestamp precision and recall of 80–90%, ASR word-level timing errors of 20–120 ms, and AST errors of ~200 ms, with negligible degradation in word error rate. The core innovation is pioneering end-to-end tokenized word-level timestamp modeling: alignment-free, multitask-unified, high-accuracy, low-latency temporal prediction.
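The summary describes training targets in which word tokens are interleaved with timestamp tokens derived from teacher alignments. A minimal sketch of one way such targets could be constructed is below; the `<|N|>` time-token format and the 80 ms frame quantization are illustrative assumptions, not the paper's exact tokenization scheme.

```python
def build_target_sequence(word_alignments, frame_ms=80):
    """Interleave word tokens with quantized start/end time tokens.

    word_alignments: list of (word, start_sec, end_sec) tuples, e.g. as
    produced by a forced aligner such as NFA acting as the teacher.
    The <|N|> token format and frame granularity are assumptions made
    for illustration, not the paper's exact scheme.
    """
    tokens = []
    for word, start, end in word_alignments:
        s = round(start * 1000 / frame_ms)  # quantize start time to a frame index
        e = round(end * 1000 / frame_ms)    # quantize end time to a frame index
        tokens.extend([f"<|{s}|>", word, f"<|{e}|>"])
    return tokens


# Example: two words aligned by the teacher become a decoder target
# in which every word is bracketed by its start and end time tokens.
print(build_target_sequence([("hello", 0.0, 0.4), ("world", 0.48, 0.88)]))
```

At inference time the decoder would emit these time tokens autoregressively alongside the text, which is what makes the approach alignment-free: no separate forced-alignment pass is needed.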

๐Ÿ“ Abstract
We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new `<|timestamp|>` token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, with minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 milliseconds.
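The 20–120 ms ASR and ~200 ms AST figures quoted above are word-level timing errors measured against reference alignments. A simple sketch of such a metric is shown below; the exact evaluation protocol (word matching, handling of insertions/deletions) is an assumption here, simplified to the case where predicted and reference word sequences match.

```python
def timestamp_errors_ms(pred, ref):
    """Mean absolute start/end timing error in milliseconds.

    pred, ref: lists of (word, start_sec, end_sec) tuples. For simplicity
    this sketch assumes identical word sequences (no ASR errors); a real
    evaluation would first align hypothesis words to reference words.
    """
    assert [w for w, _, _ in pred] == [w for w, _, _ in ref]
    start_errs = [abs(p[1] - r[1]) * 1000 for p, r in zip(pred, ref)]
    end_errs = [abs(p[2] - r[2]) * 1000 for p, r in zip(pred, ref)]
    return sum(start_errs) / len(start_errs), sum(end_errs) / len(end_errs)


# Example: a prediction 20 ms late on the start and 40 ms early on the end.
start_err, end_err = timestamp_errors_ms(
    [("hi", 0.0, 0.5)], [("hi", 0.02, 0.46)]
)
```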
Problem

Research questions and friction points this paper is trying to address.

Predicting word-level timestamps for speech recognition
Eliminating need for external alignment mechanisms
Extending timestamp prediction to speech translation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-driven word-level timestamp prediction in Canary model
Uses NeMo Forced Aligner as teacher model
Introduces `<|timestamp|>` token for direct prediction