AI Summary
This work addresses the challenge of generating accurate word-level timestamps in end-to-end automatic speech recognition (ASR) and automatic speech translation (AST) without relying on external alignment modules. We propose a unified modeling paradigm centered on a learnable special token, `<|timestamp|>`, that enables alignment-free timestamp prediction. For the first time, this approach realizes joint multilingual ASR/AST timestamp modeling within the Canary architecture, eliminating dependence on conventional forced alignment. The method combines knowledge distillation from the NeMo Forced Aligner, autoregressive sequence modeling, and multilingual joint training. Evaluated on four languages, it achieves timestamp precision and recall of 80–90%, ASR word-level timing errors of 20–120 ms, and AST timing errors of roughly 200 ms, with negligible degradation in word error rate. The core innovation is pioneering end-to-end tokenized word-level timestamp modeling: alignment-free, multitask-unified, high-accuracy, low-latency temporal prediction.
Abstract
We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks, such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new `<|timestamp|>` token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages and minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 ms.
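To make the tokenized-timestamp idea concrete, here is a minimal sketch of how a decoded hypothesis containing `<|timestamp|>` tokens could be turned into per-word (start, end) pairs. The serialization format shown (each word bracketed by `<|timestamp|>` tokens carrying times in seconds) is an assumption for illustration; the actual Canary token inventory and time encoding may differ.

```python
import re

# Hypothetical output format (assumed, not the confirmed Canary format):
# each word is preceded by a start-time token and followed by an end-time
# token, e.g.
#   <|timestamp|>0.12 hello <|timestamp|>0.45 <|timestamp|>0.50 world <|timestamp|>0.92
TS = re.compile(r"<\|timestamp\|>(\d+(?:\.\d+)?)")

def parse_word_timestamps(decoded: str):
    """Split a decoded string into (word, start_sec, end_sec) triples."""
    # re.split with a capturing group yields [text0, time1, text1, time2, ...]
    parts = TS.split(decoded)
    times = parts[1::2]                      # the captured time values
    texts = [t.strip() for t in parts[2::2]]  # text between timestamp tokens
    triples = []
    for i in range(0, len(times) - 1, 2):    # pair up start/end times
        word = texts[i]
        if word:
            triples.append((word, float(times[i]), float(times[i + 1])))
    return triples

hyp = ("<|timestamp|>0.12 hello <|timestamp|>0.45 "
       "<|timestamp|>0.50 world <|timestamp|>0.92")
print(parse_word_timestamps(hyp))
# → [('hello', 0.12, 0.45), ('world', 0.5, 0.92)]
```

Because the timestamps are ordinary tokens in the autoregressive output, no separate alignment pass is needed at inference time; a post-processing step like this one is enough to recover word timings.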