🤖 AI Summary
This paper addresses critical post-processing challenges in commercial ASR output, namely missing punctuation, inconsistent capitalization, and unnormalized numerals and abbreviations, by proposing an end-to-end multi-objective text formatting framework. Methodologically, it departs from rule-based and hybrid approaches, introducing a lightweight, fully neural two-stage architecture: (1) a multi-task token classifier that jointly predicts punctuation, capitalization, and inverse text normalization (ITN) labels; and (2) a sequence-to-sequence model for fine-grained correction. Both stages are jointly trained and integrated into the Universal-2 ASR system. Experiments demonstrate significant improvements over strong baselines on objective metrics (e.g., +5.1 F1 points, −40% inference latency) and in subjective human evaluations. The framework further exhibits superior cross-domain generalization and enhanced hallucination suppression.
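To make the first stage concrete, here is a minimal sketch of what "per-token formatting labels" could look like and how they render into formatted text. The label inventory (`PERIOD`, `COMMA`, `CAP_FIRST`, `B-CARDINAL`, etc.) is an illustrative assumption, not the paper's actual tag set, and the classifier itself is elided; only the tag-application step is shown.

```python
# Stage 1 of the two-stage design described above: a token classifier
# assigns each word a punctuation, casing, and ITN label. Applying those
# labels deterministically yields punctuated, truecased text, with ITN
# spans left for stage 2. The tag names here are assumed for illustration.
from dataclasses import dataclass

@dataclass
class TokenLabels:
    punct: str  # "O", "COMMA", or "PERIOD": mark appended after the token
    case: str   # "LOWER", "CAP_FIRST", or "UPPER": casing of the token
    itn: str    # "O", "B-CARDINAL", "I-CARDINAL": span tags for stage 2

def apply_labels(tokens, labels):
    """Render stage-1 labels into formatted text (ITN spans untouched)."""
    out = []
    for tok, lab in zip(tokens, labels):
        if lab.case == "CAP_FIRST":
            tok = tok.capitalize()
        elif lab.case == "UPPER":
            tok = tok.upper()
        if lab.punct == "COMMA":
            tok += ","
        elif lab.punct == "PERIOD":
            tok += "."
        out.append(tok)
    return " ".join(out)

tokens = ["hello", "nasa", "launched", "two", "rockets"]
labels = [
    TokenLabels("COMMA", "CAP_FIRST", "O"),
    TokenLabels("O", "UPPER", "O"),
    TokenLabels("O", "LOWER", "O"),
    TokenLabels("O", "LOWER", "B-CARDINAL"),
    TokenLabels("PERIOD", "LOWER", "O"),
]
print(apply_labels(tokens, labels))  # Hello, NASA launched two rockets.
```

Framing all three objectives as token classification is what lets one encoder serve punctuation, truecasing, and ITN detection jointly, which is the source of the latency savings the summary reports.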
📝 Abstract
This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.
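The second stage can be sketched by showing its interface: it rewrites only the spans stage 1 tagged for ITN and copies everything else verbatim, which is one way the two-stage design limits hallucination. The paper uses a seq2seq model for this rewriting; a table-lookup cardinal converter stands in for it below, and the BIO-style span tags are an assumed encoding.

```python
# Toy stand-in for stage 2 (the paper's actual rewriter is a seq2seq
# model): only tokens inside stage-1 ITN spans are converted; all other
# tokens pass through unchanged.

# Minimal lookup covering tens + units, e.g. "twenty five" -> 25.
WORD_TO_DIGIT = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
}

def words_to_number(span_tokens):
    """Convert a tagged cardinal span to digits (tens + units only)."""
    return str(sum(WORD_TO_DIGIT[t] for t in span_tokens))

def normalize(tokens, itn_tags):
    """Apply the span rewriter to B-/I-CARDINAL spans, copy the rest."""
    out, span = [], []
    for tok, tag in zip(tokens, itn_tags):
        if tag == "B-CARDINAL":
            if span:
                out.append(words_to_number(span))
            span = [tok]
        elif tag == "I-CARDINAL":
            span.append(tok)
        else:
            if span:
                out.append(words_to_number(span))
                span = []
            out.append(tok)
    if span:
        out.append(words_to_number(span))
    return " ".join(out)

tokens = ["it", "costs", "twenty", "five", "dollars"]
tags = ["O", "O", "B-CARDINAL", "I-CARDINAL", "O"]
print(normalize(tokens, tags))  # it costs 25 dollars
```

Restricting the rewriter's input to short tagged spans, rather than feeding whole utterances through a generative model, is the design choice the abstract credits with reducing both computational cost and hallucinations.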