🤖 AI Summary
Existing punctuation restoration models underperform on spontaneous speech transcripts, which are characterized by disfluencies such as repetitions and self-corrections, and this degrades downstream applications like machine translation, text-to-speech, and summarization. This work introduces Cadence, the first general-purpose punctuation restoration model for multilingual spoken text, and the first to adapt pretrained large language models (LLMs) to this task, supporting English and 22 Indian languages. Cadence jointly models multilingual linguistic patterns and disfluency features, and natively supports both plain text and ASR output as input. On cross-lingual and cross-domain benchmarks, it substantially outperforms state-of-the-art methods, especially for low-resource languages and rare punctuation marks (e.g., question marks and em dashes). Deployed in a large-scale, low-resource NLP pipeline, Cadence demonstrates strong real-world efficacy and generalization capability.
📝 Abstract
Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks such as translation, text-to-speech, and summarization, where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence's practical value for low-resource NLP pipelines at scale.