Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing punctuation restoration models underperform on spontaneous speech transcripts, which contain disfluencies such as false starts and backtracking, and this degrades downstream applications like machine translation, text-to-speech, and summarization. This work introduces Cadence, a general-purpose punctuation restoration model adapted from a pretrained large language model (LLM). It supports English and all 22 Indian languages, expanding Indian-language coverage from 14. Cadence handles both clean written text and highly spontaneous spoken transcripts as input. It surpasses the previous state of the art, while the authors' analysis across punctuation types and language families identifies persistent challenges under domain shift and with rare punctuation marks. The findings point to Cadence's practical value for low-resource NLP pipelines at scale.

📝 Abstract

Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks such as translation, text-to-speech, and summarization, where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence's practical value for low-resource NLP pipelines at scale.
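
An analysis across punctuation types is commonly reported as a separate F1 score for each mark, which makes weak performance on rare marks visible. The sketch below assumes one gold label and one predicted label per word; the label names (`O`, `COMMA`, `PERIOD`, `QUESTION`) and the function `per_mark_f1` are illustrative assumptions, not the paper's actual label set or evaluation code.

```python
from collections import Counter

def per_mark_f1(gold, pred, ignore="O"):
    """Return {mark: F1} over aligned label sequences, skipping the no-punctuation label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1          # correct mark
        else:
            if p != ignore:
                fp[p] += 1          # predicted a mark that is not in the gold
            if g != ignore:
                fn[g] += 1          # missed the gold mark
    scores = {}
    for mark in set(tp) | set(fp) | set(fn):
        predicted = tp[mark] + fp[mark]
        actual = tp[mark] + fn[mark]
        precision = tp[mark] / predicted if predicted else 0.0
        recall = tp[mark] / actual if actual else 0.0
        total = precision + recall
        scores[mark] = 2 * precision * recall / total if total else 0.0
    return scores

# Hypothetical labels: the rare question mark is mistaken for a period.
gold = ["O", "COMMA", "O", "PERIOD", "O", "QUESTION"]
pred = ["O", "COMMA", "O", "PERIOD", "O", "PERIOD"]
scores = per_mark_f1(gold, pred)
```

In this toy case the single confusion costs the question mark its only true positive and also lowers precision for the period, which is why per-mark scores are more informative than overall accuracy when some marks are rare.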

Problem

Research questions and friction points this paper is trying to address.

Restoring punctuation accurately in spontaneous speech transcripts

Improving downstream NLP tasks requiring sentence boundaries

Expanding multilingual support for punctuation restoration models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts a pretrained large language model to punctuation restoration

Handles written text and speech transcripts

Supports 23 languages including English
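
One common way to frame punctuation restoration is as per-word labeling: a model predicts, for each word, which mark (if any) follows it, and the marks are then reattached to the text. The summary above does not state that Cadence works this way, so the following is only a sketch of that general framing; the label inventory `PUNCT`, the function `apply_labels`, and the hand-written labels standing in for model output are all hypothetical.

```python
# Hypothetical label inventory; the paper's actual set of marks is not given here.
PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

def apply_labels(words, labels):
    """Append the mark named by each label to its word and rejoin the text."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return " ".join(word + PUNCT[label] for word, label in zip(words, labels))

# A disfluent, unpunctuated transcript with a repetition ("i i").
words = "so i i think we should leave now right".split()
# Labels written by hand for illustration, standing in for model predictions.
labels = ["COMMA", "O", "O", "O", "O", "O", "O", "COMMA", "QUESTION"]

restored = apply_labels(words, labels)
print(restored)  # so, i i think we should leave now, right?
```

Note that the repetition stays in the output: in this framing the task adds punctuation and does not remove disfluencies.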

🔎 Similar Papers

Punctuation Restoration Improves Structure Understanding without Supervision