🤖 AI Summary
Disfluent speech phenomena—such as fillers, repetitions, and self-corrections—significantly degrade downstream task performance in speech-driven systems. To address this, we introduce DRES, the first LLM-oriented benchmark for disfluency correction, constructed from manually refined Switchboard transcripts to isolate linguistic disfluencies from ASR errors and acoustic artifacts, yielding a controlled, reproducible text-level evaluation suite. Through systematic zero-shot, few-shot, and fine-tuned evaluations across diverse LLM scales and architectures, we find that segment-based processing markedly improves correction accuracy; reasoning-oriented models tend to over-delete fluent content; and fine-tuning often harms generalization. We further identify three novel, LLM-specific disfluency correction error patterns, establish a semantic upper bound on correction quality, and propose nine actionable deployment guidelines (R1–R9). This work provides both a methodological foundation and empirically grounded best practices for disfluency normalization in LLM-based speech understanding systems.
📝 Abstract
Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.