DRES: Benchmarking LLMs for Disfluency Removal

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Disfluent speech phenomena—such as fillers, repetitions, and self-corrections—significantly degrade downstream task performance in speech-driven systems. To address this, we introduce DRES, the first LLM-oriented benchmark for disfluency correction, constructed from manually refined Switchboard transcripts to isolate linguistic disfluencies from ASR errors and acoustic artifacts, yielding a controlled, reproducible text-level evaluation suite. Through systematic zero-shot, few-shot, and fine-tuned evaluations across diverse LLM scales and architectures, we find that segment-based processing markedly improves correction accuracy; reasoning-oriented models tend to over-delete fluent content; and fine-tuning often harms generalization. We further identify three novel, LLM-specific disfluency correction error patterns, establish a semantic upper bound on correction quality, and propose nine actionable deployment guidelines (R1–R9). This work provides both a methodological foundation and empirically grounded best practices for disfluency normalization in LLM-based speech understanding systems.

📝 Abstract
Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
Problem

Research questions and friction points this paper is trying to address.

Removing disfluencies like 'um' and 'uh' from speech transcripts
Benchmarking LLMs for disfluency removal across different architectures
Improving accuracy in speech-driven systems and conversational agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled text-level benchmark isolating disfluency removal
Systematically evaluating LLMs across scales and architectures
Fine-tuning achieves near state-of-the-art precision and recall
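The precision and recall figures above treat disfluency removal as token deletion: each source token is either kept or deleted, and the system's deletions are scored against gold disfluency annotations. A minimal sketch of that scoring scheme, assuming gold labels are available per token (this is an illustrative toy, not the DRES scoring code; the function name and data are hypothetical):

```python
# Toy token-level deletion precision/recall for disfluency removal.
# Assumes each source token has a gold label marking it disfluent or fluent;
# gold_disfluent and system_deleted are sets of token indices.

def deletion_prf(gold_disfluent, system_deleted):
    """Score a system's deletions against gold disfluent-token indices."""
    tp = len(gold_disfluent & system_deleted)  # correctly deleted tokens
    precision = tp / len(system_deleted) if system_deleted else 1.0
    recall = tp / len(gold_disfluent) if gold_disfluent else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "um I uh I want want a ticket" -> gold fluent text: "I want a ticket"
tokens = "um I uh I want want a ticket".split()
gold = {0, 2, 3, 5}      # um, uh, repeated "I", repeated "want"
system = {0, 2, 5, 6}    # model also over-deleted the fluent "a"
p, r, f = deletion_prf(gold, system)
# Over-deletion of fluent tokens (the behavior the paper attributes to
# reasoning-oriented models) lowers precision while recall can stay high.
```

Under this view, the "over-deletion" error mode shows up directly as a precision drop: every fluent token the model removes is a false positive in the deletion set.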