🤖 AI Summary
Disfluent speech phenomena—such as fillers, repetitions, and self-corrections—significantly degrade downstream task performance in speech-driven systems. To address this, we introduce DRES, the first LLM-oriented benchmark for disfluency correction, constructed from manually refined Switchboard transcripts to isolate linguistic disfluencies from ASR errors and acoustic artifacts, yielding a controlled, reproducible text-level evaluation suite. Through systematic zero-shot, few-shot, and fine-tuned evaluations across diverse LLM scales and architectures, we find that segment-based processing markedly improves correction accuracy; reasoning-oriented models tend to over-delete fluent content; and fine-tuning often harms generalization. We further identify three novel, LLM-specific disfluency correction error patterns, establish a semantic upper bound on correction quality, and propose nine actionable deployment guidelines (R1–R9). This work provides both a methodological foundation and empirically grounded best practices for disfluency normalization in LLM-based speech understanding systems.
📝 Abstract
Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.