Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional token-level evaluation metrics (e.g., F1) fail to expose fine-grained failure patterns of disfluency-removal models, particularly for linguistically distinct phenomena such as EDITED, INTJ, and PRN. Method: Z-Scores, a linguistically grounded span-level metric, provides the first linguistics-driven diagnostic framework for disfluency types, coupled with a deterministic alignment module that maps generated outputs back to the original transcripts at the annotated span level. Contribution/Results: Z-Scores enables category-specific error analysis, uncovering systematic deficiencies that conventional metrics obscure. Experiments show that Z-Scores reliably surfaces latent weaknesses of large language models on INTJ and PRN disfluencies, previously undetected by standard evaluation, thereby guiding targeted model refinement and yielding measurable performance gains.

📝 Abstract
Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level, linguistically grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
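The abstract's deterministic alignment module maps a model's fluent output back onto the disfluent transcript to recover which token spans were deleted. The paper's exact algorithm is not given here; as a rough illustration of the idea only, a deterministic token-level alignment can be sketched with Python's standard `difflib` (the function name and interface are assumptions, not the paper's API):

```python
import difflib

def removed_spans(disfluent_tokens, fluent_tokens):
    """Align model output to the original transcript and return the
    (start, end) token spans the model deleted.

    Illustrative sketch only: the paper's alignment module may use a
    different, purpose-built algorithm.
    """
    sm = difflib.SequenceMatcher(
        a=disfluent_tokens, b=fluent_tokens, autojunk=False
    )
    spans = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "delete":  # tokens present in the transcript but dropped
            spans.append((i1, i2))
    return spans
```

Recovered spans can then be compared against gold disfluency annotations (EDITED, INTJ, PRN) for the category-specific scoring the paper describes.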
Problem

Research questions and friction points this paper is trying to address.

Traditional word-based metrics fail to reveal why disfluency removal models succeed or fail
Existing metrics cannot expose systematic weaknesses across different disfluency categories
Current evaluation methods lack category-specific diagnostics for targeted model improvements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing Z-Scores for span-level disfluency evaluation
Using deterministic alignment for robust text-transcript mapping
Providing category-specific diagnostics to identify failure modes
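Putting the innovations above together, the core idea of category-specific span-level diagnostics can be sketched as per-type precision/recall/F1 over aligned spans. This is a minimal sketch under the assumption that Z-Scores reduce to span-set comparison per category; the paper's actual formulation may differ:

```python
def z_scores(gold_spans, removed_spans):
    """Per-category span-level precision/recall/F1.

    Both arguments are lists of (start, end, category) tuples, where
    category is one of "EDITED", "INTJ", "PRN". Illustrative sketch,
    not the paper's exact definition of Z-Scores.
    """
    cats = {c for _, _, c in gold_spans} | {c for _, _, c in removed_spans}
    scores = {}
    for cat in cats:
        gold = {(s, e) for s, e, c in gold_spans if c == cat}
        pred = {(s, e) for s, e, c in removed_spans if c == cat}
        tp = len(gold & pred)  # exact span matches within this category
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[cat] = {"precision": p, "recall": r, "f1": f1}
    return scores
```

Because scores are kept separate per category, a model that aces EDITED repairs but silently leaves interjections (INTJ) in place scores poorly on exactly that category, which is the failure mode aggregate F1 hides.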