🤖 AI Summary
Conventional token-level metrics such as F1 capture aggregate performance in disfluency removal but fail to expose fine-grained failure patterns, particularly for linguistically distinct phenomena such as EDITED, INTJ, and PRN spans. Method: We propose Z-Scores, a linguistically grounded, span-level metric that provides a diagnostic framework for disfluency types, coupled with a deterministic alignment module that maps generated outputs back to the original transcript at the annotated-span level for precise, type-aware comparison. Contribution/Results: Z-Scores enables category-specific error analysis, uncovering systematic deficiencies that conventional metrics obscure. Experiments show that Z-Scores reliably surfaces latent weaknesses of large language models on INTJ and PRN disfluencies that standard evaluation misses, guiding targeted model refinement and yielding measurable performance gains.
📝 Abstract
Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level, linguistically grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- that yield measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
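To make the idea concrete, here is a minimal sketch of span-level, category-specific scoring. It is not the paper's implementation: the `z_scores` function name, the token-range span format, and the use of a longest-common-subsequence alignment (via Python's `difflib`) as the "deterministic alignment module" are all illustrative assumptions. The sketch reports, per disfluency category, the fraction of annotated tokens the model actually removed, which is the kind of per-type diagnostic that aggregate F1 hides.

```python
# Illustrative sketch only -- span format and scoring details are assumptions,
# not the paper's exact method.
from difflib import SequenceMatcher

def z_scores(transcript, spans, output):
    """transcript: disfluent token list; spans: (category, start, end)
    half-open token ranges; output: the model's cleaned token list.
    Returns per-category removal recall."""
    # Label each transcript token with its disfluency category (or None).
    labels = [None] * len(transcript)
    for cat, start, end in spans:
        for i in range(start, end):
            labels[i] = cat

    # Deterministic alignment: a transcript token counts as "kept" if an
    # LCS-style alignment matches it to a token in the model output.
    kept = set()
    matcher = SequenceMatcher(a=transcript, b=output, autojunk=False)
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))

    # Per-category recall: fraction of each category's tokens removed.
    scores = {}
    for cat in {c for c, _, _ in spans}:
        idx = [i for i, lab in enumerate(labels) if lab == cat]
        removed = sum(1 for i in idx if i not in kept)
        scores[cat] = removed / len(idx) if idx else 1.0
    return scores

# Toy example: "i" (EDITED repeat), "uh" (INTJ), "i mean" / "you know" (PRN).
scores = z_scores(
    "i uh i mean i want you know a coffee".split(),
    [("EDITED", 0, 1), ("INTJ", 1, 2), ("PRN", 2, 4), ("PRN", 6, 8)],
    "uh i want a coffee".split(),  # model left "uh" in place
)
# → {"EDITED": 1.0, "INTJ": 0.0, "PRN": 0.75}
```

Here the model's overall token F1 looks strong, yet the category breakdown immediately flags the retained interjection (INTJ recall 0.0) and a partially handled parenthetical -- exactly the failure mode a single aggregate score would mask.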