🤖 AI Summary
Conventional token-level metrics such as F1 capture aggregate performance in disfluency removal but fail to expose fine-grained failure patterns, particularly for linguistically distinct phenomena such as EDITED, INTJ, and PRN spans. Method: We propose Z-Scores, a linguistically grounded, span-level metric that provides a diagnostic framework for disfluency types, coupled with a deterministic alignment module that maps generated outputs back to the original transcript at the annotated-span level for precise, type-aware comparison. Contribution/Results: Z-Scores enables category-specific error analysis, uncovering systematic deficiencies that conventional metrics obscure. Experiments show that Z-Scores reliably surfaces latent weaknesses of large language models on INTJ and PRN disfluencies that standard evaluation misses, guiding targeted model refinement and yielding measurable performance gains.
📝 Abstract
Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level, linguistically grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- that yield measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
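To make the idea concrete, here is a minimal sketch of span-level, category-specific scoring. It is not the paper's implementation: the `z_scores` function name, the token-range span format, and the use of a longest-common-subsequence alignment (via Python's `difflib`) as the "deterministic alignment module" are all illustrative assumptions. The sketch reports, per disfluency category, the fraction of annotated tokens the model actually removed, which is the kind of per-type diagnostic that aggregate F1 hides.

```python
# Illustrative sketch only -- span format and scoring details are assumptions,
# not the paper's exact method.
from difflib import SequenceMatcher

def z_scores(transcript, spans, output):
    """transcript: disfluent token list; spans: (category, start, end)
    half-open token ranges; output: the model's cleaned token list.
    Returns per-category removal recall."""
    # Label each transcript token with its disfluency category (or None).
    labels = [None] * len(transcript)
    for cat, start, end in spans:
        for i in range(start, end):
            labels[i] = cat

    # Deterministic alignment: a transcript token counts as "kept" if an
    # LCS-style alignment matches it to a token in the model output.
    kept = set()
    matcher = SequenceMatcher(a=transcript, b=output, autojunk=False)
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))

    # Per-category recall: fraction of each category's tokens removed.
    scores = {}
    for cat in {c for c, _, _ in spans}:
        idx = [i for i, lab in enumerate(labels) if lab == cat]
        removed = sum(1 for i in idx if i not in kept)
        scores[cat] = removed / len(idx) if idx else 1.0
    return scores

# Toy example: "i" (EDITED repeat), "uh" (INTJ), "i mean" / "you know" (PRN).
scores = z_scores(
    "i uh i mean i want you know a coffee".split(),
    [("EDITED", 0, 1), ("INTJ", 1, 2), ("PRN", 2, 4), ("PRN", 6, 8)],
    "uh i want a coffee".split(),  # model left "uh" in place
)
# → {"EDITED": 1.0, "INTJ": 0.0, "PRN": 0.75}
```

Here the model's overall token F1 looks strong, yet the category breakdown immediately flags the retained interjection (INTJ recall 0.0) and a partially handled parenthetical -- exactly the failure mode a single aggregate score would mask.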