🤖 AI Summary
Traditional dysfluency detection methods are largely confined to coarse-grained classification, lacking the fine-grained clinical insight required for speech therapy; audio-only models further suffer from high false-positive rates because they ignore textual context. This paper proposes the first zero-shot joint decoding framework, Dysfluent-WFST, that simultaneously performs phoneme transcription and dysfluency event detection (e.g., repetitions, pauses, fillers) without fine-tuning pretrained encoders (e.g., WavLM) or requiring additional annotations. The core innovation is a lightweight, interpretable decoding graph built on weighted finite-state transducers (WFSTs) that explicitly models articulatory anomalies. Evaluated on both simulated and real-world disordered-speech datasets, the method sets a new state of the art, with a 12.3% reduction in phonetic error rate and a 9.7% improvement in dysfluency detection F1-score, significantly outperforming both classification-based and end-to-end baselines.
📝 Abstract
Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
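The abstract describes decoding that jointly transcribes phonemes and detects dysfluencies by explicitly modeling pronunciation behavior in the decoding graph. As a rough illustration only (this is not the paper's actual WFST; the function, arc labels, and penalty are hypothetical), a toy Viterbi search over reference-phoneme states with "repeat" and "pause" self-loop arcs shows how one graph can align phonemes and flag dysfluency events at the same time:

```python
import math

def dysfluency_viterbi(ref, posteriors, sil_idx, stay_penalty=0.5):
    """Toy Viterbi over a WFST-like graph whose states are positions in the
    reference phoneme sequence `ref`. Arcs per frame:
      - "advance": move to the next reference phoneme (fluent speech),
      - "repeat":  re-emit the current phoneme (repetition/prolongation),
      - "pause":   emit silence while staying in place (block/pause).
    `posteriors[t][p]` is P(phoneme p | frame t) from an upstream encoder.
    Returns one arc label per frame along the best path."""
    T, N = len(posteriors), len(ref)
    NEG = float("-inf")
    dp = [NEG] * N                      # dp[i]: best log-score at position i
    back = [[None] * N for _ in range(T)]
    dp[0] = math.log(posteriors[0][ref[0]] + 1e-12)
    back[0][0] = (0, "advance")
    for t in range(1, T):
        ndp = [NEG] * N
        for i in range(N):
            cands = []
            if dp[i] > NEG:
                # stay-arcs carry a penalty so fluent advancing is preferred
                cands.append((dp[i] + math.log(posteriors[t][ref[i]] + 1e-12)
                              - stay_penalty, i, "repeat"))
                cands.append((dp[i] + math.log(posteriors[t][sil_idx] + 1e-12)
                              - stay_penalty, i, "pause"))
            if i > 0 and dp[i - 1] > NEG:
                cands.append((dp[i - 1] + math.log(posteriors[t][ref[i]] + 1e-12),
                              i - 1, "advance"))
            if cands:
                score, prev_i, label = max(cands)
                ndp[i] = score
                back[t][i] = (prev_i, label)
        dp = ndp
    # backtrace from the final reference position
    labels, i = [], N - 1
    for t in range(T - 1, -1, -1):
        prev_i, label = back[t][i]
        labels.append(label)
        i = prev_i
    return list(reversed(labels))

# Toy example: 3 frames, phoneme inventory {0, 1, 2=silence}, reference /0 1/.
# The second frame re-emits phoneme 0, so the best path takes a "repeat" arc.
labels = dysfluency_viterbi(
    ref=[0, 1], sil_idx=2,
    posteriors=[[0.9, 0.05, 0.05], [0.9, 0.05, 0.05], [0.05, 0.9, 0.05]],
)
print(labels)  # → ['advance', 'repeat', 'advance']
```

Because dysfluency events fall out of the chosen arcs rather than a separate classifier, no extra training or annotation is needed, which mirrors the zero-shot, interpretable design the abstract emphasizes.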