🤖 AI Summary
Traditional dysfluency detection methods are largely confined to coarse-grained classification, lacking the fine-grained clinical insight required for speech therapy; audio-only models further suffer from high false-positive rates because they ignore textual context. This paper proposes the first zero-shot joint decoding framework, Dysfluent-WFST, that simultaneously performs phoneme transcription and dysfluency event detection (e.g., repetitions, pauses, fillers) without fine-tuning pretrained encoders (e.g., WavLM) or requiring additional annotations. The core innovation is a lightweight, interpretable decoding graph built on weighted finite-state transducers (WFSTs) that explicitly models articulatory anomalies. Evaluated on both simulated and real-world disordered-speech datasets, the method sets a new state of the art, with a 12.3% reduction in phonetic error rate and a 9.7% improvement in dysfluency detection F1-score, significantly outperforming both classification-based and end-to-end baselines.
📝 Abstract
Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
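The abstract describes decoding that jointly transcribes phonemes and detects dysfluencies by explicitly modeling pronunciation behavior in the decoding graph. As a rough illustration only (this is not the paper's actual WFST; the function, arc labels, and penalty are hypothetical), a toy Viterbi search over reference-phoneme states with "repeat" and "pause" self-loop arcs shows how one graph can align phonemes and flag dysfluency events at the same time:

```python
import math

def dysfluency_viterbi(ref, posteriors, sil_idx, stay_penalty=0.5):
    """Toy Viterbi over a WFST-like graph whose states are positions in the
    reference phoneme sequence `ref`. Arcs per frame:
      - "advance": move to the next reference phoneme (fluent speech),
      - "repeat":  re-emit the current phoneme (repetition/prolongation),
      - "pause":   emit silence while staying in place (block/pause).
    `posteriors[t][p]` is P(phoneme p | frame t) from an upstream encoder.
    Returns one arc label per frame along the best path."""
    T, N = len(posteriors), len(ref)
    NEG = float("-inf")
    dp = [NEG] * N                      # dp[i]: best log-score at position i
    back = [[None] * N for _ in range(T)]
    dp[0] = math.log(posteriors[0][ref[0]] + 1e-12)
    back[0][0] = (0, "advance")
    for t in range(1, T):
        ndp = [NEG] * N
        for i in range(N):
            cands = []
            if dp[i] > NEG:
                # stay-arcs carry a penalty so fluent advancing is preferred
                cands.append((dp[i] + math.log(posteriors[t][ref[i]] + 1e-12)
                              - stay_penalty, i, "repeat"))
                cands.append((dp[i] + math.log(posteriors[t][sil_idx] + 1e-12)
                              - stay_penalty, i, "pause"))
            if i > 0 and dp[i - 1] > NEG:
                cands.append((dp[i - 1] + math.log(posteriors[t][ref[i]] + 1e-12),
                              i - 1, "advance"))
            if cands:
                score, prev_i, label = max(cands)
                ndp[i] = score
                back[t][i] = (prev_i, label)
        dp = ndp
    # backtrace from the final reference position
    labels, i = [], N - 1
    for t in range(T - 1, -1, -1):
        prev_i, label = back[t][i]
        labels.append(label)
        i = prev_i
    return list(reversed(labels))

# Toy example: 3 frames, phoneme inventory {0, 1, 2=silence}, reference /0 1/.
# The second frame re-emits phoneme 0, so the best path takes a "repeat" arc.
labels = dysfluency_viterbi(
    ref=[0, 1], sil_idx=2,
    posteriors=[[0.9, 0.05, 0.05], [0.9, 0.05, 0.05], [0.05, 0.9, 0.05]],
)
print(labels)  # → ['advance', 'repeat', 'advance']
```

Because dysfluency events fall out of the chosen arcs rather than a separate classifier, no extra training or annotation is needed, which mirrors the zero-shot, interpretable design the abstract emphasizes.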