Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional disfluency detection methods are largely confined to coarse-grained classification tasks, lacking the fine-grained clinical insights required for speech therapy; audio-only models further suffer from high false-positive rates due to their neglect of textual context. This paper proposes the first zero-shot joint decoding framework that simultaneously performs phoneme transcription and disfluency event detection (e.g., repetitions, pauses, fillers), without fine-tuning pretrained encoders (e.g., WavLM) or requiring additional annotations. Our core innovation is a lightweight, interpretable decoding graph built upon weighted finite-state transducers (WFSTs), explicitly modeling articulatory anomalies. Evaluated on both synthetic and real-world speech disorder datasets, our method achieves new state-of-the-art performance: a 12.3% reduction in phoneme error rate and a 9.7% improvement in disfluency detection F1-score—significantly outperforming both classification-based and end-to-end baselines.

Technology Category

Application Category

📝 Abstract
Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.
Problem

Research questions and friction points this paper is trying to address.

Detects and transcribes speech dysfluency without prior training
Improves clinical insight over traditional classification methods
Combines phoneme transcription and dysfluency detection in one framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot decoder for phoneme transcription and dysfluency detection
Works with upstream encoders like WavLM without additional training
Lightweight and interpretable with explicit pronunciation behavior modeling
🔎 Similar Papers
No similar papers found.
C
Chenxu Guo
Zhejiang University, China
Jiachen Lian
Jiachen Lian
UC Berkeley
precision healthcarespeech processingmachine learning
Xuanru Zhou
Xuanru Zhou
Zhejiang University
Speech ProcessingMultimodalRepresentation Learning
Jinming Zhang
Jinming Zhang
Queen Mary University of London
LLMsLLMs in Game
S
Shuhe Li
Zhejiang University, China
Z
Zongli Ye
Zhejiang University, China
H
Hwi Joo Park
UC Berkeley, United States
A
Anaisha Das
UC Berkeley, United States
Zoe Ezzes
Zoe Ezzes
Research Speech-Language Pathologist, University of California, San Francisco
languagecognitionaphasianeurogenic communication disorders
J
Jet Vonk
UCSF, United States
B
Brittany Morin
UCSF, United States
R
Rian Bogley
UCSF, United States
L
Lisa Wauters
UCSF, United States
Zachary Miller
Zachary Miller
Associate Professor of Neurology, UCSF Memory and Aging Center
Behavioral NeurologyDementiaNeurodevelopmentImmunology
M
Maria Gorno-Tempini
UCSF, United States
G
Gopala Anumanchipalli
UC Berkeley, United States