🤖 AI Summary
Stuttering detection is hindered by the scarcity of high-quality annotated speech data. To address this, we propose LLM-Dys -- the first large language model (LLM)-enhanced, large-scale dysfluency corpus, covering 11 types of word- and phoneme-level stuttering phenomena. Our method employs an LLM-driven, multi-level stuttering modeling framework that jointly leverages text-to-speech (TTS) synthesis and multi-granularity annotation, significantly improving the prosodic naturalness and contextual diversity of synthetic speech. Leveraging this corpus, we develop an ASR-agnostic, end-to-end stuttering detection framework that achieves state-of-the-art performance across multiple benchmarks. All data, models, and code are publicly released, establishing a reproducible and scalable foundation for clinical speech analysis.
📝 Abstract
Speech dysfluency detection is crucial for clinical diagnosis and language assessment, but existing methods are limited by the scarcity of high-quality annotated data. Although recent advances in TTS models have enabled synthetic dysfluency generation, existing synthetic datasets suffer from unnatural prosody and limited contextual diversity. To address these limitations, we propose LLM-Dys -- the most comprehensive dysfluent speech corpus to date, built with LLM-enhanced dysfluency simulation. The dataset covers 11 dysfluency categories spanning both the word and phoneme levels. Building on this resource, we improve an end-to-end dysfluency detection framework, and experimental validation demonstrates state-of-the-art performance. All data, models, and code are open-sourced at https://github.com/Berkeley-Speech-Group/LLM-Dys.
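To make the simulation idea concrete, the sketch below shows what text-level dysfluency injection prior to TTS synthesis could look like. It is a minimal illustration only: the category names (`repetition`, `interjection`), the label format, and the helper functions are assumptions for exposition, not the actual LLM-Dys pipeline or annotation scheme.

```python
import random

# Hypothetical word-level injectors. Each returns the modified word list
# plus a (category, position) label that a detector could train against.
def inject_repetition(words, idx):
    # e.g. "the cat" -> "the the cat": duplicate the word at idx
    return words[:idx] + [words[idx], words[idx]] + words[idx + 1:], ("repetition", idx)

def inject_interjection(words, idx, filler="uh"):
    # e.g. "the cat" -> "the uh cat": insert a filler word before idx
    return words[:idx] + [filler] + words[idx:], ("interjection", idx)

def simulate(sentence, seed=0):
    """Pick a random position and dysfluency type, inject it, and
    return the dysfluent text (to be fed to a TTS model) with its label."""
    rng = random.Random(seed)  # seeded for reproducible corpus generation
    words = sentence.split()
    idx = rng.randrange(len(words))
    injector = rng.choice([inject_repetition, inject_interjection])
    out_words, label = injector(words, idx)
    return " ".join(out_words), label
```

In a full pipeline, the injected text would be synthesized by a TTS model and the `(category, position)` label converted into a time-aligned annotation; an LLM can replace the uniform random choices here with context-sensitive placement, which is what improves naturalness and diversity.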