UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech therapy systems lack real-time articulatory feedback, while multimodal large language models (MLLMs) face bottlenecks in speech rehabilitation—including insufficient acquisition and fusion of articulatory information, coarse-grained tongue motion trajectory analysis, and scarcity of domain-specific data. To address these challenges, we introduce the first high-quality, ultrasound tongue imaging–speech paired dataset specifically designed for speech therapy. We propose a spatiotemporal fusion training strategy integrating ultrasound video and acoustic signals, and develop a novel MLLM that jointly processes ultrasound imagery, raw speech waveforms, and textual dialogue. Leveraging cross-modal alignment and data-driven fine-tuning, our model enables fine-grained spatiotemporal modeling of tongue dynamics and generates clinically interpretable, real-time articulatory feedback. Evaluated in authentic clinical settings, the system significantly improves phoneme error detection accuracy, reduces feedback latency, and raises feedback quality, thereby enhancing rehabilitation precision, interactivity, and scalability.

📝 Abstract
Speech therapy plays a critical role in treating speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback, constraining their practical utility. Recent advances in multimodal large language models (MLLMs) have demonstrated significant potential in healthcare, particularly through their ability to integrate multimodal data for adaptive assessment and therapeutic feedback. Nevertheless, challenges including insufficient acquisition and fusion of articulatory information, inadequate parsing of articulatory organ motion trajectories, and the scarcity of high-quality domain-specific datasets hinder the application of MLLMs in speech therapy. To address these limitations, we propose an MLLM-based speech rehabilitation assistance system that synergistically leverages ultrasound tongue imaging and speech signals to deliver precise, interactive articulatory feedback. We construct a high-quality domain-specific dataset comprising UTI-speech dialogue pairs, which supports fine-tuning to enhance the model's clinical adaptability. Building on this dataset, our method implements a spatiotemporal fusion training strategy over ultrasound videos and speech signals, enabling fine-grained articulatory impairment analysis and ultimately generating actionable feedback.
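The abstract describes joint processing of ultrasound tongue imaging, speech waveforms, and text, but gives no architectural details. The following is a minimal, hypothetical numpy sketch of the general idea of early token-level multimodal fusion (per-frame ultrasound tokens and per-frame speech tokens projected into a shared embedding space and concatenated with text tokens); all function names, shapes, and projection choices here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_ultrasound(frames, W):
    """Map each ultrasound frame (T, H, W) to one d-dim token via a linear projection."""
    T = frames.shape[0]
    flat = frames.reshape(T, -1)            # (T, H*W) — flatten each frame
    return flat @ W                         # (T, d)  — one token per video frame

def embed_speech(waveform, frame_len, W):
    """Split the raw waveform into fixed-length frames; one d-dim token per frame."""
    n = len(waveform) // frame_len
    chunks = waveform[: n * frame_len].reshape(n, frame_len)
    return chunks @ W                       # (n, d)

def fuse_tokens(uti_tokens, speech_tokens, text_tokens):
    """Early fusion: concatenate modality token sequences along the time axis."""
    return np.concatenate([uti_tokens, speech_tokens, text_tokens], axis=0)

d = 16
uti = embed_ultrasound(rng.standard_normal((8, 32, 32)),
                       rng.standard_normal((32 * 32, d)))   # 8 ultrasound tokens
spk = embed_speech(rng.standard_normal(1600), 160,
                   rng.standard_normal((160, d)))           # 10 speech tokens
txt = rng.standard_normal((5, d))                           # 5 text tokens (placeholder)
seq = fuse_tokens(uti, spk, txt)
print(seq.shape)  # -> (23, 16)
```

In a real system the linear projections would be learned encoders and the fused sequence would feed an LLM backbone; the sketch only illustrates how the three modalities can share one token sequence.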
Problem

Research questions and friction points this paper is trying to address.

Addresses insufficient articulatory data acquisition and fusion in speech therapy
Overcomes inadequate parsing of articulatory organ motion trajectories
Solves scarcity of high-quality domain-specific datasets for MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal fusion of ultrasound and speech signals
Spatiotemporal training strategy for fine-grained analysis
Domain-specific dataset for clinical adaptability enhancement
Yudong Yang
Tsinghua University
Multimodal LLM · Speech Processing
Xiaokang Liu
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Shaofeng Zhao
Department of Rehabilitation Medicine, The Eighth Affiliated Hospital of Sun Yat-sen University, Shenzhen, China
Rongfeng Su
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Nan Yan
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Key Laboratory of Biomedical Imaging Science and System, Chinese Academy of Sciences, Shenzhen, China
Lan Wang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Key Laboratory of Biomedical Imaging Science and System, Chinese Academy of Sciences, Shenzhen, China