🤖 AI Summary
This study investigates the temporal perception mechanisms underlying tone processing in self-supervised learning (SSL) speech models applied to low-resource tonal languages: Burmese, Thai, Lao, and Vietnamese. Using probe-based analysis and gradient-based attribution within a multilingual transfer fine-tuning framework, we quantify the temporal sensitivity range of tone representations. We establish, for the first time, language-specific baseline tone durations for all four languages: about 100 ms for Burmese and Thai, and about 180 ms for Lao and Vietnamese. Crucially, we find that downstream tasks strongly shape temporal modeling preferences: automatic speech recognition (ASR) fine-tuning induces spontaneous alignment with language-specific tone durations, whereas prosody- and voice-related tasks lead to over-reliance on extended temporal spans (>300 ms). These findings reveal a task-driven temporal attention mechanism for tone processing and provide interpretable, empirical guidance for designing and adapting SSL models for low-resource tonal languages.
📝 Abstract
Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems (Burmese, Thai, Lao, and Vietnamese) to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues at about 100 ms in Burmese and Thai and about 180 ms in Lao and Vietnamese. Probes and gradient analyses of fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by the downstream task, highlighting task effects on temporal focus in tone modeling.
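The gradient-based attribution idea behind these span estimates can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual pipeline: the convolutional stand-in model, the 20 ms frame stride, and all dimensions (`T`, `D`, `NUM_TONES`) are assumptions for demonstration. It attributes a tone prediction at one frame back to the input frames via the gradient, and the set of frames with nonzero gradient approximates "how far the model listens" for that prediction.

```python
import torch

FRAME_MS = 20                 # assumed frame stride of the SSL front-end
T, D, NUM_TONES = 50, 64, 5   # illustrative: frames, feature dim, tone classes

torch.manual_seed(0)
# Synthetic stand-in for per-frame SSL representations of one utterance.
feats = torch.randn(1, D, T, requires_grad=True)   # (batch, dim, frames)

# Stand-in "model": a 1-D conv stack whose receptive field bounds how far
# in time a tone prediction can look (real work probes fine-tuned SSL layers).
model = torch.nn.Sequential(
    torch.nn.Conv1d(D, D, kernel_size=5, padding=2),
    torch.nn.ReLU(),
    torch.nn.Conv1d(D, NUM_TONES, kernel_size=5, padding=2),
)
logits = model(feats)                              # (1, tones, frames)

# Attribute the predicted tone at the utterance's center frame back to the
# input: frames with nonzero gradient are the ones the prediction relies on.
center = T // 2
logits[0, logits[0, :, center].argmax(), center].backward()
saliency = feats.grad.abs().sum(dim=1).squeeze(0)  # per-frame importance, (T,)

span_frames = int((saliency > 0).sum())
print(f"estimated temporal span: {span_frames * FRAME_MS} ms")
```

With two kernel-size-5 convolutions the receptive field is 9 frames, so the estimated span here cannot exceed 180 ms; in the study, the analogous quantity is measured on fine-tuned SSL models and compared against the language-specific baseline tone durations.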