🤖 AI Summary
This study investigates the temporal perception mechanisms underlying tone processing in self-supervised learning (SSL) speech models applied to low-resource tonal languages: Burmese, Thai, Lao, and Vietnamese. Using probe-based analysis and gradient-based attribution within a multilingual transfer fine-tuning framework, we quantify the temporal sensitivity range of tone representations. We establish, for the first time, language-specific baseline tone durations for all four languages: about 100 ms for Burmese and Thai, and about 180 ms for Lao and Vietnamese. Crucially, we find that downstream tasks strongly shape temporal modeling preferences: automatic speech recognition (ASR) fine-tuning induces spontaneous alignment with language-specific tone durations, whereas prosody- and voice-related tasks lead to over-reliance on extended temporal spans (>300 ms). These findings reveal a task-driven temporal attention mechanism for tone processing and provide interpretable, empirical guidance for designing and adapting SSL models for low-resource tonal languages.
📝 Abstract
Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems (Burmese, Thai, Lao, and Vietnamese) to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues at about 100 ms in Burmese and Thai and about 180 ms in Lao and Vietnamese. Probes and gradient analyses of fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by the downstream task, highlighting task effects on temporal focus in tone modeling.
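The gradient-based attribution idea behind these span estimates can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual pipeline: the convolutional stand-in model, the 20 ms frame stride, and all dimensions (`T`, `D`, `NUM_TONES`) are assumptions for demonstration. It attributes a tone prediction at one frame back to the input frames via the gradient, and the set of frames with nonzero gradient approximates "how far the model listens" for that prediction.

```python
import torch

FRAME_MS = 20                 # assumed frame stride of the SSL front-end
T, D, NUM_TONES = 50, 64, 5   # illustrative: frames, feature dim, tone classes

torch.manual_seed(0)
# Synthetic stand-in for per-frame SSL representations of one utterance.
feats = torch.randn(1, D, T, requires_grad=True)   # (batch, dim, frames)

# Stand-in "model": a 1-D conv stack whose receptive field bounds how far
# in time a tone prediction can look (real work probes fine-tuned SSL layers).
model = torch.nn.Sequential(
    torch.nn.Conv1d(D, D, kernel_size=5, padding=2),
    torch.nn.ReLU(),
    torch.nn.Conv1d(D, NUM_TONES, kernel_size=5, padding=2),
)
logits = model(feats)                              # (1, tones, frames)

# Attribute the predicted tone at the utterance's center frame back to the
# input: frames with nonzero gradient are the ones the prediction relies on.
center = T // 2
logits[0, logits[0, :, center].argmax(), center].backward()
saliency = feats.grad.abs().sum(dim=1).squeeze(0)  # per-frame importance, (T,)

span_frames = int((saliency > 0).sum())
print(f"estimated temporal span: {span_frames * FRAME_MS} ms")
```

With two kernel-size-5 convolutions the receptive field is 9 frames, so the estimated span here cannot exceed 180 ms; in the study, the analogous quantity is measured on fine-tuned SSL models and compared against the language-specific baseline tone durations.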