SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in low-resource tonal languages where speech representations often struggle to simultaneously achieve speaker invariance and tone sensitivity, thereby limiting cross-gender lexical recognition and tone discrimination performance. The authors propose SITA, a lightweight adaptation method that fine-tunes pretrained encoders such as XLS-R through staged multi-objective training. By integrating cross-gender contrastive learning, tone-repulsion loss, and CTC-based knowledge distillation, SITA enhances speaker-invariant representations while preventing tone collapse. Evaluated on a Hmong word corpus, the approach significantly improves cross-gender lexical retrieval accuracy without compromising ASR performance. Further transfer experiments on Mandarin Chinese demonstrate the method’s generalizability across tonal languages.

📝 Abstract
Tonal low-resource languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where off-the-shelf multilingual encoders fail to represent tone effectively. On a curated Hmong word corpus, SITA improves cross-gender lexical retrieval accuracy, while maintaining usable ASR accuracy relative to an ASR-adapted XLS-R teacher. We further observe similar gains when transferring the same recipe to Mandarin, suggesting SITA is a general, plug-in approach for adapting multilingual speech encoders to tonal languages.
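The abstract outlines two representation objectives: a cross-gender contrastive loss that pulls together same-word embeddings from different speakers, and a tone-repulsive loss that pushes apart same-segment, different-tone realizations. A minimal NumPy sketch of plausible forms for these two terms follows; the InfoNCE-style formulation, the margin, and the 0.3 weighting are illustrative assumptions, not the paper's exact losses:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cross_gender_contrastive(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: pull the same word spoken by a different-gender
    speaker (positive) toward the anchor, push other words (negatives) away.
    Hypothetical form of the paper's cross-gender contrastive objective."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = np.array(sims) / tau
    logits -= logits.max()  # numerical stability before exponentiation
    return float(-logits[0] + np.log(np.exp(logits).sum()))

def tone_repulsion(anchor, same_word_other_tone, margin=0.5):
    """Margin penalty that fires when a same-segment, different-tone
    realization is too similar to the anchor (prevents tone collapse).
    Hypothetical form of the paper's tone-repulsive loss."""
    return max(0.0, cosine(anchor, same_word_other_tone) - (1.0 - margin))

# Toy 16-dim embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
a = rng.normal(size=16)                # anchor: word w, tone t, speaker A
p = a + 0.1 * rng.normal(size=16)      # positive: word w, tone t, speaker B
negs = [rng.normal(size=16) for _ in range(5)]  # other words
t2 = a + 0.05 * rng.normal(size=16)    # same segment, different tone

# Combined objective (0.3 is an assumed repulsion weight).
loss = cross_gender_contrastive(a, p, negs) + 0.3 * tone_repulsion(a, t2)
print(round(loss, 3))
```

In the staged recipe described above, terms like these would be combined with the auxiliary CTC/distillation objective rather than used alone; the sketch only illustrates how invariance and tone separation can pull in opposite directions on the same embedding space.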
Problem

Research questions and friction points this paper is trying to address.

tonal languages
speaker-invariance
tone-awareness
low-resource speech
speech representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

speaker-invariance
tone-awareness
contrastive learning
tone-repulsive loss
multi-objective training
Tianyi Xu
Tulane University
Reinforcement Learning, Network Optimization, Statistics, NLP (LLM), Operations Research
Xuan Ouyang
Department of Computer Sciences, University of Wisconsin–Madison, Madison, WI, USA
Binwei Yao
Ph.D. Student, University of Wisconsin-Madison
Shoua Xiong
School of Nursing, University of Wisconsin–Madison, Madison, WI, USA
Sara M. Misurelli
Department of Otolaryngology, University of Wisconsin–Madison, Madison, WI, USA
Maichou Lor
School of Nursing, University of Wisconsin–Madison, Madison, WI, USA
Junjie Hu
Assistant Professor, University of Wisconsin-Madison
Natural Language Processing, Machine Learning