SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in low-resource tonal languages where speech representations often struggle to simultaneously achieve speaker invariance and tone sensitivity, thereby limiting cross-gender lexical recognition and tone discrimination performance. The authors propose SITA, a lightweight adaptation method that fine-tunes pretrained encoders such as XLS-R through staged multi-objective training. By integrating cross-gender contrastive learning, tone-repulsion loss, and CTC-based knowledge distillation, SITA enhances speaker-invariant representations while preventing tone collapse. Evaluated on a Hmong word corpus, the approach significantly improves cross-gender lexical retrieval accuracy without compromising ASR performance. Further transfer experiments on Mandarin Chinese demonstrate the method’s generalizability across tonal languages.

📝 Abstract
Tonal low-resource languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where off-the-shelf multilingual encoders fail to represent tone effectively. On a curated Hmong word corpus, SITA improves cross-gender lexical retrieval accuracy, while maintaining usable ASR accuracy relative to an ASR-adapted XLS-R teacher. We further observe similar gains when transferring the same recipe to Mandarin, suggesting SITA is a general, plug-in approach for adapting multilingual speech encoders to tonal languages.
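The abstract outlines two representation objectives: a cross-gender contrastive loss that pulls together same-word embeddings from different speakers, and a tone-repulsive loss that pushes apart same-segment, different-tone realizations. A minimal NumPy sketch of plausible forms for these two terms follows; the InfoNCE-style formulation, the margin, and the 0.3 weighting are illustrative assumptions, not the paper's exact losses:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cross_gender_contrastive(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: pull the same word spoken by a different-gender
    speaker (positive) toward the anchor, push other words (negatives) away.
    Hypothetical form of the paper's cross-gender contrastive objective."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = np.array(sims) / tau
    logits -= logits.max()  # numerical stability before exponentiation
    return float(-logits[0] + np.log(np.exp(logits).sum()))

def tone_repulsion(anchor, same_word_other_tone, margin=0.5):
    """Margin penalty that fires when a same-segment, different-tone
    realization is too similar to the anchor (prevents tone collapse).
    Hypothetical form of the paper's tone-repulsive loss."""
    return max(0.0, cosine(anchor, same_word_other_tone) - (1.0 - margin))

# Toy 16-dim embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
a = rng.normal(size=16)                # anchor: word w, tone t, speaker A
p = a + 0.1 * rng.normal(size=16)      # positive: word w, tone t, speaker B
negs = [rng.normal(size=16) for _ in range(5)]  # other words
t2 = a + 0.05 * rng.normal(size=16)    # same segment, different tone

# Combined objective (0.3 is an assumed repulsion weight).
loss = cross_gender_contrastive(a, p, negs) + 0.3 * tone_repulsion(a, t2)
print(round(loss, 3))
```

In the staged recipe described above, terms like these would be combined with the auxiliary CTC/distillation objective rather than used alone; the sketch only illustrates how invariance and tone separation can pull in opposite directions on the same embedding space.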
Problem

Research questions and friction points this paper is trying to address.

tonal languages
speaker-invariance
tone-awareness
low-resource speech
speech representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

speaker-invariance
tone-awareness
contrastive learning
tone-repulsive loss
multi-objective training
Tianyi Xu
Tulane University
Reinforcement Learning, Network Optimization, Statistics, NLP (LLM), Operations Research
Xuan Ouyang
Department of Computer Sciences, University of Wisconsin–Madison, Madison, WI, USA
Binwei Yao
Ph.D. Student, University of Wisconsin-Madison
Shoua Xiong
School of Nursing, University of Wisconsin–Madison, Madison, WI, USA
Sara M. Misurelli
Department of Otolaryngology, University of Wisconsin–Madison, Madison, WI, USA
Maichou Lor
School of Nursing, University of Wisconsin–Madison, Madison, WI, USA
Junjie Hu
Assistant Professor, University of Wisconsin-Madison
Natural Language Processing, Machine Learning