Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether multilingual speech foundation models genuinely align cross-lingual representations at the semantic level or merely exploit phonetic similarity. Method: We introduce pronunciation-controlled retrieval experiments and a word-level contrast between cross-lingual synonyms and near-homophones to empirically assess alignment granularity; further, we design an early-exit mechanism for the speech encoder to guide low-resource ASR. Contribution/Results: We provide empirical evidence that speech encoders achieve both phoneme-level and semantic-level alignment. In pronunciation-controlled spoken-translation retrieval, accuracy remains stable even in the absence of phonetic cues. When transferred to seven low-resource languages unsupported by Whisper, the early-exit mechanism consistently improves ASR accuracy, especially for languages with transparent orthographies. Our work points toward more semantically interpretable speech representations and their adaptation to low-resource settings.

📝 Abstract
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. Such an alignment has also been observed in speech foundation models. However, it remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech. Building on prior work on spoken translation retrieval, we perform pronunciation-controlled experiments to observe if cross-lingual alignment can indeed occur in such models on a semantic basis, instead of relying on phonetic similarities. Our findings indicate that even in the absence of phonetic cues, spoken translation retrieval accuracy remains relatively stable. We follow up with a controlled experiment on a word-level dataset of cross-lingual synonyms and near-homophones, confirming the existence of both phonetic and semantic knowledge in the encoder. Finally, we qualitatively examine the transcriptions produced by early exiting the encoder, where we observe that speech translation produces semantic errors that are characterized by phonetic similarities to corresponding words in the source language. We apply this insight from early exiting to speech recognition in seven low-resource languages unsupported by the Whisper model, and achieve improved accuracy in all languages examined, particularly for languages with transparent orthographies.
Problem

Research questions and friction points this paper is trying to address.

Investigates cross-lingual alignment in speech foundation models
Examines semantic vs phonetic alignment in spoken translation retrieval
Improves speech recognition accuracy in low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pronunciation-controlled experiments for cross-lingual alignment
Word-level dataset analysis for phonetic and semantic knowledge
Early exiting technique improves low-resource speech recognition
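The early-exit idea named above can be illustrated with a minimal sketch: run only the first k layers of a layered encoder and return that intermediate representation, on the paper's observation that lower layers carry more phonetic (and upper layers more semantic) information. This is a toy NumPy stand-in for a real speech encoder such as Whisper, not the authors' implementation; the layer count, dimensions, and `encode` function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": a stack of fixed linear maps with a tanh nonlinearity.
# A real speech encoder (e.g. Whisper) uses transformer blocks; this
# toy only illustrates the control flow of early exiting.
DIM, N_LAYERS = 16, 12
weights = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(N_LAYERS)]

def encode(x, exit_layer=N_LAYERS):
    """Run the encoder, stopping after `exit_layer` layers.

    Early exiting returns an intermediate representation, which lower
    layers make more phonetic than semantic -- the property exploited
    for low-resource ASR on transparent orthographies.
    """
    for w in weights[:exit_layer]:
        x = np.tanh(x @ w)  # one toy encoder layer
    return x

features = rng.standard_normal((10, DIM))  # 10 "frames" of input
full = encode(features)                    # all 12 layers
early = encode(features, exit_layer=4)     # exit after layer 4

print(full.shape, early.shape)             # both (10, 16)
print(np.allclose(full, early))            # extra layers change the output
```

In practice the same pattern amounts to reading a chosen intermediate hidden state out of the encoder instead of its final output, then decoding from it.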
Ryan Soh-Eun Shim
MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany
Domenico De Cristofaro
ALPS, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Chengzhi Hu
MaiNLP, Center for Information and Language Processing, LMU Munich, Germany
Alessandro Vietti
Associate Professor of Linguistics, Free University of Bozen-Bolzano
laboratory phonology, phonetics, sociolinguistics, quantitative linguistics
Barbara Plank
Professor, LMU Munich; Visiting Professor, ITU Copenhagen
Natural Language Processing, Computational Linguistics, Machine Learning, Transfer Learning