Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
How do modality adapters (MAs) in spoken language models (SLMs) transform speech-encoder outputs into representations the decoder language model can use? Method: The authors analyze MA output representations in three prominent SLMs (SALMONN, Qwen2-Audio, and Phi-4-Multimodal-Instruct) by finding the nearest decoder-LM token to each MA representation. Contribution/Results: They uncover two distinct representation strategies: (1) for models with a Whisper encoder, semantic representations mediated through an English-based interlingua, which lets the models handle languages unseen in instruction tuning; and (2) for models without one, such as Phi-4-Multimodal-Instruct, phonetic representations of the input spelled out in English tokens. The authors hypothesise that which strategy arises depends on whether the speech encoder is pretrained only for speech recognition or also for translation. This dichotomy offers a unified explanation for observed cross-lingual generalization disparities across SLMs and informs MA design for multilingual speech understanding.

📝 Abstract
Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don't, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which strategy arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
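The probe described in the abstract is straightforward: compare each MA output vector against the rows of the decoder LM's input-embedding matrix and read off the nearest token. A minimal sketch of that idea (the function name and toy vocabulary are illustrative, and cosine similarity is an assumed choice of distance; the paper's exact metric is not stated here):

```python
import numpy as np

def nearest_tokens(ma_outputs, embedding_matrix, vocab, k=1):
    """For each adapter output vector, return the k decoder-LM tokens
    whose embedding rows are closest in cosine similarity."""
    # Normalize rows so a plain dot product equals cosine similarity.
    ma = ma_outputs / np.linalg.norm(ma_outputs, axis=1, keepdims=True)
    emb = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    sims = ma @ emb.T                        # shape: (num_frames, vocab_size)
    top = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest tokens
    return [[vocab[i] for i in row] for row in top]

# Toy demo: a made-up 4-token vocabulary with random embeddings.
rng = np.random.default_rng(0)
vocab = ["hello", "world", "bon", "jour"]
emb = rng.normal(size=(4, 8))
# Fake adapter outputs that sit near the "bon" and "jour" embeddings.
ma_out = emb[[2, 3]] + 0.01 * rng.normal(size=(2, 8))
print(nearest_tokens(ma_out, emb, vocab))   # → [['bon'], ['jour']]
```

Reading the nearest-token strings frame by frame is what reveals whether the adapter is emitting an English paraphrase of the meaning or an English-spelled rendering of the sounds.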
Problem

Research questions and friction points this paper is trying to address.

Investigates how modality adapters transform speech representations in language models
Compares phonetic versus semantic representation strategies across three SLMs
Explores how speech encoder training affects multilingual processing capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality adapters map speech encoder outputs into the decoder LM's representation space
Models with Whisper encoders use an English-based interlingua to represent meaning across languages
Models without Whisper encoders represent the phonetics of the input using English words
🔎 Similar Papers
2024-07-22 · arXiv.org · Citations: 4