Revisiting Modality Invariance in a Multilingual Speech-Text Model via Neuron-Level Analysis

📅 2026-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether the multilingual speech-text foundation model SeamlessM4T v2 internally maintains modality-invariant representations, i.e., consistent encoding of the same language whether it arrives as speech or as text. Through neuron-level analyses (average-precision ranking of selective neurons, median-replacement interventions at inference time, measurement of activation-magnitude disparities, and inspection of cross-modal decoding behavior), the work systematically examines where language and modality information is localized, how it causally influences decoding, and how concentrated it is across the network. The findings show, for the first time, that modality invariance in SeamlessM4T v2 is incomplete: the decoder struggles to recover language identity from compressed encoder representations, and the cross-attention key and value projections exhibit highly localized modality-selective structure. Language information degrades more severely when adapting from speech to text, and non-dominant modalities rely on a sparse set of highly activated neurons, leaving cross-modal and cross-lingual performance notably fragile.
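The average-precision ranking mentioned in the summary can be illustrated with a small sketch: rank examples by a single neuron's activation and score how cleanly that ranking separates one class (e.g. one language or modality) from the rest. This is a generic illustration of the statistic, not the authors' code; the function name and inputs are hypothetical.

```python
def average_precision(activations, is_target):
    """Average precision of ranking examples by one neuron's activation.

    activations: per-example activation values for a single neuron.
    is_target: 1 if the example belongs to the class of interest
    (e.g. a given language), else 0. A score of 1.0 means the neuron's
    activation perfectly separates the target class from the rest.
    """
    # Sort example indices by activation, highest first.
    order = sorted(range(len(activations)), key=lambda i: -activations[i])
    hits, total, ap = 0, sum(is_target), 0.0
    for rank, i in enumerate(order, start=1):
        if is_target[i]:
            hits += 1
            ap += hits / rank  # precision at each positive example
    return ap / total
```

Neurons are then ranked by this score per language or per modality; the top-scoring ones are the "selective" neurons the analyses probe.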

📝 Abstract
Multilingual speech-text foundation models aim to process language uniformly across both modality and language, yet it remains unclear whether they internally represent the same language consistently when it is spoken versus written. We investigate this question in SeamlessM4T v2 through three complementary analyses that probe where language and modality information is encoded, how selective neurons causally influence decoding, and how concentrated this influence is across the network. We identify language- and modality-selective neurons using average-precision ranking, investigate their functional role via median-replacement interventions at inference time, and analyze activation-magnitude inequality across languages and modalities. Across experiments, we find evidence of incomplete modality invariance. Although encoder representations become increasingly language-agnostic, this compression makes it more difficult for the shared decoder to recover the language of origin when constructing modality-agnostic representations, particularly when adapting from speech to text. We further observe sharply localized modality-selective structure in cross-attention key and value projections. Finally, speech-conditioned decoding and non-dominant scripts exhibit higher activation concentration, indicating heavier reliance on a small subset of neurons, which may underlie increased brittleness across modalities and languages.
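The median-replacement intervention described in the abstract can be sketched as follows: at inference time, the activations of a chosen set of neurons are overwritten with their per-neuron medians computed over a reference set, so their example-specific signal is removed while the network keeps receiving typical values. This is a minimal stand-alone sketch of the idea, not the paper's implementation; the function name and data layout are assumptions.

```python
from statistics import median

def median_replace(batch_acts, neuron_ids, reference_acts):
    """Ablate selected neurons by replacing their activations with medians.

    batch_acts: list of per-example activation vectors to intervene on.
    neuron_ids: indices of the neurons to ablate.
    reference_acts: activation vectors used to estimate each neuron's
    typical (median) value. Other neurons pass through unchanged.
    """
    medians = {j: median(row[j] for row in reference_acts) for j in neuron_ids}
    return [
        [medians.get(j, a) for j, a in enumerate(row)]
        for row in batch_acts
    ]
```

If decoding quality drops when the top language- or modality-selective neurons are ablated this way, that is evidence they causally carry the corresponding information.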
Problem

Research questions and friction points this paper addresses.

modality invariance
multilingual speech-text model
neuron-level analysis
language representation
cross-modal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality invariance
neuron-level analysis
multilingual speech-text model
cross-attention structure
activation concentration
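The activation-concentration finding highlighted above (heavier reliance on a small subset of neurons for speech-conditioned decoding and non-dominant scripts) can be quantified with an inequality measure such as the Gini coefficient over per-neuron activation magnitudes. Whether the paper uses exactly this statistic is an assumption; this is one standard way to measure such concentration.

```python
def gini(activations):
    """Gini coefficient of non-negative activation magnitudes.

    0.0 means activation mass is spread evenly across neurons;
    values approaching 1.0 mean it is concentrated in a few neurons.
    """
    x = sorted(activations)
    n, total = len(x), sum(x)
    # Standard closed form over the sorted values.
    weighted = sum(rank * value for rank, value in enumerate(x, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

A higher coefficient for speech inputs or non-dominant scripts would indicate that a sparse set of highly activated neurons is doing most of the work, consistent with the brittleness the paper reports.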