Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

📅 2026-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study compares how multimodal language models recall facts from text input and from speech input, asking whether factual knowledge is encoded, stored, and retrieved in the same way across the two modalities. Working with SpiritLM, a speech language model built on discrete speech tokens, it applies causal mediation analysis, a technique previously used on text-only models, to compare text-to-text and speech-to-text factual recall. Initial results show discrepancies between the two settings, suggesting that the factual recall mechanisms that emerge for text carry over only partially to speech. The authors present this as a step toward understanding how speech language models encode factual associations and as input for improving speech-enabled AI systems.
📝 Abstract
In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms resemble or differ from each other when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate the mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs and contribute insights for improving speech-enabled AI systems.
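For readers unfamiliar with Causal Mediation Analysis, the general idea can be sketched on a toy model: run the model on a clean input, run it again on a corrupted input, then restore one internal component to its clean value during the corrupted run and measure how much of the correct answer's probability comes back. The sketch below is illustrative only and is not the authors' setup: the residual network, its random weights, and the names `forward` and `indirect_effects` are all assumptions made for this example, and no SpiritLM weights or speech tokens are involved.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, blocks, w_out, patch=None):
    """Run a toy residual network (hypothetical stand-in for a language model).

    patch: optional (block_index, vector); overwrites that block's output.
    Returns (output probabilities, list of per-block outputs).
    """
    h = x
    outputs = []
    for i, w in enumerate(blocks):
        out = np.tanh(w @ h)
        if patch is not None and patch[0] == i:
            out = patch[1]   # restore this block's output from another run
        outputs.append(out)
        h = h + out          # residual connection
    return softmax(w_out @ h), outputs

def indirect_effects(x_clean, x_corrupt, blocks, w_out, target):
    """Per-block indirect effect on the probability of the target class."""
    p_clean, clean_outputs = forward(x_clean, blocks, w_out)
    p_corrupt, _ = forward(x_corrupt, blocks, w_out)
    effects = []
    for i in range(len(blocks)):
        # Corrupted run with block i restored to its clean output.
        p_restored, _ = forward(x_corrupt, blocks, w_out,
                                patch=(i, clean_outputs[i]))
        effects.append(float(p_restored[target] - p_corrupt[target]))
    return float(p_clean[target]), float(p_corrupt[target]), effects

# Toy setup: random weights and inputs, assumed purely for illustration.
rng = np.random.default_rng(0)
blocks = [rng.normal(scale=0.5, size=(6, 6)) for _ in range(3)]
w_out = rng.normal(size=(4, 6))
x_clean = rng.normal(size=6)
x_corrupt = x_clean + rng.normal(scale=3.0, size=6)  # noised input

target = int(np.argmax(forward(x_clean, blocks, w_out)[0]))
p_clean, p_corrupt, effects = indirect_effects(
    x_clean, x_corrupt, blocks, w_out, target)

print("total effect:", p_clean - p_corrupt)
for i, effect in enumerate(effects):
    print(f"block {i}: indirect effect = {effect:+.4f}")
```

A block with a large positive indirect effect is one whose clean output alone recovers much of the correct prediction. In a comparison across modalities, the same measurement would be repeated for text and speech inputs and the resulting per-component profiles compared.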
Problem

Research questions and friction points this paper is trying to address.

factual recall
multimodal language models
speech language models
modality transfer
knowledge encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech Language Models
Causal Mediation Analysis
factual recall
multimodal representation
cross-modal transfer
🔎 Similar Papers