🤖 AI Summary
This study investigates how the factual memory mechanisms of multimodal language models compare across text and speech, focusing on whether factual knowledge is encoded, stored, and retrieved consistently in the two modalities. For speech language models based on discrete speech tokens, such as SpiritLM, it introduces causal mediation analysis into multimodal factual recall research for the first time, enabling a systematic comparison between text-to-text and speech-to-text factual recall. The findings indicate that factual memory in the speech modality only partially inherits the mechanisms of the text modality, so cross-modal transfer is incomplete. This insight offers new directions for improving knowledge representation and reasoning in spoken-language AI systems.
📝 Abstract
In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms resemble or differ from each other when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate the mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models.
Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.
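As an illustration of the general idea behind Causal Mediation Analysis, the following is a minimal sketch on a toy network. It is not SpiritLM and not the authors' code: the model, the weights, the mixing matrix, and every name here (`forward`, `mediation_effects`, `mix`) are hypothetical. The sketch runs a clean input, runs a corrupted input (noise on the first position, standing in for the subject tokens), and then re-runs the corrupted input while restoring one clean hidden state at a time. The gain in the target probability is that state's indirect effect.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, layers, w_out, mix, patch=None):
    """Toy stack of layers over a (positions, d) input.

    patch: optional (layer, position, vector); overwrites that hidden state.
    Returns (output probabilities read from the last position, hidden states).
    """
    h = x
    states = []
    for i, W in enumerate(layers):
        # Per-position transform plus causal mixing across positions.
        h = np.tanh(h @ W + mix @ h)
        if patch is not None and patch[0] == i:
            h = h.copy()
            h[patch[1]] = patch[2]
        states.append(h)
    return softmax(w_out @ h[-1]), states

def mediation_effects(x_clean, x_corrupt, layers, w_out, mix, target):
    """Total effect of the corruption and indirect effect of each state."""
    p_clean, clean_states = forward(x_clean, layers, w_out, mix)
    p_corrupt, _ = forward(x_corrupt, layers, w_out, mix)
    total = p_clean[target] - p_corrupt[target]
    n_layers, n_pos = len(layers), x_clean.shape[0]
    effects = np.zeros((n_layers, n_pos))
    for layer in range(n_layers):
        for pos in range(n_pos):
            patch = (layer, pos, clean_states[layer][pos])
            p_patched, _ = forward(x_corrupt, layers, w_out, mix, patch)
            effects[layer, pos] = p_patched[target] - p_corrupt[target]
    return total, effects

# Hypothetical toy setup: 3 positions, width 8, 3 layers, 4 output classes.
rng = np.random.default_rng(0)
n_pos, d = 3, 8
layers = [rng.normal(scale=0.5, size=(d, d)) for _ in range(3)]
w_out = rng.normal(size=(4, d))
mix = np.tril(np.full((n_pos, n_pos), 0.5))  # position p reads positions <= p

x_clean = rng.normal(size=(n_pos, d))
x_corrupt = x_clean.copy()
x_corrupt[0] += rng.normal(scale=3.0, size=d)  # corrupt the "subject" position

target = int(np.argmax(forward(x_clean, layers, w_out, mix)[0]))
total, effects = mediation_effects(x_clean, x_corrupt, layers, w_out, mix, target)
print("total effect:", total)
print("indirect effects (layer x position):")
print(effects)
```

In this toy, restoring the last position at the final layer recovers the clean output exactly, because the output is read only from that state; other states recover only part of it. With a real SLM, the same comparison would be run once with text input and once with speech-token input, and the resulting layer-by-position maps compared.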