🤖 AI Summary
This study presents the first systematic investigation of where and how multimodal integration occurs in large audio language models, whose audio-text fusion mechanisms remain poorly understood. Through causal tracing, inter-layer interventions, and token-level hidden-state analyses of DeSTA, Qwen, and Voxtral, it finds that the final sequence token acts as an information bottleneck, that intermediate tokens integrate cross-modal information via attention-like query mechanisms, and that different architectures employ distinct fusion strategies. The work also identifies the critical pathways and temporal dynamics through which task-relevant acoustic features are extracted and processed.
📝 Abstract
Despite the strong performance of large audio language models (LALMs) on various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an information bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a clear characterization of when and where multimodal integration occurs within LALMs.
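The causal-tracing procedure the abstract describes (cache hidden states from a clean run, corrupt the input, then restore one hidden state at a time and measure how much of the clean output is recovered) can be sketched on a toy model. The dense-layer stack below is a hypothetical stand-in, not any of the three LALMs, and the layer-wise effect scores are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: a stack of dense layers with tanh.
# (Hypothetical model; the paper applies this idea to LALM hidden states.)
W = [rng.standard_normal((8, 8)) * 0.5 for _ in range(4)]

def forward(x, patch=None):
    """Run the stack; optionally overwrite one layer's hidden state.

    patch: (layer_index, saved_hidden_state) or None.
    Returns the final output and the list of per-layer hidden states.
    """
    hiddens = []
    h = x
    for i, w in enumerate(W):
        h = np.tanh(w @ h)
        if patch is not None and patch[0] == i:
            h = patch[1]  # restore the clean hidden state at this layer
        hiddens.append(h)
    return h, hiddens

clean_x = rng.standard_normal(8)                 # "clean" audio+text input
clean_out, clean_hiddens = forward(clean_x)      # cache clean hidden states

corrupt_x = clean_x + rng.standard_normal(8)     # corrupted (noised) input
corrupt_out, _ = forward(corrupt_x)

# Causal effect of layer i = how much restoring its clean hidden state
# moves the corrupted run's output back toward the clean output.
base_err = np.linalg.norm(corrupt_out - clean_out)
effects = []
for i in range(len(W)):
    patched_out, _ = forward(corrupt_x, patch=(i, clean_hiddens[i]))
    effects.append(base_err - np.linalg.norm(patched_out - clean_out))

print([round(e, 3) for e in effects])
```

Restoring the last layer's clean state reproduces the clean output exactly, so its effect equals the full baseline error; in a real LALM the interesting signal is how the effect varies across layers and token positions (e.g., progressive vs. abrupt late-stage fusion).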