🤖 AI Summary
This study addresses the poorly understood cross-modal interaction mechanisms of audio-visual large language models (AVLLMs), in particular where audio and visual information is encoded within each other’s token representations. Through probing analyses, token-level representation dissection, and cross-modal attention investigations, the work systematically examines multimodal information flow and finds that AVLLMs concentrate integrated audio-visual information in sink tokens. This information is not spread uniformly across sink tokens: a distinct subset, termed “cross-modal sink tokens,” specializes in storing it. Building on this insight, the authors propose a training-free hallucination mitigation method that encourages the model to rely on the integrated cross-modal information held in these tokens.
📝 Abstract
Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and visual modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision-language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
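
For readers unfamiliar with the term, the sketch below illustrates one common heuristic for spotting sink tokens: flagging key positions that receive a disproportionate share of attention. This is an illustrative assumption, not the paper's procedure. The abstract does not state how sink tokens or cross-modal sink tokens are identified, and the function name `find_sink_tokens`, the `ratio` threshold, and the toy attention matrix are all hypothetical.

```python
from typing import List


def find_sink_tokens(attn: List[List[float]], ratio: float = 3.0) -> List[int]:
    """Flag key tokens that receive far more attention than a uniform share.

    attn[q][k] is the attention weight from query token q to key token k
    (each row sums to 1). A key is flagged when its mean received attention
    exceeds `ratio` times the uniform share 1 / num_keys.
    The threshold rule is an assumption made for illustration only.
    """
    num_queries = len(attn)
    num_keys = len(attn[0])
    # Average attention each key position receives across all queries.
    received = [
        sum(row[k] for row in attn) / num_queries for k in range(num_keys)
    ]
    uniform_share = 1.0 / num_keys
    return [k for k, r in enumerate(received) if r > ratio * uniform_share]


# Toy causal attention map for 4 tokens (made-up numbers, not model output).
toy_attn = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.10, 0.10, 0.00],
    [0.80, 0.05, 0.05, 0.10],
]

sinks = find_sink_tokens(toy_attn)
print(sinks)  # token 0 absorbs most of the attention mass in this toy example
```

In a real AVLLM the attention map would come from a specific layer and head, and deciding which sink tokens carry cross-modal information would require the kind of analysis the paper describes; the released code at the URL above is the authoritative reference.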