🤖 AI Summary
This work addresses the challenge of enabling audio encoders to efficiently convey semantic information to large language models (LLMs), where cross-modal information transfer remains suboptimal. To this end, the authors frame effective audio–LLM interaction as "probing": the LLM actively interprets audio representations through its attention submodules, aided by delayed audio fusion and a complementary multi-encoder ensemble. Starting from a standard Pengi/LLaVA-style audio-LLM architecture, the proposed modifications are guided by hypotheses derived from mechanistic interpretability studies, and all variants are trained with an identical three-stage curriculum on 5.6 million audio–text pairs for controlled comparison. Experimental results demonstrate consistent relative improvements of 10–60% over the baseline across diverse audio understanding tasks. Notably, the study provides systematic empirical evidence that an LLM can proficiently probe audio representations exclusively through attention, without propagating them to its feed-forward submodules. The resulting framework for audio–LLM co-modeling advances multimodal foundation models beyond passive feature aggregation.
📝 Abstract
The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities (e.g., audio reasoning), have progressed rapidly, the underlying mechanisms governing efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM's ability to proficiently probe the audio encoder's representations to satisfy textual queries. This paper presents a systematic investigation of how architectural design choices affect this probing ability. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM's initial layers have established textual context enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through the attention submodules of its layers, without requiring propagation to the Feed-Forward Network (FFN) submodules; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, broadening the spectrum of audio information the LLM can probe. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements of 10% to 60% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/
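To make the "delayed, attention-only" probing idea concrete, here is a minimal NumPy toy sketch. It is not the paper's implementation: the layer count, `FUSE_LAYER`, dimensions, and weights are all illustrative assumptions. The point it demonstrates is structural: text tokens pass through every layer, while audio tokens are exposed only as extra keys/values in the attention submodule from a chosen layer onward and never enter the FFN.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # model dimension (toy value)
N_LAYERS = 4    # toy depth
FUSE_LAYER = 2  # layer index where audio is first fused (assumed hyperparameter)

# Random toy FFN weights per layer; a real model would learn these.
W1 = [rng.normal(0, 0.1, (D, 4 * D)) for _ in range(N_LAYERS)]
W2 = [rng.normal(0, 0.1, (4 * D, D)) for _ in range(N_LAYERS)]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Scaled dot-product attention of queries q over keys/values kv."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

def forward(text_tokens, audio_tokens):
    h = text_tokens
    for i in range(N_LAYERS):
        # Delayed fusion: audio enters only from FUSE_LAYER onward, and only
        # as extra keys/values inside the attention submodule.
        kv = h if i < FUSE_LAYER else np.concatenate([h, audio_tokens], axis=0)
        h = h + attend(h, kv)               # attention + residual
        # The FFN operates on text positions only; audio representations
        # are never propagated through the FFN submodule.
        h = h + np.tanh(h @ W1[i]) @ W2[i]  # FFN + residual
    return h

text = rng.normal(size=(5, D))   # 5 text tokens
audio = rng.normal(size=(8, D))  # 8 audio-encoder output tokens
out = forward(text, audio)
print(out.shape)                 # text-length output; audio informs it via attention only
```

The design choice the sketch mirrors is the paper's finding (2): because audio tokens appear only as keys/values, they influence text representations through attention weights, yet no audio vector is ever passed through an FFN.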