🤖 AI Summary
To address shallow audio–text modality fusion in multimodal large language models (MLLMs), which limits semantic reasoning capability, this paper proposes an interleaved instruction-tuning framework that embeds audio tokens directly within textual prompt sequences to enable fine-grained cross-modal alignment. We introduce SHARD, the first benchmark dataset specifically designed for audio-based semantic reasoning, covering synonym and hypernym identification tasks. Experiments on the LTU model evaluate both zero-shot interleaved prompting and a small amount of interleaved fine-tuning. Results demonstrate a substantial improvement in semantic reasoning performance (+12.3%) alongside a modest degradation in audio classification accuracy (−3.1%), revealing a trade-off between deeper modality fusion and the model's original labeling ability. Our core contributions are: (1) the interleaved fine-tuning paradigm; (2) the SHARD benchmark; and (3) a systematic characterization of the reasoning–classification trade-off.
📝 Abstract
Standard training for multimodal large language models (MLLMs) concatenates non-textual information, such as vision or audio, with a text prompt. This approach may not encourage deep integration of the modalities, limiting the model's ability to leverage the underlying language model's reasoning capabilities. This work examines the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the text prompt rather than prepended to it. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct experiments on the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created benchmark for audio-based semantic reasoning focused on synonym and hypernym recognition. Our findings show that even zero-shot interleaved prompting improves performance on these reasoning tasks, and that a small amount of fine-tuning with interleaved training prompts improves results further, though at the expense of the MLLM's audio labeling ability.
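The difference between the two prompt-construction strategies can be sketched as follows. This is an illustrative toy example, not the LTU implementation: the `"<audio>"` placeholder and the bracketed token IDs are hypothetical stand-ins for the span of embeddings produced by an audio encoder.

```python
# Toy sketch (assumptions, not the LTU codebase): contrasting the standard
# concatenated prompt with an interleaved prompt. "[A1]"... stand in for
# audio tokens from the audio encoder; "<audio>" marks where the text
# refers to the sound.

def concatenated_prompt(audio_tokens, text):
    # Standard MLLM training: the whole audio block comes first,
    # followed by the full text prompt.
    return audio_tokens + text.split()

def interleaved_prompt(parts, audio_tokens):
    # Interleaved instruction tuning: audio tokens are spliced in at the
    # position where the text mentions the sound, encouraging
    # finer-grained cross-modal alignment.
    out = []
    for part in parts:
        if part == "<audio>":
            out.extend(audio_tokens)
        else:
            out.append(part)
    return out

audio = ["[A1]", "[A2]", "[A3]"]  # dummy audio token IDs
print(concatenated_prompt(audio, "Name a hypernym of this sound."))
print(interleaved_prompt(
    ["Name", "a", "hypernym", "of", "<audio>", "."], audio))
```

In the concatenated case the model sees the audio only as a prefix; in the interleaved case the audio tokens sit exactly where the question refers to them, which is the property the fine-tuning in this work exploits.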