🤖 AI Summary
This work addresses the problems of modality dependency imbalance and insufficient cross-modal synergy in Auditory Large Language Models (ALLMs), which can process audio and speech within a single model. The authors propose JASCO (Joint Audio-Speech Co-Reasoning), the first benchmark for joint audio-speech commonsense reasoning, requiring models to infer speaker actions from mixed audio containing both speech and environmental sounds. They formally define the “audio-speech co-reasoning” task, construct a scene-reasoning dataset, *What Are They Doing*, and introduce a modality dependency analysis framework that quantifies how much each modality contributes to a model's answers. Evaluating current ALLMs on JASCO, the analysis reveals, for the first time, their strong reliance on the speech modality and a critical bottleneck in understanding environmental audio.
📝 Abstract
In audio and speech processing, tasks usually focus on either the audio or the speech modality, even when both environmental sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, prompting further exploration of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing and strictly requires co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.
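The modality dependency analysis described above can be illustrated with a minimal sketch. This is not the paper's actual code; all names and numbers here are hypothetical. The assumed procedure: evaluate the same model three times, once with the full mixed clip, once with speech removed, and once with environmental audio removed, and treat the accuracy drop under each ablation as a dependence score for that modality.

```python
# Hypothetical sketch of a modality dependency analysis.
# Assumption: for each test question we have recorded whether the model
# answered correctly under three input conditions (both modalities,
# speech removed, environmental audio removed).

def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)

def modality_dependence(both, no_speech, no_audio):
    """Accuracy drop when a modality is ablated; a larger drop means
    the model relies more heavily on that modality."""
    acc_both = accuracy(both)
    return {
        "speech_dependence": acc_both - accuracy(no_speech),
        "audio_dependence": acc_both - accuracy(no_audio),
    }

# Toy example with 10 questions: accuracy collapses without speech
# but drops only mildly without environmental audio.
both      = [True] * 8 + [False] * 2   # 0.8 accuracy
no_speech = [True] * 3 + [False] * 7   # 0.3 accuracy
no_audio  = [True] * 7 + [False] * 3   # 0.7 accuracy

scores = modality_dependence(both, no_speech, no_audio)
print(scores)  # speech_dependence ≈ 0.5 > audio_dependence ≈ 0.1
```

A model that truly co-reasons over both modalities would show comparable, non-trivial drops under either ablation; a heavily skewed pair of scores, as in the toy numbers above, indicates the speech-reliance pattern the benchmark is designed to expose.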