🤖 AI Summary
This work identifies a systematic temporal misalignment in large audio-language models (LALMs): predicted timestamps are consistently earlier or later than the ground-truth event times, and the error grows with audio duration, reaching tens of seconds in long recordings. The bias is prevalent across datasets and models, and it varies with event type and position. To quantify it, we propose the Temporal Bias Index (TBI), which measures systematic misalignment in predicted event timings, and pair it with a visualization framework for characterizing LALMs' temporal alignment failures. Controlled experiments on timestamped datasets across multiple LALMs show that current models have a fundamental limitation in timestamp prediction. These findings motivate the development of temporally robust architectures for audio-language understanding.
📝 Abstract
Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length, accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), which measures systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.
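The abstract does not state the formula for the TBI, so the following is only an illustrative sketch of the general idea of separating systematic (signed) offset from overall error magnitude. The function names `signed_bias` and `mean_abs_error` and the sample timestamps are hypothetical and are not taken from the paper.

```python
from statistics import mean


def _check(predicted, ground_truth):
    # Both lists must be non-empty and pair up one-to-one.
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")


def signed_bias(predicted, ground_truth):
    """Mean signed offset in seconds (hypothetical stand-in, not the paper's TBI).

    Negative values mean predictions tend to be early, positive values late.
    """
    _check(predicted, ground_truth)
    return mean(p - t for p, t in zip(predicted, ground_truth))


def mean_abs_error(predicted, ground_truth):
    """Mean absolute offset in seconds, ignoring direction."""
    _check(predicted, ground_truth)
    return mean(abs(p - t) for p, t in zip(predicted, ground_truth))


# Made-up timestamps (seconds): every prediction is early.
early_pred = [10.0, 32.0, 58.0]
early_true = [12.0, 35.0, 63.0]

# Made-up timestamps: errors of equal size in opposite directions.
mixed_pred = [10.0, 14.0]
mixed_true = [12.0, 12.0]

print(signed_bias(early_pred, early_true), mean_abs_error(early_pred, early_true))
print(signed_bias(mixed_pred, mixed_true), mean_abs_error(mixed_pred, mixed_true))
```

In this sketch, a signed offset far from zero indicates a consistent early or late tendency, whereas a near-zero signed offset with a large absolute error indicates scattered rather than systematic misalignment.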