Not in Sync: Unveiling Temporal Bias in Audio Chat Models

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a systematic temporal misalignment in large audio-language models (LALMs): predicted timestamps consistently precede or lag ground-truth event onset times, with error accumulating linearly with audio duration—reaching tens of seconds in long recordings. This bias is pervasive across datasets, models, and event types. To quantify it, we propose the Temporal Bias Index (TBI) and introduce the first interpretable, visualization-based analytical framework to systematically characterize LALMs’ temporal alignment failures. Through controlled experiments and statistical analysis across multiple LALMs and benchmark datasets, we empirically demonstrate that current models face a fundamental bottleneck in modeling long-range temporal structure in audio-language understanding. Our findings establish a novel evaluation paradigm for audio–language cross-modal alignment and provide concrete directions for improving temporal fidelity in multimodal foundation models.

📝 Abstract
Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked "At which second does the lecturer introduce the key formula?", models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length, even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), measuring systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.
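The abstract describes the TBI as a measure of systematic misalignment between predicted and ground-truth event timings, but does not give its formula here. A minimal sketch, assuming TBI is simply the mean signed offset in seconds (the function name and this definition are illustrative, not taken from the paper):

```python
from statistics import mean

def temporal_bias_index(predicted, ground_truth):
    """Hypothetical TBI: mean signed error (seconds) between predicted
    and ground-truth event onsets. A signed mean preserves direction:
    positive means the model is systematically late, negative early."""
    if len(predicted) != len(ground_truth):
        raise ValueError("timestamp lists must align one-to-one")
    return mean(p - g for p, g in zip(predicted, ground_truth))

# A model that consistently fires ~2 s before the true onset
pred = [10.1, 33.0, 58.2]
gold = [12.0, 35.0, 60.5]
print(f"TBI = {temporal_bias_index(pred, gold):+.2f} s")  # → TBI = -2.07 s
```

A signed index is what distinguishes *bias* from ordinary error: mean absolute error would report the same value for a model that is randomly off by 2 s and one that is always 2 s early, while the signed mean separates the two cases.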
Problem

Research questions and friction points this paper is trying to address.

LALMs struggle to accurately locate event timestamps
Temporal bias increases significantly with audio length
Systematic misalignment varies across event types and positions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically measuring temporal bias with TBI index
Visualizing timestamp misalignment through an interpretable framework
Calling for temporally robust LALM architectures
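The summary's claim that error accumulates roughly linearly with audio duration could be checked with a least-squares fit of signed error against recording length. A sketch on synthetic data (the numbers and the ~-0.017 s/s drift are invented for illustration, not results from the paper):

```python
import numpy as np

# Synthetic signed timestamp errors (s) that drift with recording
# duration (s), mimicking the reported linear accumulation of bias.
durations = np.array([30.0, 60.0, 120.0, 300.0, 600.0])
signed_errors = np.array([-0.5, -1.1, -2.0, -5.2, -10.4])

# Degree-1 least-squares fit: error ≈ slope * duration + intercept.
slope, intercept = np.polyfit(durations, signed_errors, deg=1)
print(f"bias drifts ~{slope:.4f} s per second of audio")
```

A near-zero intercept with a clearly nonzero slope is the signature of the accumulation effect: the model is not merely offset by a constant, but drifts further out of sync the longer the recording runs.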