🤖 AI Summary
Video-LLMs exhibit significant limitations in temporal reasoning. To address this, we propose a systematic solution with three core components: (1) the first temporal-aware instruction-tuning dataset spanning five dimensions (sequence order, duration, causality, frequency, and relative timing); (2) a multi-task prompt-tuning framework that requires no additional temporal annotations; and (3) a novel temporal understanding benchmark designed to resist spatial/static shortcut biases and enforce alignment across all five dimensions. Our method combines disentangled modeling of the individual temporal dimensions with shortcut identification and filtering. Extensive experiments demonstrate an average 19.7% improvement across temporal reasoning, event ordering, and duration estimation tasks, substantially enhancing the robustness and fidelity of Video-LLMs' temporal modeling.
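To make the five dimensions concrete, here is a minimal sketch of what one instruction-tuning sample per dimension might look like. The field names, questions, and answers are illustrative assumptions, not the paper's actual schema or data:

```python
# Hypothetical instruction-tuning samples, one per temporal dimension named
# in the summary. The schema (dimension/instruction/answer) is an assumption.
temporal_samples = [
    {"dimension": "sequence order",
     "instruction": "Which happens first: pouring the water or stirring the cup?",
     "answer": "Pouring the water."},
    {"dimension": "duration",
     "instruction": "Roughly how long does the person hold the plank position?",
     "answer": "About ten seconds."},
    {"dimension": "causality",
     "instruction": "Why does the stack of blocks fall over?",
     "answer": "Because the child bumps the table."},
    {"dimension": "frequency",
     "instruction": "How many times does the dog jump over the bar?",
     "answer": "Three times."},
    {"dimension": "relative timing",
     "instruction": "Does the applause start before or after the speaker bows?",
     "answer": "After the speaker bows."},
]
```

Samples like these require attending to frame order, elapsed time, or event counts, so a model cannot answer them from appearance cues in a single frame.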
📝 Abstract
Video large language models have achieved remarkable performance on tasks such as video question answering; however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset focused on enhancing temporal comprehension across five key dimensions. To reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.
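One common way to operationalize shortcut filtering is to probe whether a question is still answerable when temporal information is destroyed. The sketch below illustrates that idea under stated assumptions; the `model.answer(frames, question)` interface and the single-probe pass/fail criterion are hypothetical, not the paper's actual procedure:

```python
# A minimal sketch of spatial/static shortcut filtering. A benchmark item is
# flagged as a shortcut if a probe model answers it correctly from a single
# frame or from temporally shuffled frames, i.e. without using frame order.
import random

def is_shortcut(model, frames, question, gold_answer):
    """Return True if the question is answerable without correct temporal order."""
    single_frame = [random.choice(frames)]          # static-content probe
    shuffled = random.sample(frames, len(frames))   # order-destroyed probe
    for probe in (single_frame, shuffled):
        if model.answer(probe, question) == gold_answer:
            return True  # temporal signal was not needed to get this right
    return False

# Keep only items that genuinely require temporal reasoning (hypothetical usage):
# filtered = [q for q in items if not is_shortcut(model, q.frames, q.text, q.answer)]
```

Filtering of this kind keeps the benchmark from rewarding models that exploit static scene or object cues instead of actual temporal understanding.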