🤖 AI Summary
Video-LLMs exhibit significant limitations in temporal reasoning. To address this, we propose a systematic solution with three core components: (1) the first temporal-aware instruction-tuning dataset spanning five dimensions (sequence order, duration, causality, frequency, and relative timing); (2) a multi-task prompt-tuning framework that requires no additional temporal annotations; and (3) a novel temporal understanding benchmark designed to resist spatial/static shortcut biases and enforce alignment across all five dimensions. Our method combines disentangled modeling of the individual temporal dimensions with shortcut identification and filtering. Extensive experiments demonstrate an average 19.7% improvement across temporal reasoning, event ordering, and duration estimation tasks, substantially enhancing the robustness and fidelity of Video-LLMs' temporal modeling.
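To make the five dimensions concrete, here is a minimal sketch of what one instruction-tuning sample per dimension might look like. The field names, questions, and answers are illustrative assumptions, not the paper's actual schema or data:

```python
# Hypothetical instruction-tuning samples, one per temporal dimension named
# in the summary. The schema (dimension/instruction/answer) is an assumption.
temporal_samples = [
    {"dimension": "sequence order",
     "instruction": "Which happens first: pouring the water or stirring the cup?",
     "answer": "Pouring the water."},
    {"dimension": "duration",
     "instruction": "Roughly how long does the person hold the plank position?",
     "answer": "About ten seconds."},
    {"dimension": "causality",
     "instruction": "Why does the stack of blocks fall over?",
     "answer": "Because the child bumps the table."},
    {"dimension": "frequency",
     "instruction": "How many times does the dog jump over the bar?",
     "answer": "Three times."},
    {"dimension": "relative timing",
     "instruction": "Does the applause start before or after the speaker bows?",
     "answer": "After the speaker bows."},
]
```

Samples like these require attending to frame order, elapsed time, or event counts, so a model cannot answer them from appearance cues in a single frame.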
📝 Abstract
Video large language models have achieved remarkable performance on tasks such as video question answering; however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset focused on enhancing temporal comprehension across five key dimensions. To reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.
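One common way to operationalize shortcut filtering is to probe whether a question is still answerable when temporal information is destroyed. The sketch below illustrates that idea under stated assumptions; the `model.answer(frames, question)` interface and the single-probe pass/fail criterion are hypothetical, not the paper's actual procedure:

```python
# A minimal sketch of spatial/static shortcut filtering. A benchmark item is
# flagged as a shortcut if a probe model answers it correctly from a single
# frame or from temporally shuffled frames, i.e. without using frame order.
import random

def is_shortcut(model, frames, question, gold_answer):
    """Return True if the question is answerable without correct temporal order."""
    single_frame = [random.choice(frames)]          # static-content probe
    shuffled = random.sample(frames, len(frames))   # order-destroyed probe
    for probe in (single_frame, shuffled):
        if model.answer(probe, question) == gold_answer:
            return True  # temporal signal was not needed to get this right
    return False

# Keep only items that genuinely require temporal reasoning (hypothetical usage):
# filtered = [q for q in items if not is_shortcut(model, q.frames, q.text, q.answer)]
```

Filtering of this kind keeps the benchmark from rewarding models that exploit static scene or object cues instead of actual temporal understanding.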