TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs

📅 2025-03-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video-LLMs exhibit significant limitations in temporal reasoning tasks. To address this, we propose a systematic solution comprising three core components: (1) the first five-dimensional temporal-aware instruction-tuning dataset—covering sequence order, duration, causality, frequency, and relative timing; (2) a multi-task prompt-tuning framework that requires no additional temporal annotations; and (3) a novel temporal understanding benchmark designed to be robust against spatial/static shortcut biases and enforce multi-dimensional alignment. Our method integrates temporal dimension disentanglement modeling with shortcut identification and filtering. Extensive experiments demonstrate that our approach achieves an average 19.7% improvement across temporal reasoning, event ordering, and duration estimation tasks, substantially enhancing the robustness and fidelity of Video-LLMs’ temporal modeling capabilities.

Technology Category

Application Category

📝 Abstract
Video large language models have achieved remarkable performance in tasks such as video question answering, however, their temporal understanding remains suboptimal. To address this limitation, we curate a dedicated instruction fine-tuning dataset that focuses on enhancing temporal comprehension across five key dimensions. In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. Furthermore, we develop a novel benchmark for temporal-sensitive video understanding that not only fills the gaps in dimension coverage left by existing benchmarks but also rigorously filters out potential shortcuts, ensuring a more accurate evaluation. Extensive experimental results demonstrate that our approach significantly enhances the temporal understanding of video-LLMs while avoiding reliance on shortcuts.
Problem

Research questions and friction points this paper is trying to address.

Enhance temporal understanding in video large language models.
Reduce reliance on costly temporal annotations in video tasks.
Develop a benchmark for accurate temporal-sensitive video understanding.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-sensitive instruction fine-tuning dataset
Multi-task prompt fine-tuning without annotations
Novel benchmark for temporal video understanding
Yunxiao Wang
Yunxiao Wang
Phd student, Shandong University
Multimedia ComputingAffective ComputingInformation Retrieval
M
Meng Liu
Shandong Jianzhu University
Rui Shao
Rui Shao
Professor, Harbin Institute of Technology (Shenzhen)
Computer VisionMultimodal LLMEmbodied AI
H
Haoyu Zhang
Harbin Institute of Technology
Bin Wen
Bin Wen
快手
MLLM
F
Fan Yang
Kuaishou Technology
T
Tingting Gao
Kuaishou Technology
D
Di Zhang
Kuaishou Technology
L
Liqiang Nie
Harbin Institute of Technology