AI Summary
Existing Video Large Language Models often struggle to model temporal dynamics accurately under sparse frame sampling because intermediate frames are missing. To address this limitation, this work proposes ViKey, a training-free framework that treats frame indices as dictionary-like keys through visual prompting and a lightweight Keyword-Frame Mapping (KFM) module, enabling efficient temporal anchoring and frame-level referencing. Without any additional training, ViKey achieves performance on multiple benchmarks that closely matches that of full-frame inputs while using only 20% of the video frames, substantially enhancing the model's capacity for temporal reasoning about event progression.
Abstract
Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
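To make the two ideas concrete, here is a minimal Python sketch of how frame indices can serve as dictionary-like keys for temporal anchoring. This is an illustration only, not the authors' implementation: the function names `sparse_sample` and `build_kfm` are hypothetical, frames are stand-in strings, and the real system additionally renders each index onto the frame image as a visual prompt.

```python
# Hypothetical sketch of ViKey's two components (not the official code):
# 1) sparse sampling that keeps each frame's original ordinal index,
# 2) a dictionary-like Keyword-Frame Mapping (KFM) from index to frame,
#    used as an explicit temporal anchor at inference time.

def sparse_sample(num_frames, ratio=0.2):
    """Pick roughly `ratio` of the frame indices, preserving ordinals."""
    step = max(1, round(1 / ratio))
    return list(range(0, num_frames, step))

def build_kfm(frames, sampled_indices):
    """Map each original frame index (the 'key') to its frame content.

    In the actual framework the index would also be drawn onto the frame
    as a visual prompt so the model can read it directly."""
    return {i: frames[i] for i in sampled_indices}

# Toy usage: 50 "frames" represented as strings.
frames = [f"frame_{i}" for i in range(50)]
idx = sparse_sample(len(frames), ratio=0.2)   # 10 of 50 frames kept
kfm = build_kfm(frames, idx)

# A textual cue that mentions frame 35 can now be anchored directly,
# instead of the model guessing its position in the sparse sequence:
anchor = kfm.get(35)   # "frame_35", since index 35 was sampled
```

The dictionary lookup is the point: because each sampled frame carries its original index as a key, a textual reference such as "frame 35" resolves to a specific frame even though 80% of the sequence was dropped.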