How Important are Videos for Training Video LLMs?

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study challenges the prevailing assumption that video data is essential for training Video Large Language Models (Video LLMs), investigating whether image-text data alone can support robust temporal reasoning. Method: The authors systematically evaluate image-only trained LLMs on the TVBench temporal reasoning benchmark and propose a lightweight finetuning scheme using annotated, temporally ordered image sequences, requiring only a small number of frames per sequence. Contribution/Results: Experiments show that image-only models already perform significantly above chance on temporal reasoning tasks, while full video finetuning yields only marginal gains. The proposed image-sequence finetuning achieves performance comparable to, and occasionally exceeding, that of full video finetuning on temporal reasoning benchmarks. These findings challenge the assumed necessity of video data for temporal modeling and suggest that temporally ordered image sequences already convey much of the relevant sequential information, pointing toward a more efficient, cost-effective paradigm for Video LLM training.

📝 Abstract
Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and occasionally higher than, what is achieved by video-trained LLMs. This suggests suboptimal utilization of rich temporal features found in real video by current models. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
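The finetuning scheme described in the abstract (sequences of annotated images paired with questions targeting temporal capabilities) can be illustrated with a small sketch. This is a hypothetical reconstruction, not the paper's code: the function name, sample fields, and question template are illustrative assumptions about how such temporal-ordering QA samples might be built.

```python
# Hypothetical sketch of building a temporal-ordering QA sample from a
# short, temporally ordered list of annotated frames. Names and the
# question template are illustrative, not taken from the paper.
import random

def make_temporal_qa(frames, seed=0):
    """frames: list of (image_path, caption) tuples in temporal order."""
    rng = random.Random(seed)
    # pick two distinct frame indices; sorting keeps temporal order
    i, j = sorted(rng.sample(range(len(frames)), 2))
    first, later = frames[i][1], frames[j][1]
    question = (f"In this image sequence, which event happens first: "
                f"'{first}' or '{later}'?")
    return {
        "images": [path for path, _ in frames],  # kept in temporal order
        "question": question,
        "answer": first,  # the temporally earlier caption
    }

sample = make_temporal_qa([
    ("f0.jpg", "a hand reaches for a cup"),
    ("f1.jpg", "the cup is lifted"),
    ("f2.jpg", "the cup is set down"),
])
```

Under this sketch, each training example needs only a handful of captioned frames rather than densely sampled video, which is the cost advantage the paper argues for.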
Problem

Research questions and friction points this paper is trying to address.

Assessing video importance for training Video LLMs
Evaluating temporal reasoning in image-trained Video LLMs
Identifying inefficiencies in current video training schemes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-only trained Video LLMs perform surprisingly well on temporal reasoning
Image-trained LongVU models score significantly above chance on TVBench
Simple finetuning on annotated image sequences approaches video-trained performance