🤖 AI Summary
This work investigates how video fine-tuning enhances the temporal reasoning capabilities of multimodal large language models (MLLMs), asking whether and how such models implicitly acquire inter-frame reasoning ability. To probe this, we propose visual Chain-of-Thought (vCoT), an explicit, fine-grained temporal reasoning framework that generates stepwise event descriptions between frames to surface latent temporal modeling capacity. Experiments show that vCoT significantly improves image-only MLLMs on long-video question answering, confirming that it compensates for their temporal reasoning deficits. Video-fine-tuned models, by contrast, already possess strong implicit inter-frame modeling capacity, and they transfer this ability to static relational reasoning tasks, outperforming image-only baselines. This study systematically disentangles and empirically validates the interaction between video fine-tuning and explicit temporal reasoning, informing how dynamic visual understanding can be built into MLLMs.
📝 Abstract
Multimodal large language models (MLLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video fine-tuning brings to MLLMs. We propose visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only MLLMs with their video-fine-tuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-fine-tuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image-only baselines on relational visual reasoning tasks.
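The vCoT procedure described above can be sketched as a simple prompting loop: for each pair of consecutive frames, an image-only model is asked to describe the transitional event, and the resulting stepwise chain is prepended to the question. This is a minimal illustration only; `query_mllm` is a hypothetical stand-in for whatever image-MLLM API is actually used, and the prompt wording is not taken from the paper.

```python
def query_mllm(prompt: str, images: tuple[str, str]) -> str:
    # Hypothetical stub: a real implementation would send the prompt and
    # the two frame images to an image-only MLLM and return its reply.
    return f"[event between {images[0]} and {images[1]}]"


def vcot_transitions(frames: list[str]) -> list[str]:
    """Generate one transitional event description per consecutive frame pair."""
    prompt = ("Describe the event that occurs between these two frames "
              "as one step in the video's timeline.")
    # Pair frame i with frame i+1, yielding len(frames) - 1 descriptions.
    return [query_mllm(prompt, (a, b)) for a, b in zip(frames, frames[1:])]


def build_vcot_prompt(frames: list[str], question: str) -> str:
    """Prepend the stepwise transition chain to the final QA prompt."""
    steps = vcot_transitions(frames)
    chain = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))
    return f"Temporal chain of events:\n{chain}\n\nQuestion: {question}"


qa_prompt = build_vcot_prompt(["frame_0", "frame_1", "frame_2"],
                              "What happened in the video?")
```

Under this reading, the paper's comparison amounts to running the same QA with and without the `Temporal chain of events` prefix, for both image-only and video-fine-tuned models.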