🤖 AI Summary
This work investigates how video fine-tuning enhances the temporal reasoning capabilities of multimodal large language models (MLLMs), asking whether and how such models implicitly acquire inter-frame reasoning ability. To probe this, we propose visual Chain-of-Thought (vCoT), an explicit, fine-grained temporal reasoning framework that generates stepwise event descriptions between frames to surface latent temporal modeling capacity. Experiments show that vCoT significantly improves image-only MLLMs on long-video question answering, confirming that it compensates for their temporal reasoning deficits. Video-fine-tuned models, by contrast, already possess strong implicit inter-frame modeling capacity, and they transfer this ability to static relational reasoning tasks, outperforming image-only baselines. This study systematically disentangles and empirically validates the interaction between video fine-tuning and explicit temporal reasoning, informing how dynamic visual understanding can be built into MLLMs.
📝 Abstract
Multimodal large language models (MLLMs) have made rapid progress in visual understanding, yet their extension from images to videos often reduces to a naive concatenation of frame tokens. In this work, we investigate what video fine-tuning brings to MLLMs. We propose visual Chain-of-Thought (vCoT), an explicit reasoning process that generates transitional event descriptions between consecutive frames. Using vCoT, we systematically compare image-only MLLMs with their video-fine-tuned counterparts, both with and without access to these transitional cues. Our experiments show that vCoT significantly improves the performance of image-only models on long-form video question answering, while yielding only marginal gains for video-fine-tuned models. This suggests that the latter already capture frame-to-frame transitions implicitly. Moreover, we find that video models transfer this temporal reasoning ability to purely static settings, outperforming image-only baselines on relational visual reasoning tasks.
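The vCoT procedure described above can be sketched as a simple prompting loop: for each pair of consecutive frames, an image-only model is asked to describe the transitional event, and the resulting stepwise chain is prepended to the question. This is a minimal illustration only; `query_mllm` is a hypothetical stand-in for whatever image-MLLM API is actually used, and the prompt wording is not taken from the paper.

```python
def query_mllm(prompt: str, images: tuple[str, str]) -> str:
    # Hypothetical stub: a real implementation would send the prompt and
    # the two frame images to an image-only MLLM and return its reply.
    return f"[event between {images[0]} and {images[1]}]"


def vcot_transitions(frames: list[str]) -> list[str]:
    """Generate one transitional event description per consecutive frame pair."""
    prompt = ("Describe the event that occurs between these two frames "
              "as one step in the video's timeline.")
    # Pair frame i with frame i+1, yielding len(frames) - 1 descriptions.
    return [query_mllm(prompt, (a, b)) for a, b in zip(frames, frames[1:])]


def build_vcot_prompt(frames: list[str], question: str) -> str:
    """Prepend the stepwise transition chain to the final QA prompt."""
    steps = vcot_transitions(frames)
    chain = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, start=1))
    return f"Temporal chain of events:\n{chain}\n\nQuestion: {question}"


qa_prompt = build_vcot_prompt(["frame_0", "frame_1", "frame_2"],
                              "What happened in the video?")
```

Under this reading, the paper's comparison amounts to running the same QA with and without the `Temporal chain of events` prefix, for both image-only and video-fine-tuned models.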