Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although video-based supervised fine-tuning (Video-SFT) enhances the video understanding capabilities of multimodal large language models, it often degrades their performance on static image comprehension, revealing an inherent trade-off in spatiotemporal modeling. This work systematically investigates the phenomenon and finds that increasing the number of input frames improves video performance but fails to recover or enhance image understanding. To address this, the authors propose an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts during fine-tuning based on the instruction. This approach mitigates the conflict between image and video performance across diverse model architectures, parameter scales, and frame sampling configurations, underscoring the importance of preserving spatial reasoning capabilities in multimodal learning.

📝 Abstract
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
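The paper does not detail how the Hybrid-Frame strategy decides frame budgets, but the core idea, allocating more frames to temporally phrased instructions and fewer to spatially focused ones, can be sketched as below. The keyword heuristic, the budget values, and both function names are illustrative assumptions, not the authors' implementation.

```python
import re

# Assumed heuristic: instructions mentioning temporal cues get a larger
# frame budget; everything else is treated as spatially focused.
TEMPORAL_CUES = re.compile(
    r"\b(before|after|then|sequence|order|motion|happen(?:s|ed|ing)?|change|first|next)\b",
    re.IGNORECASE,
)

def allocate_frames(instruction: str, lo: int = 4, hi: int = 32) -> int:
    """Return a per-sample frame budget based on the instruction text."""
    return hi if TEMPORAL_CUES.search(instruction) else lo

def sample_frame_indices(num_video_frames: int, budget: int) -> list[int]:
    """Uniformly sample `budget` frame indices from a video."""
    budget = min(budget, num_video_frames)
    step = num_video_frames / budget
    return [int(i * step) for i in range(budget)]

# A temporal question receives the larger budget; a spatial one the smaller.
temporal_idx = sample_frame_indices(120, allocate_frames("What happens after the dog jumps?"))
spatial_idx = sample_frame_indices(120, allocate_frames("What color is the car?"))
```

In practice the routing signal could come from a learned classifier rather than keywords; the sketch only illustrates the adaptive-budget mechanism the abstract describes.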
Problem

Research questions and friction points this paper is trying to address.

video fine-tuning
spatial understanding
temporal understanding
multimodal large language models
image-video trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-SFT
spatial-temporal trade-off
Hybrid-Frame strategy
multimodal large language models
frame sampling