🤖 AI Summary
To address the limitations of Video-LLMs in fine-grained spatiotemporal understanding, particularly in modeling dynamic changes and reasoning about local details, this paper proposes SF²T, a self-supervised video snippet fine-tuning paradigm. SF²T requires no human annotations; instead, it leverages the inherent spatiotemporal consistency within videos to construct self-supervised tasks, jointly optimizing the video encoder and the multimodal large language model. The contributions are threefold: (1) the first lightweight SF²T fine-tuning framework; (2) FineVidBench, the first benchmark supporting dual-granularity evaluation at both the scene and snippet levels; and (3) a hierarchical evaluation protocol. Experiments demonstrate that SF²T achieves an average accuracy improvement of 12.7% on FineVidBench, significantly enhancing the model's capacity to perceive and interpret dynamic fine-grained visual details.
📝 Abstract
Video-based Large Language Models (Video-LLMs) have advanced substantially in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at providing overall descriptions of videos, they struggle with fine-grained understanding, particularly in aspects such as visual dynamics and queries about video details. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel, effortless fine-tuning method that employs the rich inherent characteristics of videos for training, unlocking more fine-grained understanding in Video-LLMs. Moreover, it relieves researchers from labor-intensive annotation and smartly circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) a novel benchmark dataset, FineVidBench, for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assess multiple models and validate the effectiveness of SF$^2$T on them. Experimental results reveal that our approach improves their ability to capture and interpret spatiotemporal details.