RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

📅 2025-11-19

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

Vision-language models (VLMs) exhibit limited performance in low-resource video classification due to the “reasoning gap” between sparse domain-specific spatiotemporal content and abstract class labels. To address this, we propose a two-stage self-improving fine-tuning framework. In Stage I, prompt engineering enables VLMs to autonomously generate domain-specific video reasoning texts—serving as interpretable, intermediate supervision signals. In Stage II, we jointly optimize self-supervised rationale-guided fine-tuning and standard supervised fine-tuning. This is the first work to integrate self-generated textual rationales into VLM-based video understanding without requiring additional human annotations, thereby significantly enhancing the model’s capacity for domain-specific spatiotemporal reasoning. Extensive experiments on multiple low-data video benchmarks demonstrate consistent superiority over conventional supervised fine-tuning, validating both effectiveness and generalizability under data-scarce conditions.

Technology Category

Application Category

📝 Abstract

Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical extit{rationale gap}, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.

Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with domain-specific video classification under limited data

Rationale gap exists between spatio-temporal content and classification labels

Method improves VLM adaptation without requiring new annotation resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates rationales to bridge semantic gaps

Fine-tunes models using self-generated textual rationales

Uses two-stage fine-tuning for domain adaptation

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding