RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

📅 2025-11-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) exhibit limited performance in low-resource video classification due to the “reasoning gap” between sparse domain-specific spatiotemporal content and abstract class labels. To address this, we propose a two-stage self-improving fine-tuning framework. In Stage I, prompt engineering enables VLMs to autonomously generate domain-specific video reasoning texts—serving as interpretable, intermediate supervision signals. In Stage II, we jointly optimize self-supervised rationale-guided fine-tuning and standard supervised fine-tuning. This is the first work to integrate self-generated textual rationales into VLM-based video understanding without requiring additional human annotations, thereby significantly enhancing the model’s capacity for domain-specific spatiotemporal reasoning. Extensive experiments on multiple low-data video benchmarks demonstrate consistent superiority over conventional supervised fine-tuning, validating both effectiveness and generalizability under data-scarce conditions.

Technology Category

Application Category

📝 Abstract
Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical extit{rationale gap}, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with domain-specific video classification under limited data
Rationale gap exists between spatio-temporal content and classification labels
Method improves VLM adaptation without requiring new annotation resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates rationales to bridge semantic gaps
Fine-tunes models using self-generated textual rationales
Uses two-stage fine-tuning for domain adaptation
🔎 Similar Papers
No similar papers found.
Meilong Xu
Meilong Xu
Stony Brook University
Machine LearningComputer VisionTopological Data Analysis
D
Di Fu
ByteDance Inc., Seattle, WA, USA / Sydney, Australia
J
Jiaxing Zhang
ByteDance Inc., Seattle, WA, USA / Sydney, Australia
G
Gong Yu
ByteDance Inc., Seattle, WA, USA / Sydney, Australia
J
Jiayu Zheng
ByteDance Inc., Seattle, WA, USA / Sydney, Australia
X
Xiaoling Hu
Harvard Medical School, Boston, MA, USA
D
Dongdi Zhao
ByteDance Inc., Seattle, WA, USA / Sydney, Australia
F
Feiyang Li
ByteDance Inc., Seattle, WA, USA / Sydney, Australia
C
Chao Chen
Stony Brook University, Stony Brook, NY , USA
Y
Yong Cao
ByteDance Inc., Seattle, WA, USA / Sydney, Australia