🤖 AI Summary
This work addresses the inefficiencies in multimodal instruction tuning caused by the heterogeneous utility of samples in mixed image-video data pools. To this end, the authors propose Goal-Driven Data Optimization (GDO), a framework that introduces a task-oriented data selection mechanism for multimodal training. Using six sample descriptors, GDO dynamically constructs high-efficiency subsets tailored to different optimization objectives, such as MinLoss, Diverse, and Temp. Under a fixed training protocol, the method outperforms the Uni-10x baseline trained on 512k samples across MVBench, VideoMME, MLVU, and LVBench using only 26.6k–35.4k samples, with accuracy gains of up to 3.08 percentage points. This demonstrates a significant improvement in both training efficiency and long-form video understanding capability.
📝 Abstract
Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1× training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding behavior. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.
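The selection mechanism described above can be sketched in a few lines: each sample carries a descriptor vector, and each goal maps to a scoring or selection rule over those descriptors. This is a minimal illustrative sketch, not the paper's implementation; the descriptor names below are hypothetical placeholders (the abstract does not list the six descriptors), and the per-goal rules (lowest loss for MinLoss, farthest-point selection for Diverse, temporal-score ranking for Temp/Temp+) are plausible readings of the goal names, not confirmed details.

```python
import math

# Hypothetical descriptor names; the actual six descriptors used by GDO
# are not specified in the abstract, so these are illustrative placeholders.
DESCRIPTORS = ["loss", "temporal_score", "visual_diversity",
               "caption_length", "frame_count", "difficulty"]

def vec(sample):
    """Numeric descriptor vector for one sample (a dict of descriptor -> float)."""
    return [sample[d] for d in DESCRIPTORS]

def select_subset(samples, goal, k):
    """Pick a k-sample subset under a goal-specific rule (illustrative sketch)."""
    if goal == "MinLoss":
        # keep the k samples with the lowest per-sample loss
        return sorted(samples, key=lambda s: s["loss"])[:k]
    if goal in ("Temp", "Temp+"):
        # emphasize temporally rich (video-heavy) samples
        return sorted(samples, key=lambda s: -s["temporal_score"])[:k]
    if goal == "Diverse":
        # greedy farthest-point selection in descriptor space:
        # repeatedly add the sample farthest from everything chosen so far
        chosen, rest = [samples[0]], list(samples[1:])
        while len(chosen) < k and rest:
            best = max(rest, key=lambda s: min(math.dist(vec(s), vec(c))
                                               for c in chosen))
            chosen.append(best)
            rest.remove(best)
        return chosen
    raise ValueError(f"unknown goal: {goal}")
```

Under this framing, switching optimization goals only swaps the scoring rule; the descriptor computation and the fixed training protocol stay unchanged, which is what makes the subsets comparable under one recipe.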