Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of multimodal instruction tuning caused by the highly uneven utility of samples in mixed image-video data pools. To this end, the authors propose Goal-Driven Data Optimization (GDO), a framework that introduces a task-oriented data selection mechanism for multimodal training. Using six sample descriptors, GDO constructs high-efficiency training subsets tailored to different optimization objectives, such as MinLoss, Diverse, and Temp. Under a fixed training protocol, the method matches or exceeds the Uni-10x baseline (trained on 512k samples) using only 26.6k–35.4k samples across MVBench, VideoMME, MLVU, and LVBench, with accuracy gains of up to 3.08 percentage points. This demonstrates a significant improvement in both training efficiency and long-form video understanding.

📝 Abstract
Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1× training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.
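The abstract describes computing per-sample descriptors and then selecting a subset according to a chosen optimization goal. The sketch below illustrates that general pattern; the descriptor fields (`loss`, `embedding`, `temporal_score`) and the selection rules are illustrative assumptions, not the paper's actual six descriptors or scoring functions.

```python
import random

def make_pool(n, seed=0):
    """Build a toy candidate pool with hypothetical per-sample descriptors."""
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "loss": rng.uniform(0.1, 3.0),          # proxy for sample difficulty
            "embedding": [rng.random() for _ in range(4)],  # proxy feature vector
            "temporal_score": rng.random(),          # proxy for temporal richness
        }
        for i in range(n)
    ]

def _dist(a, b):
    # Euclidean distance between two descriptor vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_subset(pool, goal, k):
    """Pick a k-sample training subset according to the optimization goal."""
    if goal == "MinLoss":
        # prefer samples with the lowest loss
        return sorted(pool, key=lambda s: s["loss"])[:k]
    if goal == "Temp":
        # prefer samples with the strongest temporal signal
        return sorted(pool, key=lambda s: -s["temporal_score"])[:k]
    if goal == "Diverse":
        # greedy farthest-point selection in descriptor space:
        # repeatedly add the sample farthest from everything chosen so far
        chosen = [pool[0]]
        rest = list(pool[1:])
        while len(chosen) < k and rest:
            far = max(
                rest,
                key=lambda s: min(_dist(s["embedding"], c["embedding"]) for c in chosen),
            )
            chosen.append(far)
            rest.remove(far)
        return chosen
    raise ValueError(f"unknown goal: {goal}")
```

Each goal reduces to a different ranking or coverage criterion over the same descriptor table, which is why a single fixed training recipe can be reused across goals with only the subset swapped out.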
Problem

Research questions and friction points this paper is trying to address.

Multimodal instruction tuning
Data efficiency
Training convergence
Sample selection
Compute inefficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal-Driven Data Optimization
Multimodal Instruction Tuning
Data Efficiency
Sample Selection
Temporal Understanding