AI Summary
This work addresses the quality-quantity trade-off in synthetic data for training large multimodal models by proposing the One-Step-Train framework, which introduces optimization theory into data selection for the first time. The method uses a lightweight proxy model to simulate a single-step gradient update, which gives an efficient estimate of each sample's marginal utility. Building on incremental optimization and Pareto optimality, it selects high-value data subsets efficiently and interpretably, and it can identify harmful samples. Experiments on the Qwen model series show that training on only the top-20 selected subset outperforms LLM-as-a-Judge by 5.6 points and full supervised fine-tuning (Full-SFT) by 8.8 points; the top-50 subset reduces training costs by 43% while improving performance by 1.8 points.
Abstract
The scaling of Large Multimodal Models (LMMs) is constrained by the quality-quantity trade-off inherent in synthetic data. Previous approaches, such as LLM-as-a-Judge, have proven effective in addressing this trade-off but suffer from prohibitive computational costs and a lack of interpretability. To bridge this gap, we propose One-Step-Train (OST), a framework that reformulates data selection as a utility ranking problem grounded in incremental optimization. Instead of relying on semantic heuristics, OST estimates the marginal utility of each sample via a simulated single-step update on a lightweight proxy. Experiments on the Qwen series across multimodal mathematical reasoning benchmarks demonstrate that OST achieves Pareto-optimal efficiency. By selecting the top-50 subset, OST reduces training costs by 43% (and total time consumption by 17%) while surpassing the strong LLM-as-a-Judge baseline by 1.8 points. Furthermore, under a fixed compute budget, our method using only the top-20 subset achieves a 5.6-point gain over LLM-as-a-Judge, improves upon heuristic scoring baselines such as DEITA, and outperforms the Full-SFT baseline by 8.8 points. Notably, while Full-SFT suffers from performance degradation due to noise, our optimization-grounded approach effectively identifies toxic samples, reversing the negative transfer frequently observed in complex reasoning tasks.
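The idea of ranking samples by a simulated single-step update can be illustrated with a small, self-contained sketch. This is not the authors' implementation. It assumes a logistic-regression proxy, defines a sample's utility as the drop in held-out loss after one gradient step on that sample alone, and treats subset selection as keeping a top fraction of the ranking. The function names (`one_step_utilities`, `select_top`), the learning rate, and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_loss(w, X, y):
    """Mean logistic loss of the linear proxy w on the set (X, y)."""
    total = 0.0
    for x, t in zip(X, y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y)


def sample_grad(w, x, t):
    """Gradient of the logistic loss on a single sample (x, t)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - t) * xi for xi in x]


def one_step_utilities(w, X_train, y_train, X_val, y_val, lr=0.1):
    """Assumed utility: held-out loss before minus after one step on a sample."""
    base = mean_loss(w, X_val, y_val)
    utilities = []
    for x, t in zip(X_train, y_train):
        g = sample_grad(w, x, t)
        w_new = [wi - lr * gi for wi, gi in zip(w, g)]  # simulated update only
        utilities.append(base - mean_loss(w_new, X_val, y_val))
    return utilities


def select_top(utilities, fraction):
    """Indices of the highest-utility samples, best first."""
    k = max(1, round(fraction * len(utilities)))
    ranked = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    return ranked[:k]


# Toy data (hypothetical): labels come from a fixed linear rule, and the first
# ten training labels are flipped to play the role of harmful samples.
rng = random.Random(0)
w_true = [1.0, -2.0, 0.5]


def make_set(n):
    X = [[rng.gauss(0, 1) for _ in w_true] for _ in range(n)]
    y = [1 if sum(a * b for a, b in zip(w_true, x)) > 0 else 0 for x in X]
    return X, y


X_train, y_train = make_set(100)
X_val, y_val = make_set(200)
flipped = set(range(10))
for i in flipped:
    y_train[i] = 1 - y_train[i]

utilities = one_step_utilities([0.0] * len(w_true), X_train, y_train, X_val, y_val)
selected = select_top(utilities, 0.2)
```

In this toy setting, samples whose simulated step raises the held-out loss receive negative utility, which is one simple way such a score could flag harmful data. The proxy weights are never actually updated, so each sample is scored independently from the same starting point.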