🤖 AI Summary
Existing data selection methods struggle to balance multidimensional performance metrics in end-to-end autonomous driving, leading to suboptimal training efficiency. This work proposes MOSAIC, a framework that introduces a scaling-aware data mixture optimization mechanism. MOSAIC partitions the data into distinct domains, models each domain's neural scaling laws with respect to a comprehensive driving compliance metric, the Extended Predictive Driver Model Score (EPDMS), and iteratively optimizes the mixture ratios to dynamically balance domain influences. Experimental results demonstrate that MOSAIC achieves significant improvements over multiple baselines on EPDMS using only 20% of the training data, substantially enhancing both data utilization efficiency and model performance.
📝 Abstract
Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models, and correspondingly the training data, must address the different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80% less data.
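The three steps the abstract names (partition, fit scaling laws, iterate on the mixture) can be sketched as a greedy loop over per-domain scaling-law fits. The sketch below is purely illustrative: it assumes each domain's metric follows a saturating power law of the form score(n) = c - a·n^(-b), and all function names, domain labels, and the fixed-step greedy rule are assumptions for this example, not details taken from the paper.

```python
import numpy as np

def fit_scaling_law(sizes, scores, ceiling=1.0):
    """Fit score(n) = ceiling - a * n**(-b) by linear regression in
    log-log space on the residual gap (ceiling - score)."""
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_gap = np.log(ceiling - np.asarray(scores, dtype=float))
    slope, intercept = np.polyfit(log_n, log_gap, 1)  # log_gap = intercept + slope*log_n
    return np.exp(intercept), -slope  # (a, b)

def predicted_score(n, a, b, ceiling=1.0):
    """Predicted metric value at n samples under the fitted power law."""
    return ceiling - a * n ** (-b)

def greedy_mixture(domain_laws, budget, step):
    """Iteratively add `step` samples to whichever domain's scaling law
    predicts the largest marginal gain in the aggregate metric."""
    counts = {d: step for d in domain_laws}  # seed each domain equally
    while sum(counts.values()) < budget:
        def marginal_gain(d):
            a, b = domain_laws[d]
            return predicted_score(counts[d] + step, a, b) - predicted_score(counts[d], a, b)
        best = max(domain_laws, key=marginal_gain)
        counts[best] += step
    return counts
```

A usage sketch: fit `(a, b)` per domain from a few pilot training runs at different subset sizes, then call `greedy_mixture({"urban": (0.5, 0.3), "highway": (0.5, 0.6)}, budget=1000, step=100)` to get per-domain sample counts summing to the budget. In practice one law would be fit per (domain, metric) pair and the gains aggregated into EPDMS; the single-metric version here just keeps the shape of the loop visible.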