🤖 AI Summary
Large-scale autonomous driving datasets suffer from severe redundancy, leading to high storage and training costs with diminishing returns in model performance. To address this, we propose a model-agnostic, unsupervised data pruning method that jointly maximizes trajectory entropy and minimizes the KL divergence to the original trajectory distribution, preserving its statistical characteristics. The approach uses iterative greedy sampling to select high-information samples without requiring model feedback or hand-crafted heuristics. To our knowledge, this is the first work to systematically integrate trajectory entropy optimization with distribution matching theory, departing from conventional pruning paradigms that rely on model-specific signals or manual rules. Evaluated on the NuPlan benchmark, the method achieves up to 40% data compression while maintaining closed-loop performance (no degradation in collision rate or task completion rate), significantly reducing both storage footprint and training cost.
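The iterative greedy sampling described above can be sketched in a few lines. This is only an illustrative toy, not the paper's implementation: it assumes trajectories have already been discretized into bins (`bin_ids`), and the function name `greedy_entropy_prune` and the `keep_ratio` parameter are our own inventions. Each round, it adds the sample whose bin most increases the Shannon entropy of the kept subset's empirical distribution.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of the empirical distribution given bin counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def greedy_entropy_prune(bin_ids, keep_ratio=0.6):
    """Greedily select a subset whose empirical trajectory-bin
    distribution has maximal entropy (illustrative sketch).

    bin_ids: per-sample discretized trajectory bin (a hypothetical
             pre-processing step, not specified by the paper).
    keep_ratio: fraction of samples to retain (0.6 = 40% pruning).
    """
    bin_ids = np.asarray(bin_ids)
    n_keep = int(np.ceil(keep_ratio * len(bin_ids)))
    counts = np.zeros(bin_ids.max() + 1)
    selected = []
    remaining = set(range(len(bin_ids)))
    for _ in range(n_keep):
        best_i, best_h = None, -np.inf
        seen_bins = set()  # candidates from the same bin are equivalent
        for i in remaining:
            b = bin_ids[i]
            if b in seen_bins:
                continue
            seen_bins.add(b)
            counts[b] += 1          # tentatively add sample i
            h = entropy(counts)
            counts[b] -= 1          # undo
            if h > best_h:
                best_h, best_i = h, i
        counts[bin_ids[best_i]] += 1
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

On a skewed dataset (e.g., eight samples in one bin and one each in two others), the greedy rule first covers every bin before revisiting the dominant one, so rare driving behaviors survive pruning while redundant samples are dropped first.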
📝 Abstract
Collecting large-scale naturalistic driving data is essential for training robust autonomous driving planners. However, real-world datasets often contain many repetitive, low-value samples that inflate storage costs while contributing little to policy learning. To address this issue, we propose an information-theoretic data pruning method that reduces the training data volume without compromising model performance. Our approach evaluates the information entropy of the trajectory distribution and iteratively selects high-value samples that preserve the statistical characteristics of the original dataset in a model-agnostic manner. From a theoretical perspective, we show that maximizing trajectory entropy effectively constrains the Kullback-Leibler divergence between the pruned subset and the original data distribution, thereby maintaining generalization ability. Comprehensive experiments on the NuPlan benchmark with a large-scale imitation learning framework demonstrate that the proposed method can reduce the dataset size by up to 40% while maintaining closed-loop performance. This work provides a lightweight, theoretically grounded approach to scalable data management and efficient policy learning in autonomous driving systems.
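The claimed link between entropy maximization and KL constraint can be seen from a standard identity. The notation below is ours, not necessarily the paper's: let $q$ be the empirical distribution of the pruned subset and $p$ that of the full dataset over a discretized trajectory space. Then

$$
D_{\mathrm{KL}}(q \,\|\, p) \;=\; \sum_{x} q(x)\,\log\frac{q(x)}{p(x)} \;=\; -H(q) \;-\; \mathbb{E}_{q}\!\left[\log p(x)\right],
$$

so when the cross-entropy term $-\mathbb{E}_{q}[\log p(x)]$ is controlled (for instance, the subset remains supported on the same trajectory bins as the original data), maximizing the subset entropy $H(q)$ directly reduces the upper bound on the KL divergence, which is one plausible reading of the theoretical argument summarized above.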