🤖 AI Summary
Large-scale autonomous driving datasets suffer from severe redundancy, leading to high storage and training costs with diminishing returns in model performance. To address this, we propose a model-agnostic, unsupervised data pruning method that jointly maximizes trajectory entropy and minimizes the KL divergence to the original trajectory distribution, preserving its statistical characteristics. The approach uses iterative greedy sampling to select high-information samples without requiring model feedback or hand-crafted heuristics. To our knowledge, this is the first work to systematically integrate trajectory entropy optimization with distribution matching theory, departing from conventional pruning paradigms that rely on model-specific signals or manual rules. Evaluated on the NuPlan benchmark, the method achieves up to 40% data compression while maintaining closed-loop performance (no degradation in collision rate or task completion rate), significantly reducing both storage footprint and training cost.
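The iterative greedy sampling described above can be sketched in a few lines. This is only an illustrative toy, not the paper's implementation: it assumes trajectories have already been discretized into bins (`bin_ids`), and the function name `greedy_entropy_prune` and the `keep_ratio` parameter are our own inventions. Each round, it adds the sample whose bin most increases the Shannon entropy of the kept subset's empirical distribution.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of the empirical distribution given bin counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def greedy_entropy_prune(bin_ids, keep_ratio=0.6):
    """Greedily select a subset whose empirical trajectory-bin
    distribution has maximal entropy (illustrative sketch).

    bin_ids: per-sample discretized trajectory bin (a hypothetical
             pre-processing step, not specified by the paper).
    keep_ratio: fraction of samples to retain (0.6 = 40% pruning).
    """
    bin_ids = np.asarray(bin_ids)
    n_keep = int(np.ceil(keep_ratio * len(bin_ids)))
    counts = np.zeros(bin_ids.max() + 1)
    selected = []
    remaining = set(range(len(bin_ids)))
    for _ in range(n_keep):
        best_i, best_h = None, -np.inf
        seen_bins = set()  # candidates from the same bin are equivalent
        for i in remaining:
            b = bin_ids[i]
            if b in seen_bins:
                continue
            seen_bins.add(b)
            counts[b] += 1          # tentatively add sample i
            h = entropy(counts)
            counts[b] -= 1          # undo
            if h > best_h:
                best_h, best_i = h, i
        counts[bin_ids[best_i]] += 1
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

On a skewed dataset (e.g., eight samples in one bin and one each in two others), the greedy rule first covers every bin before revisiting the dominant one, so rare driving behaviors survive pruning while redundant samples are dropped first.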
📝 Abstract
Collecting large-scale naturalistic driving data is essential for training robust autonomous driving planners. However, real-world datasets often contain many repetitive, low-value samples that inflate storage costs while contributing little to policy learning. To address this issue, we propose an information-theoretic data pruning method that reduces the training data volume without compromising model performance. Our approach evaluates the information entropy of the trajectory distribution and iteratively selects high-value samples that preserve the statistical characteristics of the original dataset in a model-agnostic manner. From a theoretical perspective, we show that maximizing trajectory entropy effectively constrains the Kullback-Leibler divergence between the pruned subset and the original data distribution, thereby maintaining generalization ability. Comprehensive experiments on the NuPlan benchmark with a large-scale imitation learning framework demonstrate that the proposed method can reduce the dataset size by up to 40% while maintaining closed-loop performance. This work provides a lightweight, theoretically grounded approach to scalable data management and efficient policy learning in autonomous driving systems.
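The claimed link between entropy maximization and KL constraint can be seen from a standard identity. The notation below is ours, not necessarily the paper's: let $q$ be the empirical distribution of the pruned subset and $p$ that of the full dataset over a discretized trajectory space. Then

$$
D_{\mathrm{KL}}(q \,\|\, p) \;=\; \sum_{x} q(x)\,\log\frac{q(x)}{p(x)} \;=\; -H(q) \;-\; \mathbb{E}_{q}\!\left[\log p(x)\right],
$$

so when the cross-entropy term $-\mathbb{E}_{q}[\log p(x)]$ is controlled (for instance, the subset remains supported on the same trajectory bins as the original data), maximizing the subset entropy $H(q)$ directly reduces the upper bound on the KL divergence, which is one plausible reading of the theoretical argument summarized above.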