Are All Data Necessary? Efficient Data Pruning for Large-scale Autonomous Driving Dataset via Trajectory Entropy Maximization

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale autonomous driving datasets suffer from severe redundancy, leading to high storage and training costs with diminishing returns in model performance. To address this, we propose a model-agnostic, unsupervised data pruning method that jointly optimizes trajectory entropy maximization and KL divergence minimization to preserve the statistical characteristics of the original trajectory distribution. Our approach employs iterative greedy sampling to select high-information samples without requiring model feedback or hand-crafted heuristics. This work is the first to systematically integrate trajectory entropy optimization with distribution matching theory, breaking from conventional pruning paradigms reliant on model-specific signals or manual rules. Evaluated on the NuPlan benchmark, our method achieves up to 40% data compression while maintaining closed-loop performance (no degradation in collision rate or task-completion rate) and significantly reduces both storage footprint and training cost.
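The iterative greedy sampling described above could look roughly like the following sketch. This is a toy reimplementation, not the authors' code: the endpoint-based trajectory featurisation, the bin grid, and the `greedy_entropy_prune` helper are all illustrative assumptions, and only the entropy term is shown (the paper couples it with a KL-divergence objective).

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (nats) of a histogram given as raw counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def greedy_entropy_prune(trajectories, keep_ratio=0.6, bins=16, lim=50.0):
    """Iteratively pick the sample whose inclusion maximises the entropy
    of the selected subset's discretised trajectory distribution.

    Featurisation here is deliberately crude: each trajectory is reduced
    to its final (x, y) point and binned on a bins x bins grid over
    [-lim, lim] metres.
    """
    endpoints = np.array([t[-1] for t in trajectories])           # (N, 2)
    edges = np.linspace(-lim, lim, bins + 1)
    ix = np.clip(np.searchsorted(edges, endpoints[:, 0]) - 1, 0, bins - 1)
    iy = np.clip(np.searchsorted(edges, endpoints[:, 1]) - 1, 0, bins - 1)
    cells = ix * bins + iy                                        # flat cell id
    counts = np.zeros(bins * bins)
    selected, remaining = [], set(range(len(trajectories)))
    target = int(keep_ratio * len(trajectories))
    while len(selected) < target:
        best, best_h = None, -np.inf
        for i in remaining:              # O(N) candidate scan per pick
            counts[cells[i]] += 1        # tentatively add sample i
            h = entropy(counts)
            counts[cells[i]] -= 1        # undo the tentative add
            if h > best_h:
                best, best_h = i, h
        selected.append(best)
        remaining.remove(best)
        counts[cells[best]] += 1
    return selected
```

Because each pick rescans all remaining candidates, this naive version is quadratic in dataset size; a production variant would need an incremental entropy update or batched selection to scale to NuPlan-sized corpora.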

📝 Abstract
Collecting large-scale naturalistic driving data is essential for training robust autonomous driving planners. However, real-world datasets often contain a substantial amount of repetitive and low-value samples, which lead to excessive storage costs and bring limited benefits to policy learning. To address this issue, we propose an information-theoretic data pruning method that effectively reduces the training data volume without compromising model performance. Our approach evaluates the trajectory distribution information entropy of driving data and iteratively selects high-value samples that preserve the statistical characteristics of the original dataset in a model-agnostic manner. From a theoretical perspective, we show that maximizing trajectory entropy effectively constrains the Kullback-Leibler divergence between the pruned subset and the original data distribution, thereby maintaining generalization ability. Comprehensive experiments on the NuPlan benchmark with a large-scale imitation learning framework demonstrate that the proposed method can reduce the dataset size by up to 40% while maintaining closed-loop performance. This work provides a lightweight and theoretically grounded approach for scalable data management and efficient policy learning in autonomous driving systems.
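The entropy/KL link claimed in the abstract can be motivated by a standard identity. This is an illustrative decomposition under an assumed discretisation, not the paper's actual proof:

```latex
% Let P be the empirical trajectory distribution of the full dataset
% and Q that of the pruned subset, discretised over the same bins.
D_{\mathrm{KL}}(P \,\|\, Q)
  = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}
  = -H(P) \;-\; \sum_{x} P(x)\log Q(x)
```

Since $H(P)$ is fixed by the original data, minimising the divergence amounts to controlling the cross-entropy term $-\sum_x P(x)\log Q(x)$; pushing $H(Q)$ toward its maximum spreads the subset's mass across the support of $P$, preventing $Q(x)\to 0$ (and hence an unbounded cross-entropy) on states the original data visits.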
Problem

Research questions and friction points this paper is trying to address.

Reduces repetitive low-value data in autonomous driving datasets
Selects high-value samples via trajectory entropy maximization
Maintains model performance while cutting dataset size by 40%
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trajectory entropy maximization for data pruning
Model-agnostic selection of high-value driving samples
Reduces dataset size by 40% while maintaining performance
Authors

Zhaoyang Liu, Tongyi Lab, Alibaba Group (LLM, Recommendation)
Weitao Zhou, Tsinghua University (Autonomous Driving, Reinforcement Learning)
Junze Wen, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
Cheng Jing, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
Qian Cheng, University of Leeds (sustainable development, colour science)
Kun Jiang, Tsinghua University (autonomous driving)
Diange Yang, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China