AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high data redundancy and computational overhead of large-scale object detection training, this paper proposes AdaDeDup, an adaptive hybrid data-pruning framework. Its core innovation is coupling density-aware initial screening with a model-feedback-driven, cluster-adaptive deduplication mechanism: samples are first clustered and pruned based on feature-space density; a lightweight proxy model then compares task-aware loss on kept versus pruned samples within each cluster to dynamically adjust cluster-specific pruning thresholds. Evaluated on Waymo, COCO, and nuScenes using models such as BEVFormer and Faster R-CNN, AdaDeDup retains only 80% of the original training data while preserving over 98% of baseline performance, and on Waymo it cuts the accuracy degradation of random sampling by 54.3%. The method thus improves training efficiency and data utilization without compromising detection accuracy.
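The pipeline described above can be sketched in plain Python. Everything here is an illustrative assumption, not the authors' implementation: the `density` and `proxy_loss` callables stand in for real feature-space density estimates and surrogate-model losses, and the linear keep-ratio update is a guess at the shape of the cluster-adaptive adjustment.

```python
import statistics

def adadedup_prune(clusters, density, proxy_loss, base_keep=0.8, step=0.1):
    """Hypothetical sketch of AdaDeDup's cluster-adaptive pruning loop.

    clusters   -- {cluster_id: [sample, ...]} from a feature-space clustering
    density    -- callable sample -> float (higher = more redundant region)
    proxy_loss -- callable sample -> float (lightweight surrogate-model loss)
    base_keep  -- initial density-based keep ratio applied to every cluster
    step       -- how strongly the kept-vs-pruned loss gap moves the ratio
    """
    kept = {}
    for cid, samples in clusters.items():
        # 1) Initial density-based screening: keep low-density
        #    (least redundant) samples first.
        ranked = sorted(samples, key=density)
        k = max(1, round(base_keep * len(ranked)))
        keep, prune = ranked[:k], ranked[k:]

        # 2) Model feedback: compare surrogate loss on pruned vs kept
        #    samples within this cluster.
        if prune:
            gap = (statistics.mean(map(proxy_loss, prune))
                   - statistics.mean(map(proxy_loss, keep)))
        else:
            gap = 0.0

        # 3) Adapt the cluster's keep ratio: a positive gap means the
        #    pruned samples were informative, so keep more of them; a
        #    negative gap marks a redundant cluster, so prune harder.
        ratio = min(1.0, max(0.1, base_keep + step * gap))
        k = max(1, round(ratio * len(ranked)))
        kept[cid] = ranked[:k]
    return kept
```

With this shape, a cluster whose pruned samples carry high surrogate loss gets its keep ratio pushed back up, while a uniformly low-loss cluster stays at (or below) the base ratio.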

📝 Abstract
The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive pruning in redundant clusters while preserving critical data in informative ones. Extensive experiments on large-scale object detection benchmarks (Waymo, COCO, nuScenes) using standard models (BEVFormer, Faster R-CNN) demonstrate AdaDeDup's advantages. It significantly outperforms prominent baselines, substantially reduces performance degradation (e.g., over 54% versus random sampling on Waymo), and achieves near-original model performance while pruning 20% of data, highlighting its efficacy in enhancing data efficiency for large-scale model training. Code is open-sourced.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational burden in large-scale object detection training
Improves data pruning by combining density-based and model-based methods
Minimizes performance degradation while pruning redundant data samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid density-model pruning framework
Cluster-adaptive threshold adjustment
Proxy model evaluates pruning impact
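Read as a formula, the cluster-adaptive threshold adjustment in these bullets might take a form like the following (an illustrative assumption; the paper's exact update rule may differ):

```latex
\tau_c \;\leftarrow\; \operatorname{clip}\!\left(\tau_c + \eta\left(\bar{L}^{\mathrm{pruned}}_c - \bar{L}^{\mathrm{kept}}_c\right),\; \tau_{\min},\; \tau_{\max}\right)
```

where \(\tau_c\) is cluster \(c\)'s pruning threshold, \(\eta\) a step size, and \(\bar{L}^{\mathrm{pruned}}_c\), \(\bar{L}^{\mathrm{kept}}_c\) the proxy model's mean losses on pruned and kept samples. A positive gap signals an informative cluster and raises the threshold to retain more data; a negative gap licenses more aggressive pruning.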