AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high data redundancy and computational overhead of large-scale object detection training, this paper proposes AdaDeDup, an adaptive hybrid data-pruning framework. Its core innovation is coupling density-aware initial screening with a model-feedback-driven, cluster-adaptive deduplication mechanism: samples are first clustered and pruned based on feature-space density; a lightweight proxy model then compares task-aware loss on kept versus pruned samples within each cluster to dynamically adjust cluster-specific pruning thresholds. Evaluated on Waymo, COCO, and nuScenes using models such as BEVFormer and Faster R-CNN, AdaDeDup retains only 80% of the original training data while preserving over 98% of baseline performance, and on Waymo it cuts the accuracy degradation of random sampling by 54.3%. The method thus improves training efficiency and data utilization without compromising detection accuracy.
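The pipeline described above can be sketched in plain Python. Everything here is an illustrative assumption, not the authors' implementation: the `density` and `proxy_loss` callables stand in for real feature-space density estimates and surrogate-model losses, and the linear keep-ratio update is a guess at the shape of the cluster-adaptive adjustment.

```python
import statistics

def adadedup_prune(clusters, density, proxy_loss, base_keep=0.8, step=0.1):
    """Hypothetical sketch of AdaDeDup's cluster-adaptive pruning loop.

    clusters   -- {cluster_id: [sample, ...]} from a feature-space clustering
    density    -- callable sample -> float (higher = more redundant region)
    proxy_loss -- callable sample -> float (lightweight surrogate-model loss)
    base_keep  -- initial density-based keep ratio applied to every cluster
    step       -- how strongly the kept-vs-pruned loss gap moves the ratio
    """
    kept = {}
    for cid, samples in clusters.items():
        # 1) Initial density-based screening: keep low-density
        #    (least redundant) samples first.
        ranked = sorted(samples, key=density)
        k = max(1, round(base_keep * len(ranked)))
        keep, prune = ranked[:k], ranked[k:]

        # 2) Model feedback: compare surrogate loss on pruned vs kept
        #    samples within this cluster.
        if prune:
            gap = (statistics.mean(map(proxy_loss, prune))
                   - statistics.mean(map(proxy_loss, keep)))
        else:
            gap = 0.0

        # 3) Adapt the cluster's keep ratio: a positive gap means the
        #    pruned samples were informative, so keep more of them; a
        #    negative gap marks a redundant cluster, so prune harder.
        ratio = min(1.0, max(0.1, base_keep + step * gap))
        k = max(1, round(ratio * len(ranked)))
        kept[cid] = ranked[:k]
    return kept
```

With this shape, a cluster whose pruned samples carry high surrogate loss gets its keep ratio pushed back up, while a uniformly low-loss cluster stays at (or below) the base ratio.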

📝 Abstract
The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive pruning in redundant clusters while preserving critical data in informative ones. Extensive experiments on large-scale object detection benchmarks (Waymo, COCO, nuScenes) using standard models (BEVFormer, Faster R-CNN) demonstrate AdaDeDup's advantages. It significantly outperforms prominent baselines, substantially reduces performance degradation (e.g., over 54% versus random sampling on Waymo), and achieves near-original model performance while pruning 20% of data, highlighting its efficacy in enhancing data efficiency for large-scale model training. Code is open-sourced.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational burden in large-scale object detection training
Improves data pruning by combining density-based and model-based methods
Minimizes performance degradation while pruning redundant data samples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid density-model pruning framework
Cluster-adaptive threshold adjustment
Proxy model evaluates pruning impact
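Read as a formula, the cluster-adaptive threshold adjustment in these bullets might take a form like the following (an illustrative assumption; the paper's exact update rule may differ):

```latex
\tau_c \;\leftarrow\; \operatorname{clip}\!\left(\tau_c + \eta\left(\bar{L}^{\mathrm{pruned}}_c - \bar{L}^{\mathrm{kept}}_c\right),\; \tau_{\min},\; \tau_{\max}\right)
```

where \(\tau_c\) is cluster \(c\)'s pruning threshold, \(\eta\) a step size, and \(\bar{L}^{\mathrm{pruned}}_c\), \(\bar{L}^{\mathrm{kept}}_c\) the proxy model's mean losses on pruned and kept samples. A positive gap signals an informative cluster and raises the threshold to retain more data; a negative gap licenses more aggressive pruning.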