🤖 AI Summary
This work addresses the limitation of existing DETR-based models, which only model semantic uncertainty and struggle to capture spatial uncertainty effectively, while high-accuracy approaches such as deep ensembles or Monte Carlo (MC) Dropout incur substantial computational overhead and inference latency. To overcome this, the authors propose GroupEnsemble, a method that injects multiple independent sets of object queries in parallel into the Transformer decoder within a single forward pass, enabling efficient ensemble prediction under the DETR framework for the first time. By combining inter-group attention masking with MC Dropout, the approach jointly models both spatial and semantic uncertainties. Experiments on Cityscapes and COCO demonstrate that GroupEnsemble outperforms deep ensembles across multiple uncertainty estimation metrics at significantly lower computational cost.
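The summary does not say how the per-group predictions are aggregated. One common way to turn an ensemble of box predictions into separate point estimates and spatial uncertainties is the per-coordinate mean and variance across ensemble members; the sketch below illustrates this with made-up numbers, assuming the groups' boxes for one object have already been matched (the matching step itself is not specified here):

```python
import numpy as np

# Hypothetical example: boxes predicted by G = 4 query groups,
# matched to the same object; each row is (cx, cy, w, h).
boxes = np.array([
    [0.50, 0.40, 0.20, 0.30],
    [0.52, 0.41, 0.19, 0.31],
    [0.49, 0.39, 0.21, 0.29],
    [0.51, 0.40, 0.20, 0.30],
])

mean_box = boxes.mean(axis=0)    # ensemble point estimate
spatial_unc = boxes.var(axis=0)  # per-coordinate spatial uncertainty
```

Semantic uncertainty would be obtained analogously from the groups' class probabilities (e.g. the entropy of their average), but the exact measures used by the paper are reported in its experiments, not in this summary.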
📝 Abstract
Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores only reflect semantic uncertainty, failing to capture the equally important spatial uncertainty. This results in an incomplete assessment of detection reliability. On the other hand, Deep Ensembles can tackle this by providing high-quality spatial uncertainty estimates. However, their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency due to the need for multiple forward passes during inference to estimate uncertainty.
To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the Transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied to the decoder to prevent inter-group query interactions, ensuring that each group detects independently and thereby yields reliable ensemble-based uncertainty estimates. By leveraging the decoder's inherent parallelism, GroupEnsemble estimates uncertainty in a single forward pass without sequential repetition. We validated our method on autonomous driving scenes and common everyday scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
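The inter-group attention mask described above can be sketched as a block-diagonal pattern over the concatenated queries: each query may attend only to queries with the same group index. A minimal NumPy sketch (the function name and the True-means-blocked convention, as used by e.g. PyTorch's `attn_mask`, are our assumptions; the paper's actual implementation may differ):

```python
import numpy as np

def group_attention_mask(num_groups: int, queries_per_group: int) -> np.ndarray:
    """Boolean mask (True = attention blocked) that confines decoder
    self-attention to queries within the same group."""
    total = num_groups * queries_per_group
    # Group index of each query along the concatenated query axis.
    group_id = np.arange(total) // queries_per_group
    # Block attention wherever the two queries' group ids differ.
    return group_id[None, :] != group_id[:, None]

mask = group_attention_mask(num_groups=3, queries_per_group=4)
```

In a DETR-style decoder this mask would be applied only to query self-attention; cross-attention to the image features is left unmasked, so every group sees the same encoder output while remaining isolated from the other groups.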