🤖 AI Summary
This work addresses three key bottlenecks in Imaginary Supervised Object Detection (ISOD) under sim-to-real transfer: (1) low-quality synthetic data, caused by simplistic prompts, blurry images, and weak supervision; (2) slow convergence and overfitting of DETR-style detectors due to random query initialization; and (3) heightened sensitivity to pseudo-label noise induced by uniform denoising pressure. To tackle these, we propose Cascade HQP-DETR, a high-quality proposal-guided query initialization and cascade denoising framework. Specifically, we generate high-fidelity synthetic data using LLaMA-3, Flux, and Grounding DINO; initialize DETR queries from SAM-proposed regions via RoI-pooled feature encoding; and introduce a hierarchical, IoU-thresholded denoising training strategy. Trained solely on FluxVOC for 12 epochs, our method achieves 61.04% mAP@0.5 on PASCAL VOC 2007, significantly outperforming strong baselines and marking an efficient step from weakly supervised ISOD toward fully supervised real-world performance.
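The proposal-guided query initialization described above can be sketched roughly as follows: pool backbone features inside each SAM-style proposal box and project them into query embeddings, so decoding starts from image-specific priors rather than random vectors. This is an illustrative reconstruction, not the paper's implementation; the average-pooling stand-in for RoI Align, the pooling resolution, and the projection matrix `W` are all assumptions.

```python
import numpy as np

def roi_pool(feature_map, box, pool_size=2):
    """Average-pool the region of a (C, H, W) feature map inside `box`
    (x1, y1, x2, y2) into a (C, pool_size, pool_size) grid.
    A crude stand-in for RoI Align, for illustration only."""
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2, x1:x2]
    c, rh, rw = region.shape
    out = np.zeros((c, pool_size, pool_size))
    for i in range(pool_size):
        for j in range(pool_size):
            ys = slice(i * rh // pool_size, (i + 1) * rh // pool_size)
            xs = slice(j * rw // pool_size, (j + 1) * rw // pool_size)
            out[:, i, j] = region[:, ys, xs].mean(axis=(1, 2))
    return out

def encode_queries(feature_map, boxes, W):
    """Turn each proposal box into an object-query vector: RoI-pool its
    features, flatten, and apply a projection W (hypothetical, stands in
    for a learned linear layer)."""
    return [W @ roi_pool(feature_map, b).ravel() for b in boxes]
```

Each decoder query then begins from evidence specific to the input image, which is what the summary credits with faster convergence and better sim-to-real generalization.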
📝 Abstract
Object detection models demand large-scale annotated datasets, which are costly and labor-intensive to create. This motivates Imaginary Supervised Object Detection (ISOD), where models train on synthetic images and test on real images. However, existing methods face three limitations: (1) synthetic datasets suffer from simplistic prompts, poor image quality, and weak supervision; (2) DETR-based detectors, due to their random query initialization, struggle with slow convergence and overfitting to synthetic patterns, hindering real-world generalization; (3) uniform denoising pressure promotes overfitting to pseudo-label noise. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets, advancing ISOD from weak to full supervision. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals and RoI-pooled features, accelerating convergence while steering the model to learn transferable features instead of overfitting to synthetic patterns. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers, guiding the model to learn robust boundaries from reliable visual cues rather than from noisy labels. Trained for just 12 epochs solely on FluxVOC, Cascade HQP-DETR achieves a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming strong baselines, and its competitive performance when trained on real data confirms the architecture's broad applicability.
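The cascade denoising idea, progressively stricter IoU thresholds across decoder layers, can be illustrated with a toy weighting rule: each layer compares a prediction against the pseudo-label and down-weights it when it falls below that layer's threshold. The specific thresholds (0.5/0.6/0.7) and the linear down-weighting below threshold are assumptions for illustration, not the paper's exact schedule.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cascade_denoise_weights(layer_preds, pseudo_box, thresholds=(0.5, 0.6, 0.7)):
    """Per-decoder-layer loss weights: each successive layer applies a
    stricter IoU threshold, and predictions below it are down-weighted
    rather than fully penalized, so noisy pseudo-labels exert less
    uniform denoising pressure on the model."""
    weights = []
    for pred, thr in zip(layer_preds, thresholds):
        iou = box_iou(pred, pseudo_box)
        weights.append(1.0 if iou >= thr else iou / thr)
    return weights
```

A prediction that tightly matches the pseudo-label keeps full weight at every layer, while a loose match is trusted by early, permissive layers but discounted by later, stricter ones.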