🤖 AI Summary
This study addresses the lack of pre-training dataset metrics that reliably predict downstream YOLO object-detection performance for synthetically augmented training sets; conventional global metrics such as FID often fail to do so. The work systematically evaluates six GAN-, diffusion-, and hybrid-based generators across three single-class detection regimes and augmentation ratios from 10% to 150% of the real training split, assessing their impact on YOLOv11 performance. It combines global feature-space metrics computed in Inception-v3 and DINOv2 embedding spaces with object-centric distribution distances derived from bounding-box statistics. Under controlled augmentation ratios, the study shows that the relationship between generative metrics and detection performance is strongly regime-dependent. Residualized correlation analysis, which controls for augmentation quantity, shows that many apparent raw metric correlations weaken. Synthetic augmentation yields relative mAP improvements of up to +7.6% and +30.6% in the more challenging Pedestrian and PottedPlant regimes, respectively, but only marginal gains on Traffic Signs and under pretrained fine-tuning.
📝 Abstract
Synthetic images are increasingly used to augment object-detection training sets, but reliably evaluating a synthetic dataset before training remains difficult: standard global generative metrics (e.g., FID) often do not predict downstream detection mAP. We present a controlled evaluation of synthetic augmentation for YOLOv11 across three single-class detection regimes -- Traffic Signs (sparse/near-saturated), Cityscapes Pedestrian (dense/occlusion-heavy), and COCO PottedPlant (multi-instance/high-variability). We benchmark six GAN-, diffusion-, and hybrid-based generators over augmentation ratios from 10% to 150% of the real training split, and train YOLOv11 both from scratch and with COCO-pretrained initialization, evaluating on held-out real test splits (mAP@0.50:0.95). For each dataset-generator-augmentation configuration, we compute pre-training dataset metrics under a matched-size bootstrap protocol, including (i) global feature-space metrics in both Inception-v3 and DINOv2 embeddings and (ii) object-centric distribution distances over bounding-box statistics. Synthetic augmentation yields substantial gains in the more challenging regimes (up to +7.6% and +30.6% relative mAP in Pedestrian and PottedPlant, respectively) but is marginal in Traffic Signs and under pretrained fine-tuning. To separate metric signal from augmentation quantity, we report both raw and augmentation-controlled (residualized) correlations with multiple-testing correction, showing that metric-performance alignment is strongly regime-dependent and that many apparent raw associations weaken after controlling for augmentation level.
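The distinction between raw and augmentation-controlled (residualized) correlations can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so everything below is an assumption: the sketch uses a rank-based partial correlation (regress the ranks of the metric and of mAP on the ranks of the augmentation ratio, then correlate the residuals) and Benjamini-Hochberg adjustment as one possible multiple-testing correction. The function names and the toy numbers are hypothetical and are not results from the study.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residuals(y, x):
    """Residuals of the least-squares line y ~ a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    return [c - (my + slope * (a - mx)) for a, c in zip(x, y)]


def raw_and_residualized(metric, map_scores, aug_ratio):
    """Spearman correlation, and the same after removing the augmentation trend."""
    rm, rp, ra = ranks(metric), ranks(map_scores), ranks(aug_ratio)
    raw = pearson(rm, rp)
    controlled = pearson(residuals(rm, ra), residuals(rp, ra))
    return raw, controlled


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (invented): two generators at each of three augmentation ratios.
aug_ratio = [10, 10, 50, 50, 150, 150]
metric = [1, 2, 3, 4, 5, 6]          # hypothetical pre-training metric
map_scores = [2, 1, 4, 3, 6, 5]      # hypothetical downstream mAP

raw, controlled = raw_and_residualized(metric, map_scores, aug_ratio)
print(round(raw, 3), round(controlled, 3))  # 0.829 -1.0
```

In this toy case the metric and mAP both rise with the augmentation ratio, so the raw rank correlation is strongly positive, yet within each augmentation level the metric ranks the two generators in the wrong order, and the residualized correlation is negative. This is the kind of quantity-driven association that an augmentation-controlled analysis is meant to expose.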