🤖 AI Summary
Pedestrian detectors trained only on daytime data degrade severely at night, because annotated nighttime data are scarce and illumination differs sharply between day and night. To address this, the work proposes Contrastive-SDXL, a day-to-night image translation framework built on SDXL-Turbo that adds a patch-wise semantic contrastive loss guided by DINOv2 self-attention maps and an object consistency loss. Local and global semantic consistency constraints let the framework generate realistic nighttime images while preserving pedestrian structure and the associated annotations, which improves object retention during domain transfer. The synthesized images achieve an FID of 22.5, and detectors trained on them reduce miss rate by 6–7% relative to a daytime-only baseline, approaching the performance of models trained on real nighttime data.
📝 Abstract
Night-time pedestrian detection remains challenging because labelled night-time data are limited and large illumination differences make daytime-only trained detectors unreliable. Latent diffusion models (LDMs) provide a powerful basis for image-to-image translation and cross-domain augmentation, but their effectiveness in safety-critical perception depends on whether detector-relevant objects and local semantic structure are preserved when translating between source and target domains. In this work, we present Contrastive-SDXL, a day-to-night augmentation framework for night-time pedestrian detection built on SDXL-Turbo and fine-tuned using Low-Rank Adaptation (LoRA). To preserve semantic correspondence between daytime inputs and translated night-time images, we introduce a patch-wise semantic contrastive loss guided by a pretrained DINOv2 encoder rather than generator encoder features. Multi-level DINOv2 self-attention maps enforce both local and global semantic consistency, while an object consistency loss explicitly encourages pedestrian preservation. Contrastive-SDXL produces realistic night-time images, achieving a Fréchet Inception Distance (FID) of 22.5. Detectors trained with our synthetic images obtain a 6-7% reduction in miss rate compared with a daytime-only baseline, approaching the performance of detectors trained on real night-time data. These results demonstrate that consistency-driven diffusion augmentation can effectively support safety-critical night-time pedestrian detection.
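The abstract does not state the exact form of the patch-wise semantic contrastive loss. As a rough illustration only, the sketch below assumes a standard InfoNCE objective: the feature of each translated patch is pulled toward the feature of the daytime patch at the same location and pushed away from all other daytime patches. The function name `patch_infonce`, the temperature value, and the use of plain Python lists in place of DINOv2 patch features are all assumptions made for this example, not details taken from the paper.

```python
import math


def patch_infonce(src_feats, tgt_feats, temperature=0.07):
    """Generic patch-wise InfoNCE loss (illustrative, not the paper's exact loss).

    src_feats[i] and tgt_feats[i] are feature vectors for the same patch
    location in the daytime input and the translated night-time image.
    In the paper these features would come from a DINOv2 encoder; here
    they are plain lists of floats (an assumption for this sketch).
    """
    def normalize(vec):
        # L2-normalise so the dot product below is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    src = [normalize(v) for v in src_feats]
    tgt = [normalize(v) for v in tgt_feats]

    total = 0.0
    for i, query in enumerate(tgt):
        # Similarity of translated patch i to every daytime patch.
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in src]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the same-location patch as the positive.
        total += log_denom - logits[i]
    return total / len(tgt)


# Toy features: three mutually orthogonal "patches".
day = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = patch_infonce(day, day)
shuffled_loss = patch_infonce(day, [day[1], day[2], day[0]])
```

With these toy features, `aligned_loss` is smaller than `shuffled_loss`, since a translation that keeps each patch's semantics in place is penalised less than one that moves content between locations.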