🤖 AI Summary
To address the dual challenges of scarce manual annotations and a significant synthetic-to-real domain gap in UAV scenarios, this paper proposes a three-stage, diffusion-model-driven coarse-to-fine hierarchical alignment framework. The method explicitly decouples and bridges discrepancies between synthetic and real images -- in both global statistical distributions and local structural details -- through three synergistic stages: global style transfer, local super-resolution refinement, and hallucinated-instance removal. Guided by few-shot real images and distribution-consistency constraints, it achieves label-preserving, high-fidelity domain adaptation. On UAV Sim2Real benchmarks including Semantic-Drone, the approach improves mAP50 by up to +14.1 points over non-transformed baselines. Ablation studies confirm that the three stages are complementary and indispensable. This work proposes a diffusion-based hierarchical domain alignment paradigm, offering a new solution for low-resource UAV object detection.
📝 Abstract
Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data at low annotation cost. However, the domain gap between synthetic and real images prevents models trained on synthetic data from transferring effectively to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework that transforms synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global-style and local-content domain discrepancies and bridges them with three modules: (1) Global Style Transfer -- a diffusion model aligns the color, illumination, and texture statistics of synthetic images to the real-image style, using only a small real reference set; (2) Local Refinement -- a super-resolution diffusion model restores fine-grained, photorealistic details for small objects such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal -- a module that filters out human instances whose visual attributes do not match real-world data, bringing human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves detection accuracy compared to non-transformed baselines; in particular, it achieves up to a +14.1 mAP50 improvement on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at [https://github.com/liwd190019/CFHA](https://github.com/liwd190019/CFHA).
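To make the pipeline's data flow concrete, here is a minimal sketch of stages (1) and (3). This is not the paper's implementation: per-channel moment matching stands in for the style-transfer diffusion model, the super-resolution stage is omitted, and the z-score filter, function names, and thresholds are illustrative assumptions.

```python
import numpy as np

def global_style_align(syn_img, real_refs):
    """Stage 1 stand-in: match the per-channel mean/std of a synthetic
    RGB image to statistics pooled from a few real reference images.
    (The paper uses a diffusion model; this moment matching is only a
    simplified proxy for the same alignment goal.)"""
    syn = syn_img.astype(np.float64)
    ref = np.concatenate([r.reshape(-1, 3) for r in real_refs], axis=0)
    out = np.empty_like(syn)
    for c in range(3):
        s_mu, s_sd = syn[..., c].mean(), syn[..., c].std() + 1e-8
        r_mu, r_sd = ref[:, c].mean(), ref[:, c].std()
        # shift/scale synthetic statistics onto the real reference statistics
        out[..., c] = (syn[..., c] - s_mu) / s_sd * r_sd + r_mu
    return np.clip(out, 0.0, 255.0)

def filter_hallucinated(instances, real_mu, real_sd, z_thresh=3.0):
    """Stage 3 stand-in: drop instances whose attribute vector lies more
    than z_thresh standard deviations from the real-data statistics."""
    return [inst for inst in instances
            if np.all(np.abs((inst - real_mu) / real_sd) <= z_thresh)]
```

In this sketch, a synthetic frame would first pass through `global_style_align` against a handful of real reference images, and its labeled human crops would then be screened by `filter_hallucinated` before entering detector training.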