Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual challenges of scarce manual annotations and a large synthetic-to-real domain gap in UAV scenarios, this paper proposes Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage, diffusion-based framework. The method explicitly decouples the discrepancies between synthetic and real images into global style (statistical distributions) and local content (structural details), and bridges them through three stages: global style transfer, local super-resolution refinement, and hallucinated instance removal. Guided by few-shot real images and distribution consistency constraints, it adapts synthetic images to the real domain while preserving their original labels. On public UAV Sim2Real benchmarks, including Semantic-Drone, the approach improves mAP50 by up to +14.1 over non-transformed baselines. Ablation studies confirm the complementary roles of the global and local stages and the importance of hierarchical alignment. The work offers a diffusion-based hierarchical domain alignment approach for low-resource UAV object detection.

📝 Abstract
Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection because target distributions shift constantly and labeled images are scarce. As a remedy, synthetic simulators are used to generate annotated data at low annotation cost. However, the domain gap between synthetic and real images prevents models trained on synthetic data from transferring effectively to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer: a diffusion model aligns the color, illumination, and texture statistics of synthetic images to the real style, using only a small real reference set; (2) Local Refinement: a super-resolution diffusion model adds fine-grained, photorealistic details to small objects, such as human instances, while preserving shape and boundary integrity; (3) Hallucination Removal: a module filters out human instances whose visual attributes do not align with real-world data, bringing human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves detection accuracy over non-transformed baselines. Specifically, it achieves an mAP50 improvement of up to +14.1 on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.
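
The mAP50 figure quoted above counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below is a minimal illustration of that matching rule only, not the paper's evaluation code; the function names and the example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches at the mAP50 operating point if IoU >= 0.5."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (10, 10, 20, 20)            # hypothetical 10x10 px human instance
    near = (12, 10, 22, 20)          # shifted by 2 px: IoU = 80 / 120
    far = (16, 10, 26, 20)           # shifted by 6 px: IoU = 40 / 160
    print(is_true_positive(near, gt), is_true_positive(far, gt))
```

For a 10-pixel-wide instance, a 6-pixel shift is already enough to fall below the threshold, which is one reason small aerial targets are sensitive to localization quality.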

Problem

Research questions and friction points this paper is trying to address.

Bridges the domain gap between synthetic and real UAV images.
Enhances human detection accuracy in drone-based scenarios.
Preserves synthetic labels while refining global and local details.
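
Label preservation follows from changing appearance without moving pixels: if a transform only alters intensity values, the synthetic bounding boxes remain valid. The sketch below illustrates this with plain mean and standard-deviation matching on a single channel. This is a simple non-diffusion stand-in chosen for illustration, not the paper's style-transfer model, and `match_stats` and `adapt_sample` are hypothetical names.

```python
from statistics import mean, pstdev


def match_stats(image, reference):
    """Shift and scale one channel of `image` (a list of rows) so that its
    mean and standard deviation match those of `reference`."""
    src = [v for row in image for v in row]
    ref = [v for row in reference for v in row]
    s_mu, s_sd = mean(src), pstdev(src)
    r_mu, r_sd = mean(ref), pstdev(ref)
    scale = r_sd / s_sd if s_sd > 0 else 1.0
    # Values are clipped to the usual 8-bit range; positions never change.
    return [[min(255.0, max(0.0, (v - s_mu) * scale + r_mu)) for v in row]
            for row in image]


def adapt_sample(image, boxes, reference):
    """Return the restyled image together with the unchanged box labels."""
    return match_stats(image, reference), list(boxes)


if __name__ == "__main__":
    synthetic = [[10, 20], [30, 40]]        # toy synthetic channel
    real_ref = [[100, 120], [140, 160]]     # toy real reference channel
    boxes = [(0, 0, 1, 1)]                  # toy label in pixel coordinates
    restyled, kept_boxes = adapt_sample(synthetic, boxes, real_ref)
    print(restyled, kept_boxes)
```

The output image has the same height and width as the input, so every box coordinate still points at the same object.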

Innovation

Methods, ideas, or system contributions that make the work stand out.

Global style transfer aligns synthetic images with real-world color and texture.
Local refinement enhances fine details and preserves shape integrity.
Hallucination removal filters unrealistic human instances to match target distribution.
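
One way to picture the hallucination-removal idea is as a distance test in some feature space: instances far from the real reference data are dropped along with their labels. The sketch below is a rough illustration under stated assumptions, not the paper's module. The feature vectors, the centroid-distance criterion, the `max_distance` threshold, and the names `centroid` and `filter_instances` are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def filter_instances(instances, real_features, max_distance):
    """Keep instances whose feature lies within `max_distance` (Euclidean)
    of the centroid of the real reference features.

    Each instance is a dict with a "box" label and a "feature" vector, so a
    rejected instance and its label are removed together (an assumption)."""
    center = centroid(real_features)
    return [inst for inst in instances
            if math.dist(inst["feature"], center) <= max_distance]


if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 0.0]]                     # toy real features
    generated = [
        {"box": (4, 4, 12, 20), "feature": [1.0, 1.0]},   # close to real data
        {"box": (30, 8, 38, 24), "feature": [5.0, 5.0]},  # far from real data
    ]
    kept = filter_instances(generated, real, max_distance=2.0)
    print([inst["box"] for inst in kept])
```

In practice the features would come from a learned encoder and the threshold would need tuning on held-out real images; neither choice is specified in the text above.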
Wenda Li
Department of Electrical Engineering and Computer Science, University of Michigan
Meng Wu
Department of Electrical Engineering, Stanford University
Medical Imaging, Machine Learning, Computer Vision
Sungmin Eum
DEVCOM Army Research Laboratory
Heesung Kwon
DEVCOM Army Research Laboratory
Qing Qu
Assistant Professor, Dept. of EECS, University of Michigan
Machine Learning, Nonconvex Optimization, High Dimensional Data Analysis, Deep Learning Theory