🤖 AI Summary
To address the challenges of insufficient pose diversity, low photorealism, and high construction cost in few-shot aerial human detection, this paper proposes an unpaired progressive pose transfer framework. The method first synthesizes a diverse set of novel human poses; it then builds a graph over these poses based on their similarity and applies Dijkstra's algorithm to derive controllable, smoothly evolving pose sequences; finally, it performs style-preserving conditional image translation, iteratively moving images from the existing synthetic dataset through neighboring poses in the sequence to achieve high-fidelity pose augmentation. Crucially, the method requires no paired pose images and no additional synthetic data. Evaluated on three aerial benchmarks (VisDrone, Okutama-Action, and ICG), the approach significantly improves few-shot detection accuracy, demonstrating both effectiveness and cross-dataset generalizability.
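The graph-and-Dijkstra step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each pose is an `(n_joints, 2)` keypoint array, uses plain Euclidean distance as a stand-in similarity measure, and connects each pose to its `k` nearest neighbours before running Dijkstra's algorithm from a source pose to a target pose; intermediate nodes on the shortest path then form a sequence whose adjacent poses are similar.

```python
import heapq

import numpy as np


def pose_distance(p, q):
    # Euclidean distance between two (n_joints, 2) keypoint arrays
    # (a hypothetical stand-in for the paper's pose similarity).
    return float(np.linalg.norm(p - q))


def dijkstra_pose_path(poses, src, dst, k=3):
    """Shortest pose-to-pose path through a k-NN graph over the pose set."""
    n = len(poses)
    dmat = np.array(
        [[pose_distance(poses[i], poses[j]) for j in range(n)] for i in range(n)]
    )
    # For each pose, keep the k nearest other poses ([0] is the pose itself).
    nbrs = [np.argsort(dmat[i])[1 : k + 1] for i in range(n)]

    dist, prev, visited = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v in nbrs[u]:
            v = int(v)
            nd = d + dmat[u, v]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))

    # Reconstruct the src -> dst path; adjacent entries are similar poses.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

Because edge weights are pose distances, the shortest path cannot jump between dissimilar poses, which is exactly the ordering property the translator relies on.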
📝 Abstract
We present a framework for diversifying human poses in a synthetic dataset for aerial-view human detection. Our method first constructs a set of novel poses using a pose generator and then alters images in the existing synthetic dataset to assume the novel poses while maintaining the original style using an image translator. Since images corresponding to the novel poses are not available during training, the image translator is trained to be applicable only when the input and target poses are similar; training therefore requires neither the novel poses nor their corresponding images. Next, we select a sequence of target novel poses from the novel pose set, using Dijkstra's algorithm to ensure that poses closer to each other are adjacent in the sequence. Finally, we repeatedly apply the image translator to each target pose in the sequence, producing a group of novel pose images that represent a variety of limited body movements from the source pose. Experiments on three aerial-view human detection benchmarks (VisDrone, Okutama-Action, and ICG) demonstrate that, regardless of how the synthetic data is used in training or of its size, training with the pose-diversified synthetic dataset generally yields markedly better accuracy than training with the original synthetic dataset in the few-shot regime.
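The final step, repeatedly applying the translator along the ordered pose sequence, might look like the following minimal sketch. Here `translate` is a hypothetical stand-in for the trained image translator (its name and signature are assumptions, not the paper's API); the point is that each call only bridges a small pose gap, which is the regime the translator was trained for.

```python
def progressive_transfer(image, pose_sequence, translate):
    """Walk an image through a sequence of similar neighbouring poses.

    `translate(image, target_pose)` is a placeholder for the trained
    style-preserving image translator, which is only reliable when the
    input and target poses are close.
    """
    outputs = []
    current = image
    for pose in pose_sequence:
        # One small pose step per call; the output becomes the next input.
        current = translate(current, pose)
        outputs.append(current)
    return outputs
```

Each intermediate output is kept, so a single source image yields a whole group of novel-pose images, one per step of the Dijkstra-ordered sequence.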