🤖 AI Summary
This work addresses the significant performance degradation of human pose estimation under low-light conditions, caused primarily by scarce annotated data and degraded visual information. The authors propose UDAPose, an unsupervised domain adaptation framework with two parts. The first synthesizes realistic low-light images from well-lit ones, using a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to carry over high-frequency details from real low-light inputs. The second is a Dynamic Control of Attention (DCA) module that adaptively balances image cues against learned pose priors inside the Transformer-based pose estimator, improving robustness when visual cues are unreliable. The method outperforms state-of-the-art approaches, with reported AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC.
📝 Abstract
Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to exploit well-lit labels by augmenting well-lit images to mimic low-light conditions. However, handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics; both produce unrealistic images that cause pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes low-light images and dynamically fuses visual cues with pose priors for improved pose estimation. Specifically, our synthesis method incorporates a Direct-Current-based High-Pass Filter (DHF) and a Low-light Characteristics Injection Module (LCIM) to inject high-frequency details from input low-light images, overcoming the rigidity of handcrafted augmentations and the detail loss of learning-based methods. Furthermore, we introduce a Dynamic Control of Attention (DCA) module that adaptively balances image cues with learned pose priors in the Transformer architecture. Experiments show that UDAPose outperforms state-of-the-art methods, with notable AP gains of 10.1 (56.4%) on the ExLPose-test hard set (LL-H) and 7.4 (31.4%) in cross-dataset validation on EHPT-XC. Code: https://github.com/Vision-and-Multimodal-Intelligence-Lab/UDAPose
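The abstract does not spell out how the DHF is computed, so the following is only a minimal sketch of the general idea behind a direct-current-based high-pass filter, not the authors' implementation. It assumes that the DC component of an image is its mean intensity (the zero-frequency term of its Fourier transform) and that subtracting it leaves the higher-frequency residual, such as noise-like variation. The function name `remove_dc` and the sample patch values are hypothetical.

```python
from statistics import fmean


def remove_dc(image):
    """Split a 2-D grayscale image into its DC level and a residual.

    Assumption for illustration: the DC component is the mean pixel
    intensity, and the residual is what remains after subtracting it.
    """
    dc = fmean(pixel for row in image for pixel in row)
    residual = [[pixel - dc for pixel in row] for row in image]
    return dc, residual


# Hypothetical 3x3 low-light patch: dim overall, with small variations.
patch = [
    [0.04, 0.05, 0.06],
    [0.05, 0.05, 0.05],
    [0.06, 0.05, 0.04],
]

dc, residual = remove_dc(patch)
print("DC level:", dc)
for row in residual:
    print(row)
```

In this toy setting the residual averages to zero and keeps only the pixel-to-pixel variation, which is the kind of high-frequency low-light characteristic the abstract says the synthesis stage aims to preserve. How UDAPose actually combines this signal with the LCIM is described in the paper and the linked repository.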