🤖 AI Summary
Existing pose estimation methods struggle to extract multi-scale features from low-resolution images (e.g., distant human bodies or heads). To address this, we propose the Cascaded Multi-Scale Attention (CMSA) mechanism, the first to enable downsample-free cross-scale feature interaction within a CNN-ViT hybrid architecture. CMSA combines grouped multi-head self-attention with window-based local attention in a cascaded design, fusing features with heterogeneous receptive fields while avoiding the information loss caused by downsampling the input or intermediate feature maps. Evaluated on human pose and head pose estimation benchmarks, our approach outperforms state-of-the-art methods with fewer parameters, particularly under low-resolution conditions, demonstrating both higher accuracy and better parameter efficiency.
📝 Abstract
In real-world image recognition tasks such as human pose estimation, cameras often capture objects of interest, such as human bodies, at low resolution. This makes it difficult to extract and leverage multi-scale features, which are often essential for precise inference. To address this challenge, we propose a new attention mechanism, named cascaded multi-scale attention (CMSA), tailored for CNN-ViT hybrid architectures and designed to handle low-resolution inputs effectively. CMSA extracts and integrates features across multiple scales without downsampling the input image or the feature maps. This is achieved by combining grouped multi-head self-attention with window-based local attention and fusing the resulting multi-scale features in a cascaded manner. The architecture thereby improves the model's performance on low-resolution tasks such as human pose estimation and head pose estimation. Our experimental results show that the proposed method outperforms existing state-of-the-art methods on these tasks with fewer parameters, showcasing its potential for real-world scenarios where capturing high-resolution images is not feasible. Code is available at https://github.com/xyongLu/CMSA.
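To make the idea of grouped, window-based attention with cascaded fusion more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation (the linked repository is the reference for that). The function names `window_attention` and `cascaded_multi_scale_attention`, the window sizes `(2, 4, 8)`, the use of identity query/key/value projections, and the additive cascade between groups are all assumptions made for illustration. The sketch only shows how channel groups can attend over different window sizes at full spatial resolution, with each group's output feeding the next.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(x, window):
    """Self-attention restricted to non-overlapping window x window patches.

    x has shape (H, W, C); H and W must be divisible by `window`.
    The output has the same shape, so no downsampling takes place.
    """
    H, W, C = x.shape
    # Partition into windows: (num_windows, window * window, C).
    t = x.reshape(H // window, window, W // window, window, C)
    t = t.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)
    # Assumption: identity projections for q, k, v keep the sketch parameter-free.
    scores = t @ t.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores, axis=-1) @ t
    # Undo the window partition.
    out = out.reshape(H // window, W // window, window, window, C)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)


def cascaded_multi_scale_attention(x, windows=(2, 4, 8)):
    """Split channels into groups, one window size per group, and cascade.

    Assumption: the cascade adds the previous group's output to the next
    group's input before attention; the paper's exact fusion may differ.
    """
    groups = np.split(x, len(windows), axis=-1)
    carry = np.zeros_like(groups[0])
    outputs = []
    for g, w in zip(groups, windows):
        carry = window_attention(g + carry, w)
        outputs.append(carry)
    # Concatenate group outputs back to the original channel count.
    return np.concatenate(outputs, axis=-1)


x = np.random.default_rng(0).standard_normal((8, 8, 12))
y = cascaded_multi_scale_attention(x)
print(y.shape)  # (8, 8, 12)
```

Larger windows give later groups a wider receptive field, while the spatial size of every intermediate tensor stays equal to that of the input, which is the property the abstract emphasizes for low-resolution images.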