🤖 AI Summary
Existing pose estimation methods suffer from high computational overhead and model complexity. To address this, the authors propose a lightweight two-stack hourglass network. The key idea is to integrate depthwise separable convolutions and the Convolutional Block Attention Module (CBAM) into the hourglass architecture, compressing the model while strengthening its feature representation. The model has about 10% of the parameters of the original eight-stack Hourglass network (2.3M) and requires 3.7G FLOPs. In experiments on COCO and MPII it reaches an average precision (AP) of 72.07, comparing favorably with six other lightweight pose estimation models. The design balances accuracy, parameter count, and computational cost, making it a practical option for real-time human pose estimation on edge devices.
📝 Abstract
Pose estimation is a critical task in computer vision with a wide range of applications, from activity monitoring to human-robot interaction. However, most existing methods are computationally expensive or have complex architectures. Here we propose a lightweight attention-based pose estimation network that uses depthwise separable convolution and the Convolutional Block Attention Module (CBAM) on an hourglass backbone. The network significantly reduces the computational complexity (floating-point operations) and the model size (number of parameters), containing only about 10% of the parameters of the original eight-stack Hourglass network. Experiments were conducted on the COCO and MPII datasets using a two-stack hourglass backbone. The results show that our model performs well in comparison to six other lightweight pose estimation models, with an average precision of 72.07. The model achieves this performance with only 2.3M parameters and 3.7G FLOPs.
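To illustrate why replacing standard convolutions with depthwise separable ones shrinks a model, the sketch below counts the weights of a single layer under each scheme. It is only a back-of-the-envelope calculation, not the paper's parameter accounting: the channel counts and kernel size are hypothetical values chosen for illustration, biases are omitted, and the helper names are our own.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, assumed for illustration only (not taken from the paper).
c_in, c_out, k = 256, 256, 3

std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std  # equals 1/c_out + 1/k**2 for any layer size

print(f"standard:            {std} weights")
print(f"depthwise separable: {sep} weights")
print(f"ratio:               {ratio:.3f}")
```

For a 3 x 3 kernel the ratio is dominated by the 1/k² term, so each replaced layer keeps roughly a ninth of its weights. The overall figure reported in the abstract also depends on the number of stacks and on the parameters added by the attention modules, which this sketch does not model.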