🤖 AI Summary
Transformer-based video human pose estimation suffers from temporal redundancy and insufficient fine-grained perception because such models operate solely on low-resolution features. To address this, the paper proposes a multi-scale resolution encoding framework coupled with a density peaks clustering (DPC)-based dynamic token importance assessment mechanism, enabling semantic-aware fine-grained feature pruning. The method combines multi-granularity spatiotemporal modeling, adaptive token importance ranking, and redundant feature elimination, achieving substantial efficiency gains without sacrificing accuracy. On PoseTrack2017 it reaches 87.4 mAP while improving inference speed by 93.8%, and it establishes new state-of-the-art (SOTA) results in both accuracy and efficiency across three major benchmarks. The authors present it as the first work to realize multi-scale, semantically guided dynamic token reduction in video pose estimation.
📝 Abstract
Human pose estimation, with its broad applications in action recognition and motion capture, has advanced significantly. However, current Transformer-based methods for video pose estimation often struggle to manage redundant temporal information and achieve fine-grained perception because they operate only on low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and performs fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that carry important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby improving computational efficiency without sacrificing semantic richness. Empirically, our method sets new benchmarks for both performance and efficiency on three large-scale datasets. It achieves a 93.8% improvement in inference speed over the baseline while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
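The abstract does not spell out how density peaks clustering is used to rank tokens, but the classic formulation (local density ρ and distance δ to the nearest higher-density point, with ρ·δ as an importance score) maps naturally onto token selection. The sketch below illustrates that idea under our own assumptions; the function names, the Gaussian density kernel, the percentile-based cutoff `d_c`, and the `keep_ratio` parameter are all hypothetical and not taken from the paper.

```python
import numpy as np

def dpc_token_scores(tokens, d_c=None):
    """Score tokens by density-peaks importance (rho * delta).

    tokens: (N, D) array of token feature vectors.
    """
    # Pairwise Euclidean distances between tokens.
    dists = np.linalg.norm(tokens[:, None] - tokens[None, :], axis=-1)
    if d_c is None:
        # Heuristic cutoff distance: a low percentile of nonzero distances.
        d_c = np.percentile(dists[dists > 0], 20)
    # Local density via a Gaussian kernel; subtract each token's
    # self-contribution (exp(0) = 1) so rho is non-negative.
    rho = np.exp(-(dists / d_c) ** 2).sum(axis=1) - 1.0
    # delta: distance to the nearest token of higher density.
    delta = np.empty(len(tokens))
    order = np.argsort(-rho)               # indices by decreasing density
    delta[order[0]] = dists[order[0]].max()  # densest token: max distance
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]              # all denser tokens so far
        delta[i] = dists[i, higher].min()
    return rho * delta                     # high for semantic "peaks"

def prune_tokens(tokens, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of tokens, in original order."""
    scores = dpc_token_scores(tokens)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(-scores)[:k]
    return tokens[np.sort(keep)]
```

In a pruning pipeline, `prune_tokens` would be applied to the concatenated multi-frame token set before the attention layers, so the quadratic attention cost falls with the square of `keep_ratio`; this is only a plausible reading of the abstract, not the paper's implementation.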