🤖 AI Summary
Transformer-based video human pose estimation suffers from temporal redundancy and insufficient fine-grained perception because such models operate solely on low-resolution features. To address this, the paper proposes a multi-scale resolution encoding framework coupled with a density peaks clustering (DPC)-based dynamic token importance assessment mechanism, enabling semantic-aware fine-grained feature pruning. The method combines multi-granularity spatiotemporal modeling, adaptive token importance ranking, and redundant feature elimination, achieving substantial efficiency gains without sacrificing accuracy. On PoseTrack2017 it reaches 87.4 mAP while improving inference speed by 93.8%, and it establishes new state-of-the-art (SOTA) results in both accuracy and efficiency across three major benchmarks. The authors present it as the first work to realize multi-scale, semantically guided dynamic token reduction in video pose estimation.
📝 Abstract
Human pose estimation, with its broad applications in action recognition and motion capture, has advanced significantly. However, current Transformer-based methods for video pose estimation often struggle to manage redundant temporal information and achieve fine-grained perception because they operate only on low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and performs fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that carry important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby improving computational efficiency without sacrificing semantic richness. Empirically, our method sets new benchmarks for both performance and efficiency on three large-scale datasets. It achieves a 93.8% improvement in inference speed over the baseline while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
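The abstract does not spell out how density peaks clustering is used to rank tokens, but the classic formulation (local density ρ and distance δ to the nearest higher-density point, with ρ·δ as an importance score) maps naturally onto token selection. The sketch below illustrates that idea under our own assumptions; the function names, the Gaussian density kernel, the percentile-based cutoff `d_c`, and the `keep_ratio` parameter are all hypothetical and not taken from the paper.

```python
import numpy as np

def dpc_token_scores(tokens, d_c=None):
    """Score tokens by density-peaks importance (rho * delta).

    tokens: (N, D) array of token feature vectors.
    """
    # Pairwise Euclidean distances between tokens.
    dists = np.linalg.norm(tokens[:, None] - tokens[None, :], axis=-1)
    if d_c is None:
        # Heuristic cutoff distance: a low percentile of nonzero distances.
        d_c = np.percentile(dists[dists > 0], 20)
    # Local density via a Gaussian kernel; subtract each token's
    # self-contribution (exp(0) = 1) so rho is non-negative.
    rho = np.exp(-(dists / d_c) ** 2).sum(axis=1) - 1.0
    # delta: distance to the nearest token of higher density.
    delta = np.empty(len(tokens))
    order = np.argsort(-rho)               # indices by decreasing density
    delta[order[0]] = dists[order[0]].max()  # densest token: max distance
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]              # all denser tokens so far
        delta[i] = dists[i, higher].min()
    return rho * delta                     # high for semantic "peaks"

def prune_tokens(tokens, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of tokens, in original order."""
    scores = dpc_token_scores(tokens)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(-scores)[:k]
    return tokens[np.sort(keep)]
```

In a pruning pipeline, `prune_tokens` would be applied to the concatenated multi-frame token set before the attention layers, so the quadratic attention cost falls with the square of `keep_ratio`; this is only a plausible reading of the abstract, not the paper's implementation.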