Multi-Grained Feature Pruning for Video-Based Human Pose Estimation

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the temporal redundancy and loss of fine-grained perception that arise when Transformer models for video human pose estimation operate solely on low-resolution features, this paper proposes a multi-scale resolution encoding framework coupled with a dynamic token importance assessment mechanism based on density peaks clustering (DPC), enabling semantic-aware fine-grained feature pruning. The method integrates multi-granularity spatiotemporal modeling, adaptive token importance ranking, and redundant feature elimination, achieving substantial efficiency gains without sacrificing accuracy. Evaluated on PoseTrack2017, it achieves 87.4 mAP while accelerating inference by 93.8%. It also sets new state-of-the-art (SOTA) results in both accuracy and efficiency across three major benchmarks, and is the first work to realize multi-scale semantic-guided dynamic token reduction in video pose estimation.

📝 Abstract
Human pose estimation, with its broad applications in action recognition and motion capture, has experienced significant advancements. However, current Transformer-based methods for video pose estimation often face challenges in managing redundant temporal information and achieving fine-grained perception because they only focus on processing low-resolution features. To address these challenges, we propose a novel multi-scale resolution framework that encodes spatio-temporal representations at varying granularities and executes fine-grained perception compensation. Furthermore, we employ a density peaks clustering method to dynamically identify and prioritize tokens that offer important semantic information. This strategy effectively prunes redundant feature tokens, especially those arising from multi-frame features, thereby optimizing computational efficiency without sacrificing semantic richness. Empirically, it sets new benchmarks for both performance and efficiency on three large-scale datasets. Our method achieves a 93.8% improvement in inference speed compared to the baseline, while also enhancing pose estimation accuracy, reaching 87.4 mAP on the PoseTrack2017 dataset.
Problem

Research questions and friction points this paper is trying to address.

Addresses redundant temporal information in video-based human pose estimation.
Enhances fine-grained perception by processing multi-scale resolution features.
Improves computational efficiency without losing semantic richness in pose estimation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale resolution framework for spatio-temporal encoding
Density peaks clustering for dynamic token prioritization
Feature pruning to optimize computational efficiency
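The density-peaks-clustering idea above can be sketched in a few lines: each token gets a local density ρ and a distance δ to the nearest denser token, and the product ρ·δ serves as an importance score, so that redundant tokens (dense neighborhoods, but not peaks) are pruned first. This is a minimal illustration under simple assumptions, not the paper's actual implementation; the Gaussian-kernel density, cutoff distance `d_c`, and `keep_ratio` are hypothetical choices.

```python
import numpy as np

def dpc_token_scores(tokens: np.ndarray, d_c: float = 1.0) -> np.ndarray:
    """Score tokens by density-peaks clustering: gamma_i = rho_i * delta_i.

    tokens: (N, D) array of token feature vectors.
    d_c: cutoff distance for the local-density kernel (hypothetical value).
    """
    # Pairwise Euclidean distances between all tokens.
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))

    # Local density rho_i: Gaussian-kernel neighbor count, excluding self.
    rho = np.exp(-(dist / d_c) ** 2).sum(1) - 1.0

    # delta_i: distance to the nearest token of higher density; the densest
    # token gets the maximum distance by convention, so it is always a peak.
    n = len(tokens)
    delta = np.empty(n)
    for i in range(n):
        higher = dist[i, rho > rho[i]]
        delta[i] = higher.min() if higher.size else dist[i].max()

    return rho * delta  # high gamma => semantic "peak", worth keeping

def prune_tokens(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the top-scoring fraction of tokens, dropping redundant ones."""
    scores = dpc_token_scores(tokens)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]        # indices of the k highest scores
    return tokens[np.sort(keep)]          # preserve original token order
```

In a real video pipeline the tokens would come from multi-frame feature maps and the pruning would feed back into the Transformer, but the score-then-threshold structure is the same.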
Zhigang Wang
College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China
Shaojing Fan
Department of Electrical and Computer Engineering, National University of Singapore
Cognitive Vision, Computer Vision, Experimental Psychology
Zhenguang Liu
Zhejiang University
Blockchain, Smart Contract Security, Multimedia
Zheqi Wu
College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China
Sifan Wu
College of Computer Science and Technology, Jilin University, Changchun, China
Yingying Jiao
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China