High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-based human pose estimation (VHPE) methods struggle to capture global motion trends and high-frequency local joint dynamics at the same time, while the quadratic complexity of global modeling severely limits efficiency on high-resolution sequences. To address this, the paper proposes a decoupled global-local state-space framework: Mamba is extended to a 6D spatiotemporal selective scan; a Global Spatiotemporal Mamba models long-range dynamic dependencies, complemented by a sliding-window Local Refinement Mamba that sharpens fine-grained joint details; and the two branches are fused through a spatial- and temporal-modulation mechanism. The approach models multi-scale motion features jointly with linear computational complexity. On four mainstream benchmarks, it outperforms state-of-the-art methods in accuracy while offering better inference speed and model scalability.

📝 Abstract
Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently have difficulties in balancing global and local dynamic modeling and may bias the network to one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, limiting their applicability especially for high-resolution sequences. Recently, the state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba from two aspects to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a windowed space-time scan-based Local Refinement Mamba to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the proposed model outperforms state-of-the-art VHPE approaches while achieving better computational trade-offs.
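The linear-complexity claim rests on the selective scan at the core of state space models: each step folds the input into a hidden state with input-dependent parameters, so a full pass costs O(T) rather than the O(T²) of attention. A minimal sketch of that recurrence (with a diagonal state and per-step parameters; `selective_scan` and its shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def selective_scan(x, A, B, C):
    """Linear-time selective scan: h_t = A_t * h_{t-1} + B_t * x_t, y_t = C_t * h_t.

    x:       (T, d) input sequence
    A, B, C: (T, d) input-dependent parameters (diagonal state, for illustration)
    Returns y: (T, d)
    """
    T, d = x.shape
    h = np.zeros(d)
    y = np.empty_like(x)
    for t in range(T):               # single pass over the sequence -> O(T)
        h = A[t] * h + B[t] * x[t]   # state update (decay + selective input)
        y[t] = C[t] * h              # readout
    return y
```

With A fixed at 1 this degenerates to a cumulative sum; making A, B, C depend on the input is what lets the model selectively retain or discard context along the scan.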
Problem

Research questions and friction points this paper is trying to address.

Modeling global-local spatiotemporal representations for video pose estimation
Addressing quadratic complexity limitations in global dependency modeling
Extending 1D state space models for high-resolution video sequences
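Since a Mamba scan consumes 1D token sequences, extending it to video means choosing how to flatten a (T, H, W) clip into sequences; scanning along several space-time routes and their reversals is one way to expose different neighborhood orderings to the recurrence. A hypothetical illustration of such scan routes (`scan_orders` and the route names are assumptions; the paper's 6D selective space-time scan is not reproduced here):

```python
import numpy as np

def scan_orders(video):
    """Flatten a (T, H, W, C) clip into 1D token sequences along
    several space-time scan routes, plus their reversals."""
    T, H, W, C = video.shape
    routes = {
        "t_h_w": video.reshape(T * H * W, C),                 # time-major order
        "h_w_t": video.transpose(1, 2, 0, 3).reshape(-1, C),  # row-major spatial order
        "w_h_t": video.transpose(2, 1, 0, 3).reshape(-1, C),  # column-major spatial order
    }
    # reversed counterparts double the number of scan directions
    routes.update({k + "_rev": v[::-1] for k, v in list(routes.items())})
    return routes
```

Each route feeds the same linear-time scan, so adding directions multiplies cost only by a small constant rather than changing the asymptotic complexity.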
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global Spatiotemporal Mamba for global motion modeling
Local Refinement Mamba for high-frequency detail enhancement
6D selective space-time scan with linear complexity
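The local-refinement idea of restricting the scan to a window can be sketched by running any 1D scan independently inside non-overlapping windows, so the state never carries context across window boundaries. A minimal sketch (`windowed_scan`, the window size, and the zero-padding policy are assumptions for illustration):

```python
import numpy as np

def windowed_scan(tokens, win, scan_fn):
    """Apply a 1D scan independently inside non-overlapping windows of
    size `win`, limiting the recurrence to local context."""
    L, d = tokens.shape
    pad = (-L) % win                           # zero-pad to a multiple of win
    padded = np.pad(tokens, ((0, pad), (0, 0)))
    out = np.concatenate(
        [scan_fn(padded[i:i + win]) for i in range(0, len(padded), win)]
    )
    return out[:L]                             # drop the padding
```

Plugging in a cumulative-sum `scan_fn` makes the boundary behavior easy to verify: the running state resets at every window, which is exactly the locality a refinement branch exploits to preserve high-frequency keypoint detail.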