NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation

📅 2026-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses limited joint-topology expressiveness and weak cross-frame consistency in video-based human pose estimation, a task complicated by motion blur, occlusion, and complex spatiotemporal dynamics. To this end, we propose a node-centric explicit reasoning framework whose velocity-aware joint embeddings combine sub-pixel joint cues with inter-frame motion information. An attention-driven pose-query encoder generates image-conditioned joint representations, and a dual-branch decoupled spatiotemporal attention mechanism models temporal propagation and spatial constraints in specialized local and global branches. The final pose predictions are adaptively fused from these two complementary streams. The method is reported to outperform state-of-the-art approaches on three widely used video pose benchmarks, a result attributed to explicit node-centric reasoning.
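To make the idea of a velocity-aware joint embedding concrete, the following is a minimal sketch and not the paper's implementation. It assumes that sub-pixel joint positions are read from per-joint heatmaps with a soft-argmax, and that inter-frame motion is the difference between consecutive positions. The function names `soft_argmax` and `velocity_embedding`, the sharpness parameter `beta`, and the `[x, y, vx, vy]` layout are all hypothetical choices made for illustration.

```python
import math


def soft_argmax(heatmap, beta=10.0):
    """Sub-pixel (x, y): softmax-weighted mean of pixel coordinates.

    Assumption: `heatmap` is a 2D list of scores for one joint in one frame.
    `beta` is a hypothetical sharpness parameter for the softmax.
    """
    peak = max(max(row) for row in heatmap)  # subtract the peak for stability
    total = x_acc = y_acc = 0.0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            weight = math.exp(beta * (value - peak))
            total += weight
            x_acc += weight * x
            y_acc += weight * y
    return x_acc / total, y_acc / total


def velocity_embedding(heatmaps_per_frame, beta=10.0):
    """Per-frame [x, y, vx, vy] for a single joint.

    Assumption: velocity is the first-order difference of sub-pixel
    positions between consecutive frames; the first frame gets zero velocity.
    """
    coords = [soft_argmax(h, beta) for h in heatmaps_per_frame]
    embedding = []
    for t, (x, y) in enumerate(coords):
        if t == 0:
            vx = vy = 0.0
        else:
            vx = x - coords[t - 1][0]
            vy = y - coords[t - 1][1]
        embedding.append([x, y, vx, vy])
    return embedding


# Two frames of a 3x4 heatmap; the joint sits between two pixels
# and moves roughly one pixel to the right.
frame0 = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
frame1 = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
embedding = velocity_embedding([frame0, frame1])
```

In this toy input the first-frame position falls halfway between two pixel centers, which is the kind of sub-pixel cue an integer argmax would discard. A learned model would typically project such raw coordinates and velocities, together with appearance features, into a higher-dimensional space.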

📝 Abstract

Video-based human pose estimation remains challenged by motion blur, occlusion, and complex spatiotemporal dynamics. Existing methods often rely on heatmaps or implicit spatio-temporal feature aggregation, which limits joint topology expressiveness and weakens cross-frame consistency. To address these problems, we propose a novel node-centric framework that explicitly integrates visual, temporal, and structural reasoning for accurate pose estimation. First, we design a visuo-temporal velocity-based joint embedding that fuses sub-pixel joint cues and inter-frame motion to build appearance- and motion-aware representations. Then, we introduce an attention-driven pose-query encoder, which applies attention over joint-wise heatmaps and frame-wise features to map the joint representations into a pose-aware node space, generating image-conditioned joint-aware node embeddings. Building upon these node embeddings, we propose a dual-branch decoupled spatio-temporal attention graph that models temporal propagation and spatial constraint reasoning in specialized local and global branches. Finally, a node-space expert fusion module is proposed to adaptively fuse the complementary outputs from both branches, integrating local and global cues for final joint predictions. Extensive experiments on three widely used video pose benchmarks demonstrate that our method outperforms state-of-the-art methods. The results highlight the value of explicit node-centric reasoning, offering a new perspective for advancing video-based human pose estimation.
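The decoupling of spatial and temporal reasoning over node embeddings can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the architecture described in the paper: attention uses identity query/key/value projections instead of learned ones, the two streams are a per-frame attention over joints and a per-joint attention over frames, and the adaptive expert fusion is replaced by a fixed scalar `gate`. The abstract does not specify how the local and global branches are laid out, so that split is not modeled here. All function and parameter names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * value[i] for w, value in zip(weights, tokens))
             for i in range(dim)]
        )
    return output


def decoupled_st_attention(nodes, gate=0.5):
    """Fuse a spatial and a temporal attention stream.

    nodes[t][j] is the embedding of joint j in frame t.
    Assumption: `gate` is a fixed scalar standing in for a learned,
    adaptive fusion module.
    """
    num_frames, num_joints = len(nodes), len(nodes[0])
    # Spatial stream: joints attend to each other within each frame.
    spatial = [self_attention(nodes[t]) for t in range(num_frames)]
    # Temporal stream: each joint attends to its own history across frames.
    per_joint = [
        self_attention([nodes[t][j] for t in range(num_frames)])
        for j in range(num_joints)
    ]
    temporal = [
        [per_joint[j][t] for j in range(num_joints)]
        for t in range(num_frames)
    ]
    # Convex combination of the two streams, node by node.
    return [
        [
            [gate * s + (1.0 - gate) * u
             for s, u in zip(spatial[t][j], temporal[t][j])]
            for j in range(num_joints)
        ]
        for t in range(num_frames)
    ]


# 2 frames, 3 joints, 2-dimensional node embeddings.
nodes = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]],
]
fused = decoupled_st_attention(nodes, gate=0.5)
```

Keeping the two streams separate means each attention map stays small (joints by joints, or frames by frames) instead of one joint attention over every node in every frame, which is one common motivation for decoupled designs.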

Problem

Research questions and friction points this paper is trying to address.

video-based human pose estimation

motion blur

occlusion

spatiotemporal dynamics

joint topology

Innovation

Methods, ideas, or system contributions that make the work stand out.

node-centric reasoning

decoupled spatio-temporal attention

visuo-temporal joint embedding