🤖 AI Summary
This work addresses multi-instance hand joint pose tracking in minimally invasive surgical videos, a task complicated by occlusions, motion blur, and anatomical ambiguity. We propose a Temporal Graph Convolutional Network (T-GCN) that integrates optical-flow-driven inter-frame motion constraints with hand topology priors; to our knowledge, it is the first end-to-end hand pose estimation framework to explicitly model spatiotemporal consistency. We further introduce multi-scale feature fusion and a differentiable bone-projection loss. On a real surgical video dataset, our method achieves a mean joint error of 8.2 mm, 23% lower than the state of the art, and runs at 32 FPS, satisfying clinical real-time requirements. In short, this is the first surgery-specific, spatiotemporally consistent hand pose estimation architecture, substantially improving accuracy and robustness under complex intraoperative conditions.
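To make the topology-prior idea concrete, here is a minimal sketch of one plausible form of a bone-consistency penalty: predicted bone lengths are compared against canonical hand priors. The joint indices, bone list, and prior lengths below are illustrative assumptions, not the paper's actual hand model or loss definition.

```python
import numpy as np

# Hypothetical bone list: each entry is a (parent_joint, child_joint)
# index pair, e.g. a wrist-to-fingertip chain. Illustrative only.
BONES = [(0, 1), (1, 2), (2, 3)]
# Assumed canonical bone lengths in mm (not from the paper).
PRIOR_LENGTHS = np.array([40.0, 35.0, 30.0])

def bone_consistency_loss(joints: np.ndarray) -> float:
    """joints: (J, 3) array of predicted 3D joint positions in mm.

    Returns the mean squared deviation of predicted bone lengths
    from the prior lengths; differentiable when implemented with
    an autograd framework instead of NumPy.
    """
    vecs = np.stack([joints[c] - joints[p] for p, c in BONES])
    lengths = np.linalg.norm(vecs, axis=1)
    return float(np.mean((lengths - PRIOR_LENGTHS) ** 2))
```

A prediction whose bones match the priors exactly incurs zero penalty; deviations are penalized quadratically, which is what makes such a term easy to combine with a standard joint-position loss during training.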