A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery

πŸ“… 2026-01-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the significant challenges of 3D hand pose estimation in surgical environments, where strong illumination, occlusions, and uniform glove-wearing lead to highly homogeneous hand appearances, compounded by the scarcity of annotated data. The authors propose the first general-purpose, multi-view 3D hand pose estimation pipeline that requires neither training nor domain-specific fine-tuning. Their approach leverages off-the-shelf pre-trained models to sequentially perform human detection, whole-body pose estimation, and 2D hand keypoint prediction, followed by a multi-view geometric optimization to recover 3D poses. Additionally, they introduce the first large-scale benchmark dataset for 3D hand pose estimation in surgical settings, comprising over 68,000 frames with manually annotated 2D hand poses and triangulated 3D ground truth. Experiments demonstrate that the proposed method reduces the mean joint error by 31% in 2D and 76% in 3D, substantially outperforming existing baselines.

πŸ“ Abstract
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training.

Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity.

Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error.

Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
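The paper's constrained 3D optimization is not detailed in this summary, but multi-view pipelines of this kind are typically initialized by linear (DLT) triangulation of each 2D keypoint across calibrated views. Below is a minimal, illustrative sketch of that step; the function name and interface are assumptions, not the authors' code.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Linear (DLT) triangulation of one keypoint seen in several views.

    proj_mats: list of 3x4 camera projection matrices (one per view).
    points_2d: list of (x, y) pixel observations, one per view.
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each observation contributes two linear constraints on the
        # homogeneous 3D point X: x * (P[2] @ X) = P[0] @ X, and
        # y * (P[2] @ X) = P[1] @ X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.asarray(rows)
    # The solution is the right singular vector of A with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

In a full pipeline, this per-joint estimate would then be refined by the constrained optimization the paper describes (e.g. enforcing hand-skeleton consistency and down-weighting occluded or low-confidence detections).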
Problem

Research questions and friction points this paper is trying to address.

3D hand pose estimation
surgical environment
occlusion
annotated dataset
gloved hands
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view pipeline
3D hand pose estimation
surgical benchmark dataset
constrained 3D optimization
off-the-shelf pretrained models
Valery Fischer (University Hospital Balgrist, University of Zurich, Switzerland; ETH ZΓΌrich, Switzerland)
Alan Magdaleno (University Hospital Balgrist, University of Zurich, Switzerland)
A. Calek (University Hospital Balgrist, University of Zurich, Switzerland)
N. Cavalcanti (University Hospital Balgrist, University of Zurich, Switzerland)
Nathan Hoffman (University Hospital Balgrist, University of Zurich, Switzerland)
Christoph Germann (University Hospital Balgrist, University of Zurich, Switzerland)
Joschua Wuthrich (University Hospital Balgrist, University of Zurich, Switzerland)
Max Krahenmann (University Hospital Balgrist, University of Zurich, Switzerland)
Mazda Farshad (University Hospital Balgrist, University of Zurich, Switzerland)
Philipp Furnstahl (University Hospital Balgrist, University of Zurich, Switzerland)
Lilian Calvet
Postdoc in Computer Vision
computer vision, machine learning, augmented reality, medical imaging, computer-assisted interventions