A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery

πŸ“… 2026-01-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the significant challenges of 3D hand pose estimation in surgical environments, where strong illumination, occlusions, and uniform glove-wearing lead to highly homogeneous hand appearances, compounded by the scarcity of annotated data. The authors propose the first general-purpose, multi-view 3D hand pose estimation pipeline that requires neither training nor domain-specific fine-tuning. Their approach leverages off-the-shelf pre-trained models to sequentially perform human detection, whole-body pose estimation, and 2D hand keypoint prediction, followed by a multi-view geometric optimization to recover 3D poses. Additionally, they introduce the first large-scale benchmark dataset for 3D hand pose estimation in surgical settings, comprising over 68,000 frames with manually annotated 2D hand poses and triangulated 3D ground truth. Experiments demonstrate that the proposed method reduces the mean joint error by 31% in 2D and 76% in 3D, substantially outperforming existing baselines.

πŸ“ Abstract
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training.

Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity.

Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error.

Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
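The paper's constrained 3D optimization is not detailed in this summary, but multi-view pipelines of this kind are typically initialized by linear (DLT) triangulation of each 2D keypoint across calibrated views. Below is a minimal, illustrative sketch of that step; the function name and interface are assumptions, not the authors' code.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Linear (DLT) triangulation of one keypoint seen in several views.

    proj_mats: list of 3x4 camera projection matrices (one per view).
    points_2d: list of (x, y) pixel observations, one per view.
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each observation contributes two linear constraints on the
        # homogeneous 3D point X: x * (P[2] @ X) = P[0] @ X, and
        # y * (P[2] @ X) = P[1] @ X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.asarray(rows)
    # The solution is the right singular vector of A with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

In a full pipeline, this per-joint estimate would then be refined by the constrained optimization the paper describes (e.g. enforcing hand-skeleton consistency and down-weighting occluded or low-confidence detections).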
Problem

Research questions and friction points this paper is trying to address.

3D hand pose estimation
surgical environment
occlusion
annotated dataset
gloved hands
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view pipeline
3D hand pose estimation
surgical benchmark dataset
constrained 3D optimization
off-the-shelf pretrained models
Valery Fischer (University Hospital Balgrist, University of Zurich, Switzerland; ETH ZΓΌrich, Switzerland)
Alan Magdaleno (University Hospital Balgrist, University of Zurich, Switzerland)
A. Calek (University Hospital Balgrist, University of Zurich, Switzerland)
N. Cavalcanti (University Hospital Balgrist, University of Zurich, Switzerland)
Nathan Hoffman (University Hospital Balgrist, University of Zurich, Switzerland)
Christoph Germann (University Hospital Balgrist, University of Zurich, Switzerland)
Joschua Wuthrich (University Hospital Balgrist, University of Zurich, Switzerland)
Max Krahenmann (University Hospital Balgrist, University of Zurich, Switzerland)
Mazda Farshad (University Hospital Balgrist, University of Zurich, Switzerland)
Philipp Furnstahl (University Hospital Balgrist, University of Zurich, Switzerland)
Lilian Calvet
Postdoc in Computer Vision
computer vision, machine learning, augmented reality, medical imaging, computer-assisted interventions