🤖 AI Summary
This work addresses the challenge of 6D pose estimation for previously unseen surgical instruments in operating room scenarios by proposing the first multi-view method that requires no task-specific training. Relying solely on textured CAD models as priors, the approach integrates a pre-trained feature extractor, cross-view attention mechanisms, and occlusion-aware multi-view contour registration. High-precision pose estimates are achieved through multi-view geometric consistency verification, triangulation, and reprojection refinement. Evaluated on the real-world MVPSP surgical dataset, the method attains millimeter-level accuracy comparable to supervised approaches, demonstrating the robustness of an annotation-free, training-free strategy for tracking novel instruments in dynamic clinical environments.
📝 Abstract
**Purpose:** Accurate detection and 6D pose estimation of surgical instruments are crucial for many computer-assisted interventions. However, supervised methods lack flexibility for new or unseen tools and require extensive annotated data. This work introduces a training-free pipeline for accurate multi-view 6D pose estimation of unseen surgical instruments that requires only a textured CAD model as prior knowledge.

**Methods:** Our pipeline consists of two main stages. First, for detection, we generate object mask proposals in each view and score their similarity to rendered templates using a pre-trained feature extractor. Detections are matched across views, triangulated into 3D instance candidates, and filtered using multi-view geometric consistency. Second, for pose estimation, a set of pose hypotheses is iteratively refined and scored using feature-metric scores with cross-view attention. The best hypothesis undergoes a final refinement using a novel multi-view, occlusion-aware contour registration that minimizes the reprojection errors of unoccluded contour points.

**Results:** The proposed method was rigorously evaluated on real-world surgical data from the MVPSP dataset. It achieves millimeter-accurate pose estimates on par with supervised methods under controlled conditions, while generalizing fully to unseen instruments. These results demonstrate the feasibility of training-free, marker-less detection and tracking in surgical scenes, and highlight the unique challenges of surgical environments.

**Conclusion:** We present a novel and flexible pipeline that effectively combines state-of-the-art foundation models, multi-view geometry, and contour-based refinement for high-accuracy 6D pose estimation of surgical instruments without task-specific training. This approach enables robust instrument tracking and scene understanding in dynamic clinical environments.
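The detection stage scores each mask proposal against templates rendered from the CAD model in the feature space of a pre-trained extractor. Below is a minimal sketch of such feature-similarity scoring, assuming proposal and template features have already been extracted; the function name `score_proposals` and the cosine-similarity / best-template aggregation are illustrative assumptions, not the paper's exact scoring function:

```python
import numpy as np

def score_proposals(proposal_feats, template_feats):
    """Score mask proposals against rendered CAD templates.

    proposal_feats: (N, D) array, one feature vector per mask proposal.
    template_feats: (M, D) array, one feature vector per rendered template.
    Returns an (N,) array: each proposal's cosine similarity to its
    best-matching template (a hypothetical aggregation choice).
    """
    a = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    b = template_feats / np.linalg.norm(template_feats, axis=1, keepdims=True)
    sim = a @ b.T                # (N, M) cosine similarities
    return sim.max(axis=1)       # best template match per proposal
```

Proposals with low best-template similarity can then be discarded before cross-view matching.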
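Matched detections are triangulated into 3D instance candidates and filtered by multi-view geometric consistency. A sketch of this step using standard linear (DLT) triangulation and a reprojection-error check; the 5-pixel threshold and the exact filtering rule are assumptions for illustration:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.

    P1, P2: 3x4 camera projection matrices; x1, x2: 2D pixel coordinates
    of the matched detection centers. Returns the 3D point.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                    # null vector = homogeneous 3D point
    return X[:3] / X[3]

def reprojection_error(P, X, x):
    """Pixel distance between the projection of 3D point X and observation x."""
    xh = P @ np.append(X, 1.0)
    return np.linalg.norm(xh[:2] / xh[2] - x)

def consistent(Ps, xs, thresh_px=5.0):
    """Keep a candidate only if its triangulation (from the first two views)
    reprojects within thresh_px pixels in every view (illustrative rule)."""
    X = triangulate_dlt(Ps[0], Ps[1], xs[0], xs[1])
    return all(reprojection_error(P, X, x) < thresh_px for P, x in zip(Ps, xs))
```

Candidates that fail the consistency check across the camera array are rejected as mismatches.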
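The final contour registration minimizes reprojection errors of unoccluded contour points over all views. A minimal sketch with axis-angle pose parameters and SciPy's Levenberg–Marquardt solver, where occlusion handling is reduced to a precomputed per-view visibility mask; all function names and the parameterization are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def contour_residuals(pose, P, model_pts, obs_px, visible):
    """Reprojection residuals of the visible (unoccluded) contour points.

    pose: [rx, ry, rz, tx, ty, tz], axis-angle rotation plus translation
    mapping model points into the world frame; P: 3x4 projection matrix;
    obs_px: observed 2D contour points; visible: boolean occlusion mask.
    """
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    world = model_pts @ R.T + pose[3:]
    homog = np.hstack([world, np.ones((len(world), 1))])
    px = homog @ P.T
    px = px[:, :2] / px[:, 2:3]          # perspective division
    return (px - obs_px)[visible].ravel()  # occluded points excluded

def refine_pose(pose0, views):
    """Refine an initial pose by stacking residuals over all views.

    views: list of (P, model_pts, obs_px, visible) tuples, one per camera.
    """
    def stacked(p):
        return np.concatenate([contour_residuals(p, *v) for v in views])
    return least_squares(stacked, pose0, method="lm").x
```

Because residuals from every camera are stacked into one least-squares problem, the refinement exploits all views jointly rather than averaging per-view estimates.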