Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automatically detecting instrument handover events in surgical videos, which is complicated by frequent occlusions, cluttered backgrounds, and complex temporal dynamics. The authors propose a spatiotemporal vision framework that integrates Vision Transformers for spatial feature extraction with a unidirectional LSTM to model temporal dependencies. Through multi-task learning, the model jointly predicts both the occurrence and direction of handovers, while discrete event localization is achieved via confidence peak detection. Notably, this study introduces handover direction classification into surgical handover detection for the first time, employs a unified architecture to avoid cascaded errors, and enhances interpretability using Layer-CAM. Evaluated on a kidney transplantation surgery video dataset, the method achieves an F1 score of 0.84 for handover detection and an average F1 of 0.72 for direction classification, significantly outperforming single-task models and the VideoMamba baseline.
📝 Abstract
Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
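The event localization step described in the abstract (turning the per-frame handover confidence signal into discrete events via peak detection) can be sketched in plain Python. The threshold and minimum-separation values below are illustrative assumptions, not the paper's actual hyperparameters, and the function name is hypothetical.

```python
def detect_handover_events(confidences, threshold=0.5, min_gap=15):
    """Locate discrete handover events as peaks in a confidence signal.

    confidences: per-frame handover probabilities produced by the model.
    threshold: minimum confidence for a frame to count as a peak
               (illustrative value, not taken from the paper).
    min_gap: minimum frame distance between two events, e.g. ~0.5 s
             at 30 fps (also an assumed value).
    Returns a list of frame indices, one per detected event.
    """
    events = []
    for i in range(1, len(confidences) - 1):
        c = confidences[i]
        # A frame is a candidate peak if it is a local maximum
        # and its confidence exceeds the threshold.
        if c >= threshold and c >= confidences[i - 1] and c > confidences[i + 1]:
            # Suppress peaks that fall too close to the previous event,
            # so one prolonged handover is not counted multiple times.
            if not events or i - events[-1] >= min_gap:
                events.append(i)
    return events
```

For example, a signal that rises to a single clear peak, such as `[0.1, 0.2, 0.9, 0.3, 0.1]`, yields one event at frame index 2; the `min_gap` suppression then merges closely spaced peaks into a single detection.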
Problem

Research questions and friction points this paper is trying to address.

surgical instrument handover
event-level detection
intraoperative video
occlusions
temporal dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer
LSTM
multi-task learning
event-level detection
interpretability
Katerina Katsarou
Fraunhofer HHI, Berlin, Germany
George Zountsas
Fraunhofer HHI, Berlin, Germany; Technical University of Berlin, Germany
Karam Tomotaki-Dawoud
Fraunhofer HHI, Berlin, Germany
Alexander Ehrenhoefer
Fraunhofer HHI, Berlin, Germany; Technical University of Berlin, Germany
Paul Chojecki
Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI
human-computer interaction · spatial interaction · multimodal interaction · usability engineering
David Przewozny
Fraunhofer HHI, Berlin, Germany
Igor Maximilian Sauer
Charité - Universitätsmedizin Berlin, Germany
Amira Mouakher
Université de Perpignan, Perpignan, France
Sebastian Bosse
Head of Interactive & Cognitive Systems, Fraunhofer HHI, Germany
computer vision · human-computer interaction · hybrid models · machine learning · cognition modelling