🤖 AI Summary
Existing methods for visual guidance in minimally invasive surgery often conflate visual attention estimation with camera control or rely on object-centric assumptions, hindering stable and accurate field-of-view guidance. This work formulates surgical attention tracking as a spatio-temporal learning problem and introduces, for the first time, a framework that integrates temporal proposal re-ranking with motion-aware refinement to generate per-frame dense attention heatmaps, enabling continuous and interpretable visual guidance. The study contributes SurgAtt-1.16M, a large-scale, clinically annotated benchmark supporting cross-institutional and cross-procedural analysis, and demonstrates state-of-the-art performance across multiple surgical datasets. The proposed approach remains robust under challenging conditions, including occlusions, multi-instrument interference, and cross-domain scenarios, and can be directly deployed for robotic field-of-view planning and autonomous camera control.
📝 Abstract
Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level re-ranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.
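To make the core idea concrete, the sketch below illustrates motion-aware proposal re-ranking in its simplest form. This is not the authors' implementation: all function names, the scoring formula, and the `alpha` weighting are hypothetical, and the abstract does not specify how SurgAtt-Tracker actually combines appearance confidence with temporal coherence. The sketch only shows the general pattern of preferring candidate heatmaps whose attention peak stays temporally consistent with the previous frame, rather than regressing a heatmap directly.

```python
import numpy as np

def select_attention_heatmap(proposals, scores, prev_center, alpha=0.7):
    """Re-rank candidate attention heatmaps with a simple motion prior.

    proposals   : (N, H, W) candidate heatmaps for the current frame
    scores      : (N,) appearance confidences for each proposal
    prev_center : (row, col) attention peak from the previous frame
    alpha       : hypothetical weight trading appearance vs. coherence
    """
    n, h, w = proposals.shape
    # Peak location of each candidate heatmap.
    centers = np.array(
        [np.unravel_index(np.argmax(p), (h, w)) for p in proposals],
        dtype=float,
    )
    # Motion prior: penalize peaks that jump far from the previous
    # frame's attention center (temporal coherence assumption).
    dists = np.linalg.norm(centers - np.asarray(prev_center, float), axis=1)
    motion_prior = np.exp(-dists / max(h, w))
    combined = alpha * np.asarray(scores) + (1 - alpha) * motion_prior
    best = int(np.argmax(combined))
    return proposals[best], tuple(centers[best].astype(int))

# Toy usage: a lower-confidence proposal near the previous attention
# center can outrank a higher-confidence proposal that jumps away.
a = np.zeros((8, 8)); a[4, 4] = 1.0   # peak near prev_center
b = np.zeros((8, 8)); b[0, 0] = 1.0   # peak far from prev_center
heatmap, center = select_attention_heatmap(
    np.stack([a, b]), scores=[0.6, 0.7], prev_center=(4, 4)
)
```

In this toy case the motion prior overrides the small appearance-score gap, so the temporally coherent proposal at (4, 4) is selected; the actual method presumably learns this trade-off rather than fixing it by hand.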