🤖 AI Summary
This paper addresses the limitations of modeling instrument segmentation, pose estimation, and action recognition in isolation for minimally invasive surgery (MIS) vision: excessive memory consumption, low inference efficiency, and loss of semantic coherence across tasks. It presents a systematic review of multi-task learning (MTL) in this domain. First, it establishes an MTL taxonomy tailored to surgical vision, uncovering task-specific semantic coupling patterns and gradient conflict mechanisms. Second, it proposes a unified framework integrating a shared feature encoder, gradient normalization, uncertainty-aware loss weighting, and anatomy-guided attention. Based on an analysis of 87 studies, the work identifies three critical bottlenecks: poor generalizability, insufficient real-time performance, and limited clinical interpretability. Finally, it recommends standardized evaluation protocols. The study offers both a theoretical foundation and a practical framework for advancing MTL in surgical vision.
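To make "uncertainty-aware loss weighting" concrete, here is a minimal sketch of one common formulation, in which each task loss L_i is scaled by exp(-s_i) and regularized by s_i, with s_i a learnable log-variance. This is an illustration under stated assumptions, not necessarily the formulation used in the paper: the function names and the numeric loss values are hypothetical, and the task losses are held fixed so that the behavior of the weights is easy to see.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i.

    s_i is a log-variance per task (an assumed, commonly used
    parameterization; the paper may use a different one).
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


def log_var_gradients(task_losses, log_vars):
    """Gradient of the combined loss w.r.t. each s_i: 1 - exp(-s_i) * L_i."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(task_losses, log_vars)]


def fit_log_vars(task_losses, steps=500, lr=0.1):
    """Plain gradient descent on s_i with the task losses held fixed.

    Illustration only: in real training the losses change every step
    and s_i is learned jointly with the network weights.
    """
    log_vars = [0.0] * len(task_losses)
    for _ in range(steps):
        grads = log_var_gradients(task_losses, log_vars)
        log_vars = [s - lr * g for s, g in zip(log_vars, grads)]
    return log_vars


# Hypothetical loss magnitudes for the three tasks named in the summary.
losses = {"segmentation": 2.0, "pose": 0.5, "action": 1.0}
fitted = fit_log_vars(list(losses.values()))
weights = {name: math.exp(-s) for name, s in zip(losses, fitted)}

for name, weight in weights.items():
    print(f"{name}: weight = {weight:.3f}")
```

With the losses frozen, each s_i settles at log L_i, so the effective weight exp(-s_i) approaches 1 / L_i: tasks with larger loss are down-weighted and tasks with smaller loss are up-weighted, which keeps any single task from dominating the shared encoder's updates.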