🤖 AI Summary
Existing surgical action triplet recognition methods rely on frame-level classification, failing to associate actions with specific instrument instances; spatial localization typically employs class activation maps, which lack the fine-grained precision and robustness required for instrument–tissue interaction modeling. To address these limitations, we propose a novel task—**instance-level surgical action triplet segmentation**—and introduce CholecTriplet-Seg, the first large-scale, densely annotated dataset enabling spatial grounding of <instrument, verb, target> triplets. We further design TargetFusionNet, the first strongly supervised instance segmentation framework for this task, integrating instrument instance queries with weak anatomical priors via a target-aware fusion mechanism. Our approach achieves state-of-the-art performance across recognition, detection, and triplet segmentation metrics, demonstrating that jointly modeling instance-level supervision and target-informed anatomical priors significantly enhances surgical scene understanding.
📝 Abstract
Understanding surgical instrument–tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have relied primarily on class activation maps, which lack the precision and robustness required for detailed instrument–tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation (triplet segmentation for short), a new unified task that produces spatially grounded <instrument, verb, target> outputs. We begin by presenting CholecTriplet-Seg, a large-scale dataset of over 30,000 annotated frames that links instrument instance masks with action verb and anatomical target annotations, establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism, addressing the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets, and the proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.
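The abstract's target-aware fusion mechanism, which enriches instrument instance queries with weak anatomy priors, can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of the general idea (attention-based fusion followed by a residual connection), not the paper's actual implementation; all function and variable names here are assumptions.

```python
import numpy as np

def target_aware_fusion(instrument_queries, anatomy_priors):
    """Hypothetical sketch: fuse instrument instance queries with weak anatomy priors.

    instrument_queries: (num_queries, dim) - one embedding per instrument instance
    anatomy_priors:     (num_targets, dim) - coarse embeddings of anatomical targets
    Returns fused queries of shape (num_queries, dim).
    """
    # Scaled dot-product attention from each instrument query to the anatomy priors.
    dim = instrument_queries.shape[-1]
    scores = instrument_queries @ anatomy_priors.T / np.sqrt(dim)
    # Row-wise softmax over anatomy classes.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Target context: a prior-weighted mixture per instrument query.
    target_context = weights @ anatomy_priors
    # Residual fusion: queries keep instrument identity and gain target information.
    return instrument_queries + target_context

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 16))  # 5 instrument instance queries
priors = rng.normal(size=(6, 16))   # 6 weak anatomy-prior embeddings
fused = target_aware_fusion(queries, priors)
print(fused.shape)  # (5, 16)
```

In this sketch the fused queries could then feed the verb and target classification heads, so that each predicted mask carries its own <instrument, verb, target> label.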