🤖 AI Summary
Existing surgical action triplet recognition methods rely on frame-level classification, failing to associate actions with specific instrument instances; spatial localization typically employs class activation maps, which lack the fine-grained precision and robustness required for instrument–tissue interaction modeling. To address these limitations, we propose a novel task—**instance-level surgical action triplet segmentation**—and introduce CholecTriplet-Seg, the first large-scale, densely annotated dataset enabling spatial grounding of <instrument, verb, target> triplets. We further design TargetFusionNet, the first strongly supervised instance segmentation framework for this task, integrating instrument instance queries with weak anatomical priors via a target-aware fusion mechanism. Our approach achieves state-of-the-art performance across recognition, detection, and triplet segmentation metrics, demonstrating that jointly modeling instance-level supervision and target-informed anatomical priors significantly enhances surgical scene understanding.
📝 Abstract
Understanding surgical instrument–tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have relied primarily on class activation maps, which lack the precision and robustness required for detailed instrument–tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation (triplet segmentation for short), a new unified task that produces spatially grounded <instrument, verb, target> outputs. We begin by presenting CholecTriplet-Seg, a large-scale dataset of over 30,000 annotated frames that links instrument instance masks with action verb and anatomical target annotations, establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism, addressing the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets, and the proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.
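The abstract's target-aware fusion mechanism, which enriches instrument instance queries with weak anatomy priors, can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of the general idea (attention-based fusion followed by a residual connection), not the paper's actual implementation; all function and variable names here are assumptions.

```python
import numpy as np

def target_aware_fusion(instrument_queries, anatomy_priors):
    """Hypothetical sketch: fuse instrument instance queries with weak anatomy priors.

    instrument_queries: (num_queries, dim) - one embedding per instrument instance
    anatomy_priors:     (num_targets, dim) - coarse embeddings of anatomical targets
    Returns fused queries of shape (num_queries, dim).
    """
    # Scaled dot-product attention from each instrument query to the anatomy priors.
    dim = instrument_queries.shape[-1]
    scores = instrument_queries @ anatomy_priors.T / np.sqrt(dim)
    # Row-wise softmax over anatomy classes.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Target context: a prior-weighted mixture per instrument query.
    target_context = weights @ anatomy_priors
    # Residual fusion: queries keep instrument identity and gain target information.
    return instrument_queries + target_context

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 16))  # 5 instrument instance queries
priors = rng.normal(size=(6, 16))   # 6 weak anatomy-prior embeddings
fused = target_aware_fusion(queries, priors)
print(fused.shape)  # (5, 16)
```

In this sketch the fused queries could then feed the verb and target classification heads, so that each predicted mask carries its own <instrument, verb, target> label.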