🤖 AI Summary
Surgical triplet recognition (instrument-verb-target) faces two coupled challenges: inter-task representation entanglement and intra-task class imbalance, both exacerbated by long-tailed label distributions. The proposed MLLM-Engaged Joint Optimization (MEJO) framework addresses the first with a Shared-Specific-Disentangled (S$^2$D) learning scheme that separates task-shared from task-specific features to mitigate cross-task optimization conflicts; the shared representations are further enriched by a Multimodal Large Language Model (MLLM)-powered probabilistic prompt pool, while distinct spatial-temporal task prompts capture task-specific cues. To address the second, a Coordinated Gradient Learning (CGL) strategy rebalances the positive and negative gradients arising from head and tail classes during backpropagation, alleviating class bias. Evaluated on CholecT45 and CholecT50, the method markedly improves recognition of rare classes (+4.2% mAP on CholecT45 and +3.8% on CholecT50 over the prior state of the art), demonstrating stronger multi-task joint optimization and robustness to long-tailed distributions.
📝 Abstract
Surgical triplet recognition, which involves identifying the instrument, verb, target, and their combinations, is a complex surgical scene understanding challenge plagued by long-tailed data distributions. The mainstream multi-task learning paradigm, which benefits from cross-task collaboration, has shown promising performance in identifying triplets, but two key challenges remain: 1) inter-task optimization conflicts caused by entangled task-generic and task-specific representations; 2) intra-task optimization conflicts due to class-imbalanced training data. To overcome these difficulties, we propose the MLLM-Engaged Joint Optimization (MEJO) framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. To enhance task-shared representations, we construct a Multimodal Large Language Model (MLLM)-powered probabilistic prompt pool that dynamically augments visual features with expert-level semantic cues. Additionally, comprehensive task-specific cues are modeled via distinct task prompts covering the spatial-temporal dimensions, effectively mitigating inter-task ambiguities. To tackle intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive and negative gradients originating from head and tail classes for more coordinated learning behaviors. Extensive experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of our proposed framework, validating its effectiveness in handling optimization conflicts.
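To make the gradient-rebalancing idea behind CGL concrete, here is a minimal sketch of per-class reweighting of the binary cross-entropy gradient for long-tailed multi-label recognition. It shrinks the negative-sample gradients on rare (tail) classes so their scarce positives are not overwhelmed. The function name, the frequency-based weight `neg_w`, and the temperature `tau` are illustrative assumptions; this is a generic instance of positive-negative gradient rebalancing, not the paper's exact CGL formulation.

```python
import numpy as np

def rebalanced_bce_grad(logits, targets, class_freq, tau=1.0):
    """Illustrative rebalanced BCE gradient for long-tailed multi-label data.

    logits, targets: arrays of shape (batch, num_classes); targets in {0, 1}.
    class_freq: per-class positive counts in the training set, shape (num_classes,).
    Returns the reweighted gradient of BCE w.r.t. the logits.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))          # sigmoid
    grad = probs - targets                          # plain BCE gradient w.r.t. logits
    # Negative-sample weight shrinks toward 0 for rare (tail) classes,
    # so frequent "absent" labels do not drown out rare positives.
    neg_w = (class_freq / class_freq.max()) ** tau  # in (0, 1], 1 for the head class
    weights = np.where(targets > 0.5, 1.0, neg_w)   # positives always keep weight 1
    return grad * weights
```

For example, with one head class (100 positives) and one tail class (1 positive), a negative sample at logit 0 contributes a gradient of 0.5 on the head class but only 0.005 on the tail class, while positive gradients are left untouched.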