🤖 AI Summary
Surgical triplet recognition (instrument-verb-target) faces two coupled challenges: inter-task representation entanglement and intra-task class imbalance, both exacerbated by long-tailed label distributions. The proposed MLLM-Engaged Joint Optimization (MEJO) framework addresses the first with a Shared-Specific-Disentangled (S$^2$D) learning scheme that separates task-shared from task-specific features to mitigate cross-task optimization conflicts; the shared representations are further enriched by a Multimodal Large Language Model (MLLM)-powered probabilistic prompt pool, while distinct spatial-temporal task prompts capture task-specific cues. To address the second, a Coordinated Gradient Learning (CGL) strategy rebalances the positive and negative gradients arising from head and tail classes during backpropagation, alleviating class bias. Evaluated on CholecT45 and CholecT50, the method markedly improves recognition of rare classes (+4.2% mAP on CholecT45 and +3.8% on CholecT50 over the prior state of the art), demonstrating stronger multi-task joint optimization and robustness to long-tailed distributions.
📝 Abstract
Surgical triplet recognition, which involves identifying the instrument, verb, target, and their combinations, is a complex surgical scene understanding challenge plagued by long-tailed data distributions. The mainstream multi-task learning paradigm, which benefits from cross-task collaboration, has shown promising performance in identifying triplets, but two key challenges remain: 1) inter-task optimization conflicts caused by entangled task-generic and task-specific representations; 2) intra-task optimization conflicts due to class-imbalanced training data. To overcome these difficulties, we propose the MLLM-Engaged Joint Optimization (MEJO) framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. To enhance task-shared representations, we construct a Multimodal Large Language Model (MLLM)-powered probabilistic prompt pool that dynamically augments visual features with expert-level semantic cues. Additionally, comprehensive task-specific cues are modeled via distinct task prompts covering the spatial-temporal dimensions, effectively mitigating inter-task ambiguities. To tackle intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive and negative gradients originating from head and tail classes for more coordinated learning behaviors. Extensive experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of our proposed framework, validating its effectiveness in handling optimization conflicts.
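To make the gradient-rebalancing idea behind CGL concrete, here is a minimal sketch of per-class reweighting of the binary cross-entropy gradient for long-tailed multi-label recognition. It shrinks the negative-sample gradients on rare (tail) classes so their scarce positives are not overwhelmed. The function name, the frequency-based weight `neg_w`, and the temperature `tau` are illustrative assumptions; this is a generic instance of positive-negative gradient rebalancing, not the paper's exact CGL formulation.

```python
import numpy as np

def rebalanced_bce_grad(logits, targets, class_freq, tau=1.0):
    """Illustrative rebalanced BCE gradient for long-tailed multi-label data.

    logits, targets: arrays of shape (batch, num_classes); targets in {0, 1}.
    class_freq: per-class positive counts in the training set, shape (num_classes,).
    Returns the reweighted gradient of BCE w.r.t. the logits.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))          # sigmoid
    grad = probs - targets                          # plain BCE gradient w.r.t. logits
    # Negative-sample weight shrinks toward 0 for rare (tail) classes,
    # so frequent "absent" labels do not drown out rare positives.
    neg_w = (class_freq / class_freq.max()) ** tau  # in (0, 1], 1 for the head class
    weights = np.where(targets > 0.5, 1.0, neg_w)   # positives always keep weight 1
    return grad * weights
```

For example, with one head class (100 positives) and one tail class (1 positive), a negative sample at logit 0 contributes a gradient of 0.5 on the head class but only 0.005 on the tail class, while positive gradients are left untouched.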