fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models

📅 2025-03-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing CLIP-based models rely on global image features, limiting zero-shot fine-grained surgical action triplet (subject–verb–object) recognition—particularly in generalizing to unseen anatomical structures and instrument–verb combinations. To address this, we propose a triplet-aware vision–language learning framework: (i) hierarchical prompt modeling, explicitly encoding semantic hierarchies of subject, verb, and object; (ii) LoRA-driven adaptive fine-tuning of the visual backbone to enhance object-centric, fine-grained representations; and (iii) graph-structured patch clustering distillation, jointly consolidating anatomical, instrumental, and action-related features. We introduce the first benchmark enabling dual-axis zero-shot generalization—across both anatomical targets and instrument–verb pairs. Evaluated on CholecT50, our method achieves significant improvements in F1-score and mean average precision (mAP), marking the first demonstration of robust zero-shot recognition for previously unseen anatomical regions and instrument–verb compositions.
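The hierarchical prompt modeling described above can be sketched as follows. This is a hypothetical illustration of how prompts at the component, pair, and triplet levels might encode the subject–verb–object hierarchy; the template strings and function name are assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch of hierarchical prompt modeling for a surgical
# action triplet <instrument, verb, target>. Component- and pair-level
# prompts share semantics across triplets, which is what lets the model
# generalize to unseen compositions. Templates are illustrative only.

def build_hierarchical_prompts(instrument, verb, target):
    """Return text prompts at three semantic levels for one triplet."""
    return {
        # Component level: shared by every triplet using this part
        "instrument": f"a photo of a {instrument} in surgery",
        "verb": f"a surgical action of {verb}",
        "target": f"a photo of the {target} anatomy",
        # Pair level: instrument-verb interaction
        "instrument_verb": f"a {instrument} {verb} tissue",
        # Triplet level: full fine-grained description
        "triplet": f"a {instrument} {verb} the {target}",
    }

prompts = build_hierarchical_prompts("grasper", "retracting", "gallbladder")
```

Each level's prompt would then be embedded by the CLIP text encoder, so that a novel triplet can still match on its shared component- and pair-level semantics.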

📝 Abstract
While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not exploit the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
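The LoRA-based backbone adaptation mentioned in the abstract can be sketched in a few lines. This is a minimal, generic low-rank adaptation of a single linear layer (shapes, rank, and scaling are assumed values, not the paper's configuration): the frozen pretrained weight W gets a trainable low-rank update B @ A, scaled by alpha / r.

```python
import numpy as np

# Minimal LoRA sketch (assumed shapes and hyperparameters, not the
# paper's code). The effective weight is W + (alpha / r) * B @ A;
# only the small matrices A and B would be trained, W stays frozen.

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, init to zero

def lora_forward(x):
    # Because B starts at zero, the adapted layer is initially
    # identical to the frozen pretrained layer.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(2, d_in))
assert np.allclose(lora_forward(x), x @ W.T)  # identity at initialization
```

Applied to the attention projections of a CLIP vision transformer, this keeps the pretrained features intact while adding a small number of trainable parameters for the fine-grained surgical domain.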
Problem

Research questions and friction points this paper is trying to address.

Improving zero-shot recognition of fine-grained surgical action triplets
Addressing limitations of global image features in CLIP models
Enhancing generalization to novel instrument-verb interactions in surgery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical prompt modeling captures shared semantics
LoRA-based vision backbone enhances feature extraction
Graph-based condensation groups patch features
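The graph-based condensation listed above can be illustrated with a simple similarity-graph clustering over patch features. This is an assumed sketch (the threshold, connected-components grouping, and mean pooling are illustrative choices, not the paper's exact algorithm): patches with similar features are linked in a graph, and each connected component is condensed into one object-level feature.

```python
import numpy as np

# Illustrative sketch of graph-structured patch condensation (assumed
# threshold and pooling, not the paper's algorithm): connect patches
# whose features have cosine similarity above a threshold, then average
# each connected component into a single object-cluster feature.

def condense_patches(feats, thresh=0.9):
    """feats: (num_patches, dim) array -> (clusters, labels)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    adj = (f @ f.T) >= thresh                       # similarity graph
    n = len(feats)
    label = [-1] * n
    clusters = []
    for i in range(n):                              # connected components (DFS)
        if label[i] != -1:
            continue
        comp, stack = [], [i]
        label[i] = len(clusters)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in range(n):
                if adj[u, v] and label[v] == -1:
                    label[v] = label[i]
                    stack.append(v)
        clusters.append(feats[comp].mean(axis=0))   # one feature per object cluster
    return np.stack(clusters), label

feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
clusters, labels = condense_patches(feats)          # two clusters: [1,0] and [0,1]
```

Condensing patches this way yields object-centric features (e.g. one per instrument or anatomical structure) instead of a single global image embedding.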