CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Surgical video understanding is hindered by the scarcity of annotated real-world video and by large domain shifts between synthetic and real data. This work proposes CauCLIP, a causality-inspired vision-language framework built on CLIP that learns domain-invariant representations without access to target-domain data. The method combines a frequency-based augmentation, which perturbs domain-specific attributes while preserving semantic structure, with a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features, so that the model focuses on the stable causal factors underlying surgical workflows. On the SurgVisDom hard adaptation benchmark, CauCLIP is reported to substantially outperform all competing approaches on surgical phase recognition.
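
The summary does not spell out how the frequency-based augmentation is computed. As a rough illustration only, the sketch below shows one common way to perturb appearance while keeping structure: blend the Fourier amplitude spectrum of a frame with that of a reference frame and keep the original phase. The function name `perturb_amplitude`, the `reference` frame, and the blend weight `alpha` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def perturb_amplitude(image, reference, alpha=0.5):
    """Blend the Fourier amplitude of `image` with that of `reference`.

    Hypothetical helper, not the paper's implementation. The phase of
    `image` (which carries most of the spatial structure) is kept, while
    the amplitude (which largely reflects appearance) is perturbed.
    Both inputs are float arrays of identical shape (H, W, C).
    """
    # 2D FFT over the spatial axes, applied independently per channel.
    fft_img = np.fft.fft2(image, axes=(0, 1))
    fft_ref = np.fft.fft2(reference, axes=(0, 1))

    amp_img, phase_img = np.abs(fft_img), np.angle(fft_img)
    amp_ref = np.abs(fft_ref)

    # alpha = 0 returns the original frame; alpha = 1 uses only the
    # reference amplitude together with the original phase.
    amp_mix = (1.0 - alpha) * amp_img + alpha * amp_ref

    # Recombine the mixed amplitude with the original phase and invert.
    fft_mix = amp_mix * np.exp(1j * phase_img)
    return np.real(np.fft.ifft2(fft_mix, axes=(0, 1)))
```

In a setup like the one described, `image` might be a synthetic frame and `reference` another training frame, with `alpha` sampled at random per example; that pairing is a guess about usage, not a description of the authors' pipeline.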

📝 Abstract
Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
Problem

Research questions and friction points this paper is trying to address.

surgical phase recognition
sim-to-real gap
domain gap
surgical video understanding
domain generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

causality
vision-language modeling
domain generalization
surgical phase recognition
frequency-based augmentation