CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Surgical video understanding is hindered by scarce real-world annotations and by substantial domain shift between synthetic and real data. This work proposes CauCLIP, a causality-inspired vision-language framework built on CLIP that learns domain-invariant representations for surgical phase recognition without access to target-domain data. The method combines a frequency-based augmentation, which perturbs domain-specific attributes while preserving semantic structure, with a causal suppression loss that mitigates non-causal biases, so that the model focuses on the stable causal factors underlying surgical workflows. On the SurgVisDom hard adaptation benchmark, CauCLIP substantially outperforms all competing approaches, supporting the use of causality-guided vision-language modeling for domain-generalizable surgical phase recognition.

📝 Abstract

Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
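
The abstract does not spell out how the frequency-based augmentation is computed. As a rough illustration only, the sketch below assumes one common way to perturb domain-specific appearance while keeping semantic structure: blend the low-frequency Fourier amplitude of a frame with that of a reference frame and keep the original phase. The function name `amplitude_mix` and the parameters `beta` and `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def amplitude_mix(src, ref, beta=0.1, lam=1.0):
    """Blend the low-frequency amplitude of `src` toward `ref`, keeping the phase of `src`.

    Assumption (not from the paper): style-like, domain-specific cues live mostly in
    the low-frequency amplitude, while the phase carries semantic structure.
    src, ref: float arrays of identical shape (H, W, C).
    beta:     fraction of each spatial axis covered by the low-frequency window.
    lam:      blend weight; 0 keeps `src` unchanged, 1 uses the amplitude of `ref`.
    """
    fft_src = np.fft.fft2(src, axes=(0, 1))
    fft_ref = np.fft.fft2(ref, axes=(0, 1))

    # Split into amplitude and phase; only the amplitude is perturbed.
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_ref = np.abs(fft_ref)

    # Move the zero frequency to the centre so the window is a simple slice.
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_ref = np.fft.fftshift(amp_ref, axes=(0, 1))

    h, w = src.shape[:2]
    bh, bw = int(h * beta / 2), int(w * beta / 2)
    ch, cw = h // 2, w // 2
    rows = slice(ch - bh, ch + bh + 1)
    cols = slice(cw - bw, cw + bw + 1)

    # Blend amplitudes inside the window, which is symmetric around zero frequency.
    amp_src[rows, cols] = (1.0 - lam) * amp_src[rows, cols] + lam * amp_ref[rows, cols]

    # Undo the shift, recombine with the original phase, and return to pixel space.
    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))
    mixed = amp_src * np.exp(1j * pha_src)
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))
```

In a training loop, `src` would be a synthetic frame and `ref` another frame whose appearance statistics differ. A window small enough that it does not reach the edge of the spectrum (a small `beta`) keeps the mixed spectrum conjugate-symmetric, so the result stays real-valued up to rounding error. How the authors pair the augmented views with their causal suppression loss is not described in the abstract and is not modelled here.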

Problem

Research questions and friction points this paper is trying to address.

surgical phase recognition

sim-to-real gap

domain gap

surgical video understanding

domain generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

causality

vision-language modeling

domain generalization