🤖 AI Summary
Surgical phase recognition models suffer from insufficient robustness under domain shift and quality degradation. To address this, this paper introduces the digital twin (DT) paradigm to surgical phase recognition for the first time, proposing a DT representation framework that decouples high-level semantics from low-level visual features. Leveraging vision foundation models (SAM and DINO), the framework constructs causally grounded, semantically consistent DT representations that replace or augment raw video inputs. The method significantly enhances out-of-distribution generalization and robustness under severe degradation: it achieves 80.3% video-level accuracy (+3.9 points over the baseline) on a highly corrupted Cholec80 test set, 67.9% (+16.8) on the challenging CRCD dataset, and 99.8% (+90.9) on an internal robotic surgery dataset, substantially outperforming all baselines. These results validate the effectiveness and broad applicability of DT representations for medical video understanding.
📝 Abstract
Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict the surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness due to non-causal associations in the training set. Our goal is to improve model robustness to variations in surgical videos by leveraging the digital twin (DT) paradigm: an intermediary layer that separates high-level analysis (SPR) from low-level processing. As a proof of concept, we present a DT representation-based framework for SPR from videos. The framework employs vision foundation models with reliable low-level scene understanding to craft the DT representation. We embed the DT representation in place of raw video inputs in a state-of-the-art SPR model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. In contrast to the vulnerability of the baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 80.3% on a highly corrupted Cholec80 test set, 67.9% on the challenging CRCD dataset, and 99.8% on an internal robotic surgery dataset, outperforming the baseline by 3.9, 16.8, and 90.9 points, respectively. We also find that using the DT representation as an augmentation of the raw input can significantly improve model robustness. Our findings support the thesis that DT representations are effective in enhancing model robustness. Future work will seek to improve feature informativeness and incorporate interpretability for a more comprehensive framework.
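To make the pipeline concrete, the sketch below illustrates the core idea of the abstract: per-frame low-level scene understanding (segmentation plus feature extraction) produces a compact DT representation that is fed to the phase-recognition model instead of raw pixels. The functions `segment_scene` and `pooled_features` are hypothetical stand-ins for the paper's SAM- and DINO-based components (toy NumPy computations here, not the real models), and the array shapes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def segment_scene(frame):
    """Hypothetical stand-in for a SAM-style segmenter: returns a binary
    foreground mask per frame (here, a toy intensity threshold)."""
    return (frame.mean(axis=-1) > 0.5).astype(np.float32)

def pooled_features(frame, mask, dim=8):
    """Hypothetical stand-in for DINO-style features, pooled over the
    segmented region and projected to a fixed-size vector."""
    rng = np.random.default_rng(0)            # fixed projection for the sketch
    proj = rng.standard_normal((frame.shape[-1], dim))
    pooled = (frame * mask[..., None]).reshape(-1, frame.shape[-1]).mean(axis=0)
    return pooled @ proj

def dt_representation(video):
    """Replace raw frames with per-frame DT feature vectors: the SPR model
    then consumes this (T, dim) sequence instead of (T, H, W, C) pixels."""
    return np.stack([pooled_features(f, segment_scene(f)) for f in video])

# Toy video: 16 frames of 32x32 RGB in [0, 1].
video = np.random.default_rng(1).random((16, 32, 32, 3))
dt = dt_representation(video)
print(dt.shape)  # (16, 8): one compact DT vector per frame
```

The design point is the decoupling itself: because the downstream phase recognizer sees only the structured representation, pixel-level corruptions or domain shifts that the low-level models handle robustly never reach it. The same `dt_representation` output could also be concatenated with raw-input features to realize the augmentation variant mentioned in the abstract.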