🤖 AI Summary
Surgical video understanding faces three major challenges that hinder the clinical deployment of deep learning models: scarce high-quality annotations, complex spatiotemporal modeling, and cross-institutional domain shift. To address these, we propose three novel semi-supervised frameworks, DIST, SemiVT-Surge, and ENCORE, which integrate dynamic pseudo-label generation, multi-task collaborative learning, and domain adaptation mechanisms, substantially reducing reliance on fully annotated data. We also construct and publicly release two high-quality, multi-task surgical video benchmarks: GynSurg and Cataract-1K. Our methods achieve state-of-the-art performance on phase recognition, action segmentation, and event detection across multiple cross-center datasets, with markedly improved generalizability and robustness under domain shift. Together, this work advances the reproducibility, scalability, and clinical deployability of surgical AI systems.
📝 Abstract
Advances in surgical video analysis are transforming operating rooms into intelligent, data-driven environments. Computer-assisted systems support the full surgical workflow, from preoperative planning to intraoperative guidance and postoperative assessment. However, developing robust and generalizable models for surgical video understanding remains challenging due to (I) annotation scarcity, (II) spatiotemporal complexity, and (III) the domain gap across procedures and institutions. This doctoral research aims to bridge the gap between deep learning-based surgical video analysis in research and its real-world clinical deployment. To address the core challenge of recognizing surgical phases, actions, and events, which are critical for downstream analysis, I benchmarked state-of-the-art neural network architectures to identify the most effective designs for each task, and further improved performance by proposing novel architectures and integrating advanced modules. Given the high cost of expert annotations and the domain gap across surgical video sources, I focused on reducing reliance on labeled data. We introduced novel semi-supervised frameworks, including DIST, SemiVT-Surge, and ENCORE, that leverage large amounts of unlabeled surgical video and dynamic pseudo-labeling to achieve state-of-the-art results on challenging surgical datasets from minimal labeled data. To support reproducibility and advance the field, we released two multi-task datasets: GynSurg, the largest gynecologic laparoscopy dataset, and Cataract-1K, the largest cataract surgery video dataset. Together, this work contributes robust, data-efficient, and clinically scalable solutions for surgical video analysis, laying the foundation for generalizable AI systems that can meaningfully impact surgical care and training.