AI Summary
This work addresses the limited generalization of existing AI models for minimally invasive surgery, which stems from variation in surgical procedures and practitioners across institutions. To overcome this, the authors propose ZEN, the first intraoperative video foundation model to provide unified representations across multiple surgical procedures. Trained on a diverse dataset of over four million frames spanning more than 21 procedure types, ZEN uses a self-supervised multi-teacher distillation framework and introduces a standardized benchmark for downstream evaluation. It achieves state-of-the-art performance on 20 downstream tasks, consistently outperforming current methods in zero-shot, few-shot, frozen-backbone, and full fine-tuning settings. These results demonstrate ZEN's strong cross-procedure and cross-institution generalization, establishing a new paradigm for intraoperative assistance and intelligent surgical training evaluation.
Abstract
In minimally invasive surgery, clinical decisions depend on real-time visual interpretation, yet intraoperative perception varies substantially across surgeons and procedures. This variability limits consistent assessment, training, and the development of reliable artificial intelligence systems, as most surgical AI models are designed for narrowly defined tasks and do not generalize across procedures or institutions. Here we introduce ZEN, a generalizable foundation model for intraoperative surgical video understanding, trained on more than 4 million frames from over 21 procedures using a self-supervised multi-teacher distillation framework. We curated a large and diverse dataset and systematically evaluated multiple representation learning strategies within a unified benchmark. Across 20 downstream tasks and full fine-tuning, frozen-backbone, few-shot, and zero-shot settings, ZEN consistently outperforms existing surgical foundation models and demonstrates robust cross-procedure generalization. These results suggest a step toward unified representations for surgical scene understanding and support future applications in intraoperative assistance and surgical training assessment.
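The abstract mentions a self-supervised multi-teacher distillation framework but does not specify its loss. As a rough illustration of the general idea, a student backbone can be trained to match the embeddings of several frozen teacher models for the same frame. The following NumPy sketch is an assumption for illustration only: the function names, the cosine-similarity loss, and the uniform teacher weighting are not taken from the paper.

```python
# Minimal sketch of multi-teacher feature distillation (NumPy only).
# The actual ZEN training recipe, teachers, and loss are unspecified here;
# everything below is an illustrative assumption.
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize a vector to unit length for cosine comparison."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def multi_teacher_distill_loss(student_feat, teacher_feats, weights=None):
    """Weighted average of (1 - cosine similarity) between the student
    embedding and each frozen teacher's embedding for the same frame."""
    s = l2_normalize(student_feat)
    losses = [1.0 - float(np.dot(s, l2_normalize(t))) for t in teacher_feats]
    if weights is None:
        weights = [1.0 / len(losses)] * len(losses)  # uniform by default
    return sum(w * l for w, l in zip(weights, losses))

# Toy usage: one student embedding distilled from two hypothetical teachers.
rng = np.random.default_rng(0)
student = rng.normal(size=128)
teachers = [rng.normal(size=128) for _ in range(2)]
loss = multi_teacher_distill_loss(student, teachers)
```

In a full training loop this scalar would be minimized over batches of frames, with each teacher kept frozen and only the student updated.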