🤖 AI Summary
To improve the robustness of surgical phase recognition in Endoscopic Submucosal Dissection (ESD), this paper proposes Clinical Prior Knowledge-Constrained Diffusion (CPKD), a diffusion-based framework that integrates clinical prior knowledge. Unlike prevailing multi-stage iterative optimization approaches, the method formulates phase recognition as an end-to-end denoising generation task. It uses joint visual-temporal encoding to extract discriminative features, applies a conditional masking strategy to explicitly model positional priors, boundary ambiguity, and relation dependency, and uses clinical prior knowledge to guide training, which improves logical consistency and the ability to correct phase errors. Evaluated on ESD820, Cholec80, and multiple external multi-center datasets, the method achieves performance superior or comparable to state-of-the-art approaches. To the best of the authors' knowledge, this is the first work to demonstrate the effectiveness and generalizability of generative diffusion models for surgical phase recognition.
📝 Abstract
Gastrointestinal malignancies are a leading cause of cancer-related mortality worldwide, and the prognosis for advanced-stage disease remains particularly poor. Originally developed as a technique for treating early gastric cancer, Endoscopic Submucosal Dissection (ESD) has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that recasts phase recognition as a denoising diffusion process while preserving the core philosophy of iterative refinement. The architecture progressively reconstructs phase sequences, starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, namely positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into model training to improve the model's ability to correct logical errors in predicted phase sequences. Comprehensive evaluations on ESD820, Cholec80, and external multi-center datasets demonstrate that CPKD achieves performance superior or comparable to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.
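To make the idea of reconstructing a phase sequence from random noise more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a cosine noise schedule and a deterministic DDIM-style reverse update, neither of which is stated in the abstract. The function `toy_denoiser` is a hypothetical stand-in for the learned decoder, `features` stands for per-frame class logits from a visual-temporal encoder, and `mask` is a simplified illustration of conditional masking that hides the features of selected frames.

```python
import numpy as np

def cosine_alpha_bar(num_steps):
    # Assumed cosine schedule: cumulative signal retention, 1.0 at t=0, near 0 at t=num_steps.
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return f / f[0]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smooth(x):
    # 3-frame moving average along time, with edge padding to keep the length.
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def toy_denoiser(noisy_seq, features, mask):
    # Hypothetical stand-in for a learned decoder: predicts the clean
    # per-frame phase distribution from the visible (unmasked) features
    # plus a weak, temporally smoothed contribution from the noisy sequence.
    logits = mask * features + 0.1 * smooth(noisy_seq)
    return softmax(logits)

def sample_phases(features, mask, num_steps=8, seed=0):
    # Reverse process: start from Gaussian noise of shape (frames, phases)
    # and refine it step by step into a phase-probability sequence.
    rng = np.random.default_rng(seed)
    alpha_bar = cosine_alpha_bar(num_steps)
    x = rng.standard_normal(features.shape)
    for t in range(num_steps, 0, -1):
        x0_pred = toy_denoiser(x, features, mask)
        # Noise implied by the current sample and the predicted clean sequence.
        eps_pred = (x - np.sqrt(alpha_bar[t]) * x0_pred) / np.sqrt(1.0 - alpha_bar[t])
        # Deterministic DDIM-style move to the previous (less noisy) step.
        x = np.sqrt(alpha_bar[t - 1]) * x0_pred + np.sqrt(1.0 - alpha_bar[t - 1]) * eps_pred
    return x

# Usage with synthetic data: 12 frames, 3 phases, all frame features visible.
true_phases = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
features = 8.0 * np.eye(3)[true_phases]
mask = np.ones((12, 1))
predicted = sample_phases(features, mask).argmax(axis=1)
print(predicted)
```

Setting rows of `mask` to zero removes the conditioning for those frames, so the sketch has to fall back on the noisy sequence alone there. In a trained model, masking of this kind is one way a network could be pushed to rely on positional and relational structure rather than on per-frame appearance.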