Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Surgical environments are highly heterogeneous across institutions, and the high cost of expert annotation severely limits model generalizability in cross-institutional settings. Method: We propose SPA (Surgical Phase Anywhere), a lightweight adaptation framework that enables zero-code, customizable deployment via natural-language phase definitions, minimal image annotation (as few as 32 labeled samples), and task-graph priors. SPA models temporal structure with a task-graph-guided diffusion model and introduces a dynamic test-time adaptation mechanism driven by multi-modal prediction agreement to enhance robustness. It integrates vision-language foundation models, few-shot spatial alignment, and self-supervised cross-domain adaptation. Contribution/Results: SPA achieves state-of-the-art few-shot surgical phase recognition across multi-center, multi-procedure benchmarks, outperforming even fully supervised full-shot baselines. It substantially reduces clinical annotation burden while improving practical deployability and cross-institutional generalization.

📝 Abstract
The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA
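The few-shot spatial adaptation step can be pictured as aligning a handful of labeled frame embeddings with the natural-language phase embeddings. Below is a minimal, hypothetical sketch, not the SPA implementation: random vectors stand in for the frozen vision-language encoders, and all names (`text_emb`, `fused_proto`, `predict`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, PHASES, SHOTS = 16, 3, 4  # toy embedding size, phase count, shots per phase

# Stand-ins for frozen vision-language encoder outputs (assumption, not real model calls):
text_emb = rng.normal(size=(PHASES, DIM))             # one text embedding per phase definition
support_imgs = rng.normal(size=(PHASES, SHOTS, DIM))  # a few labeled frames per phase

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Few-shot spatial adaptation, simplified: average the labeled image embeddings
# into visual prototypes, then fuse them with the text embeddings.
visual_proto = l2norm(support_imgs.mean(axis=1))
fused_proto = l2norm(0.5 * l2norm(text_emb) + 0.5 * visual_proto)

def predict(frame_emb):
    """Nearest fused prototype by cosine similarity -> phase index."""
    sims = l2norm(frame_emb) @ fused_proto.T
    return int(np.argmax(sims))

query = 3.0 * fused_proto[1]  # a frame whose embedding points along phase 1's prototype
print(predict(query))  # → 1
```

The design point is that only the prototypes depend on institution-specific data; the encoders stay frozen, which is what keeps the adaptation lightweight.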
Problem

Research questions and friction points this paper is trying to address.

Adapting surgical foundation models to diverse institutional settings with minimal annotation
Ensuring temporal consistency in surgical phase recognition using diffusion modeling
Enhancing reliability under test-time distribution shifts via dynamic adaptation
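The paper encodes the task-graph prior through a diffusion model. As a much simpler, hypothetical illustration of the underlying idea, that a transition graph can enforce temporal consistency, the sketch below runs Viterbi decoding over per-frame phase probabilities and forbids transitions absent from a toy task graph; the graph and probabilities are invented for the example.

```python
import numpy as np

# Toy task graph for three phases: allowed transitions (self-loops included).
ALLOWED = {0: {0, 1}, 1: {1, 2}, 2: {2}}
P = len(ALLOWED)

def refine(frame_probs):
    """Viterbi decoding that forbids transitions absent from the task graph."""
    T = len(frame_probs)
    logp = np.log(np.asarray(frame_probs) + 1e-9)
    score = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)
    score[0] = logp[0]
    for t in range(1, T):
        for j in range(P):
            preds = [i for i in range(P) if j in ALLOWED[i] and score[t - 1, i] > -np.inf]
            if not preds:
                continue  # phase j is unreachable at time t under the graph
            best = max(preds, key=lambda i: score[t - 1, i])
            score[t, j] = score[t - 1, best] + logp[t, j]
            back[t, j] = best
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Raw per-frame argmax would give 0 -> 2 -> 1 -> 2, which the graph forbids.
probs = [[0.9, 0.05, 0.05],
         [0.1, 0.2, 0.7],   # noisy frame
         [0.1, 0.8, 0.1],
         [0.05, 0.15, 0.8]]
print(refine(probs))  # graph-consistent path: [0, 1, 1, 2]
```

The noisy second frame is overruled because reaching phase 2 that early would block the later, higher-confidence evidence for phase 1.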
Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-shot spatial adaptation for multi-modal alignment
Diffusion modeling for temporal consistency
Dynamic test-time adaptation for self-supervised learning
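The dynamic test-time adaptation exploits mutual agreement between multi-modal prediction streams. The sketch below is a loose, hypothetical simplification rather than the paper's mechanism: a text head and a prototype head each classify a frame, and where they agree, the matching prototype is nudged toward that frame embedding, adapting to the test video without labels. All names and the update rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, PHASES = 8, 3

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Two hypothetical prediction streams over a test video's frame embeddings
# (stand-ins for the paper's multi-modal streams):
text_head = l2norm(rng.normal(size=(PHASES, DIM)))
proto_head = l2norm(text_head + 0.3 * rng.normal(size=(PHASES, DIM)))

def adapt(frames, proto, lr=0.5):
    """One self-supervised pass: where both streams agree on a frame's
    phase, pull that phase's prototype toward the frame embedding."""
    frames = l2norm(frames)
    for f in frames:
        a = int(np.argmax(f @ text_head.T))
        b = int(np.argmax(f @ proto.T))
        if a == b:  # mutual agreement -> trust the pseudo-label
            proto[a] = l2norm(proto[a] + lr * f)
    return proto

test_frames = rng.normal(size=(20, DIM))
adapted = adapt(test_frames, proto_head.copy())
```

Frames where the streams disagree are simply skipped, which is what makes the pseudo-labels conservative enough to use under distribution shift.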