Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

📅 2025-10-06
🤖 AI Summary
This study addresses cross-center generalization in surgical video classification, specifically the classification of appendicitis inflammation stages, under federated learning (FL) settings where data remain decentralized across clinical sites. Method: The FedSurg challenge establishes the first FL benchmark for surgical video classification. Submitted approaches included a ViViT-based model, foundation models with linear probing, and metric learning with triplet loss, combined with FL aggregation strategies including FedAvg, FedMedian, and FedSAM. Submissions were evaluated in a multi-center setup on two tasks: generalization to an unseen center and center-specific adaptation after local fine-tuning. Contribution/Results: The ViViT-based submission achieved the strongest overall performance, and all teams improved after fine-tuning; however, generalization to the unseen center remained limited and ranking stability was low, pointing to bottlenecks including class imbalance and hyperparameter tuning in decentralized training. The findings highlight how architecture choice, preprocessing, and loss design shape the trade-off between global robustness and local personalization, and the benchmark offers a reference point for future FL methods on surgical video data.
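
As an illustration of the triplet loss mentioned above, the following is a minimal sketch of the standard margin-based formulation on plain Python lists. It is not taken from any FedSurg submission: the function names, the Euclidean distance, and the margin value are assumptions chosen for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings of video clips (illustrative values only).
anchor = [0.0, 0.0]
same_stage = [0.0, 1.0]   # clip with the same label as the anchor
other_stage = [3.0, 4.0]  # clip with a different label

easy = triplet_loss(anchor, same_stage, other_stage)  # negative is already far away
hard = triplet_loss(anchor, other_stage, same_stage)  # roles swapped: violates the margin
```

In a metric-learning setup, such a loss would be minimized over many sampled triplets so that clips of the same class cluster together in embedding space.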

📝 Abstract
Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
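
The two evaluation metrics named in the abstract can be sketched as follows. This is a hedged illustration, not the challenge's evaluation code: macro averaging of F1, the three-class label set, and the cost matrix (cost equal to the distance between stage indices) are all assumptions, since the abstract does not specify them. Expected Cost is computed here as the average cost per sample.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never present and never predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def expected_cost(y_true, y_pred, cost):
    """Average cost per sample, where cost[t][p] is the cost of predicting p for true class t."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced toy labels for three inflammation stages (0, 1, 2).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]

# Hypothetical cost matrix: confusing distant stages costs more than adjacent ones.
cost = [[abs(t - p) for p in range(3)] for t in range(3)]

f1 = macro_f1(y_true, y_pred, labels=[0, 1, 2])
ec = expected_cost(y_true, y_pred, cost)
```

Macro F1 weights every class equally regardless of frequency, while a cost matrix lets an evaluation penalize some confusions more than others; both properties are relevant when classes are imbalanced.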
Problem

Research questions and friction points this paper is trying to address.

Benchmarking federated learning for surgical video classification across clinical centers
Assessing generalization to unseen centers and local adaptation via fine-tuning
Developing collaborative AI models without sharing sensitive patient data
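
One common way to probe how stable a benchmark ranking is, in the spirit of the bootstrapping mentioned in the abstract, is to resample the test cases with replacement and count how often each team comes out on top. The sketch below makes several assumptions: the team names and predictions are invented, accuracy stands in for the challenge metrics, and the challenge's actual procedure may differ.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of correctly classified cases."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def bootstrap_win_frequency(y_true, team_preds, metric, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which each team has the best score.

    Ties go to the team listed first in `team_preds`.
    """
    rng = random.Random(seed)
    n = len(y_true)
    wins = {team: 0 for team in team_preds}
    for _ in range(n_boot):
        # Resample case indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        scores = {
            team: metric(yt, [preds[i] for i in idx])
            for team, preds in team_preds.items()
        }
        wins[max(scores, key=scores.get)] += 1
    return {team: count / n_boot for team, count in wins.items()}


# Hypothetical test labels and predictions from two invented teams.
y_true = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
team_preds = {
    "team_a": [0, 1, 0, 1, 0, 0, 1, 0, 1, 1],
    "team_b": [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
}

freq = bootstrap_win_frequency(y_true, team_preds, accuracy)
```

Win frequencies close to an even split suggest that the ranking depends heavily on which test cases happened to be sampled, which is one way low ranking stability can show up.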
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated learning for surgical video classification
ViViT model with spatiotemporal analysis
FedAvg aggregation without sharing patient data
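
To make the aggregation step concrete, here is a minimal sketch of FedAvg (sample-size-weighted averaging of client parameters) next to FedMedian (coordinate-wise median). Parameters are represented as flat Python lists for simplicity; the center sizes and values are hypothetical, and a real system would apply the same operations to full model weight tensors.

```python
from statistics import median


def fedavg(client_weights, client_sizes):
    """Weighted mean of client parameter vectors, weighted by local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def fedmedian(client_weights):
    """Coordinate-wise median of client parameter vectors."""
    return [median(column) for column in zip(*client_weights)]


# Hypothetical updates from three centers; the third one is an outlier.
clients = [[1.0, 2.0], [1.0, 4.0], [10.0, 6.0]]
sizes = [100, 100, 200]

avg = fedavg(clients, sizes)   # pulled toward the large outlier center
med = fedmedian(clients)       # less affected by the single outlier
```

Only parameter vectors and sample counts reach the aggregator in this sketch, so raw patient videos never leave the contributing center. The median is less sensitive to a single divergent client than the weighted mean, at the cost of ignoring how much data each center holds.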