AI Summary
This work addresses the challenges of poor local model adaptability and inaccurate server-side aggregation in federated learning for surgical video understanding, which arise from the high diversity of tissues and tasks across procedures. To overcome these issues, the authors propose SurgFed, a novel framework that introduces Language-guided Channel Selection (LCS) and Language-guided Hypernetwork Aggregation (LHA). These components leverage textual instructions to dynamically adjust local model architectures and guide cross-task parameter fusion. By integrating lightweight channel selection, inter-layer cross-attention, and a hypernetwork-based design, SurgFed enables efficient federated learning across multiple tasks, institutions, and surgical procedures. Extensive experiments on four surgical procedures and five public datasets demonstrate that SurgFed significantly outperforms existing methods, achieving state-of-the-art performance in both surgical scene segmentation and depth estimation tasks.
Abstract
Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, which limits their effectiveness in heterogeneous clinical environments and leads to poor local predictions. (2) Task Diversity: Server-side aggregation that relies solely on gradient-based clustering often produces suboptimal or incorrect parameter updates because of inter-site task heterogeneity, resulting in inaccurate localization. To address these two issues, we propose SurgFed, a multi-task federated learning framework that enables federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is built on two designs, Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), which together aim to fully exploit cross-site and cross-task information. Technically, LCS is a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, helping the local model learn site-specific embeddings. We further introduce LHA, which employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed improves over state-of-the-art methods on five public datasets across four surgical types. The code is available at https://anonymous.4open.science/r/SurgFed-070E/.
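To make the two language-guided ideas in the abstract more concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation. Every name here (`channel_gates`, `apply_gates`, `attention_weights`, `personalized_aggregate`), the toy embeddings, and the weights are hypothetical. The first half gates feature channels with a sigmoid computed from a text embedding, as a stand-in for LCS. The second half mixes site parameters with attention weights computed from the sites' text embeddings. This is a deliberate simplification of LHA: the abstract describes a hypernetwork guided by layer-wise cross-attention, whereas this sketch applies the attention weights directly to one flat parameter list.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def channel_gates(text_embedding, weight, bias):
    # Assumed minimal "selection network": one linear layer plus a sigmoid,
    # giving one gate in (0, 1) per feature channel.
    return [sigmoid(dot(row, text_embedding) + b) for row, b in zip(weight, bias)]


def apply_gates(features, gates):
    # features: list of channels, each channel a flat list of activations.
    return [[g * v for v in channel] for channel, g in zip(features, gates)]


def attention_weights(text_embeddings):
    # Scaled dot-product attention among site text embeddings
    # (queries and keys are the same vectors in this sketch).
    scale = math.sqrt(len(text_embeddings[0]))
    return [
        softmax([dot(q, k) / scale for k in text_embeddings])
        for q in text_embeddings
    ]


def personalized_aggregate(site_params, weights):
    # Each site receives its own attention-weighted mix of all sites' parameters.
    n_params = len(site_params[0])
    return [
        [sum(w * p[i] for w, p in zip(row, site_params)) for i in range(n_params)]
        for row in weights
    ]


# Toy data (all values are made up for illustration).
text_emb = {"site_a": [1.0, 0.0], "site_b": [0.0, 1.0]}
weight = [[2.0, -2.0], [-2.0, 2.0], [0.0, 0.0]]  # 3 channels, 2-dim text embedding
bias = [0.0, 0.0, 0.0]

gates_a = channel_gates(text_emb["site_a"], weight, bias)
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gated = apply_gates(features, gates_a)

attn = attention_weights(list(text_emb.values()))
site_params = [[0.0, 1.0], [2.0, 3.0]]  # one flat parameter vector per site
mixed = personalized_aggregate(site_params, attn)
```

In this toy setup, the text embedding for `site_a` opens the first channel and suppresses the second. Each site's aggregated parameters stay closer to its own values than a uniform average would, because a site's text embedding attends most strongly to itself. A layer-wise variant would, under the same assumptions, repeat `personalized_aggregate` once per layer with layer-specific attention weights.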