🤖 AI Summary
This study addresses a key bottleneck in surgical video analysis: downstream tasks such as surgical phase recognition require extensive expert annotations for fine-tuning. We propose a retraining-free transferability evaluation approach based on the embedding features of pretrained models. For the first time, we systematically benchmark three source-independent metrics (LogME, H-Score, and TransRate) on the RAMIE and AutoLaparo datasets. Results show that LogME, especially with subset-min aggregation, achieves the highest correlation with actual fine-tuning performance; H-Score exhibits limited predictive power, while TransRate suffers from rank reversal. Our key contribution is identifying the discriminative failure of existing metrics when candidate models yield comparable performance, and proposing a principled model-selection strategy that jointly considers feature diversity and validation fidelity. This approach substantially reduces annotation dependency and improves the efficiency of selecting the most transferable model for surgical video analysis.
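To make the retraining-free idea concrete, the sketch below scores two of the benchmarked metrics directly from frozen embeddings and labels, with no fine-tuning. This is a minimal illustration of the published formulations, not the paper's code: the function names are ours, and the coding-rate constants in `transrate` vary across implementations.

```python
import numpy as np

def h_score(features: np.ndarray, labels: np.ndarray) -> float:
    """H-Score (Bao et al., 2019): tr(cov(f)^-1 @ cov(g)), where g replaces each
    sample's feature with its class-mean feature. Higher = more transferable."""
    f = features - features.mean(axis=0, keepdims=True)
    cov_f = np.cov(f, rowvar=False)
    g = np.empty_like(f)
    for c in np.unique(labels):
        mask = labels == c
        g[mask] = f[mask].mean(axis=0)
    cov_g = np.cov(g, rowvar=False)
    # Pseudo-inverse guards against an ill-conditioned feature covariance.
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))

def coding_rate(Z: np.ndarray, eps: float = 1e-4) -> float:
    """Rate-distortion coding rate: 0.5 * logdet(I + d/(n*eps) * Z^T Z)."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps)) * Z.T @ Z)
    return 0.5 * logdet

def transrate(features: np.ndarray, labels: np.ndarray, eps: float = 1e-4) -> float:
    """TransRate (Huang et al., 2022): R(Z) - R(Z|Y), i.e. how much coding rate
    the labels explain away. Higher = more transferable."""
    Z = features - features.mean(axis=0, keepdims=True)
    r_z = coding_rate(Z, eps)
    # Class-conditional rate, averaged over classes (the weighting is one
    # design choice; class-frequency weighting is another).
    r_zy = np.mean([coding_rate(Z[labels == c], eps) for c in np.unique(labels)])
    return r_z - r_zy
```

Ranking candidate encoders then reduces to extracting embeddings once per model and sorting by the chosen score, which is what makes the approach annotation- and compute-efficient.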
📝 Abstract
Fine-tuning pre-trained models has become a cornerstone of modern machine learning, allowing practitioners to achieve high performance with limited labeled data. In surgical video analysis, where expert annotations are especially time-consuming and costly, identifying the most suitable pre-trained model for a downstream task is both critical and challenging. Source-independent transferability estimation (SITE) offers a solution by predicting how well a model will fine-tune on target data using only its embeddings or outputs, without requiring full retraining. In this work, we formalize SITE for surgical phase recognition and provide the first comprehensive benchmark of three representative metrics, LogME, H-Score, and TransRate, on two diverse datasets (RAMIE and AutoLaparo). Our results show that LogME, particularly when aggregated by the minimum per-subset score, aligns most closely with fine-tuning accuracy; H-Score yields only weak predictive power; and TransRate often inverts the true model rankings. Ablation studies show that when candidate models perform similarly, transferability estimates lose discriminative power, underscoring the importance of maintaining model diversity or using additional validation. We conclude with practical guidelines for model selection and outline future directions toward domain-specific metrics, theoretical foundations, and interactive benchmarking tools.
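The snippet below sketches the minimum-per-subset aggregation described above: each subset (e.g., each surgical video) is scored separately and only the worst score is kept, penalizing models that transfer well only on easy subsets. The helper name and the `score_fn` interface are illustrative assumptions; any per-dataset metric, such as a LogME scorer, can be plugged in.

```python
import numpy as np
from typing import Callable

def subset_min_score(
    score_fn: Callable[[np.ndarray, np.ndarray], float],  # e.g. a LogME scorer
    features: np.ndarray,    # [N, D] frozen embeddings of the target data
    labels: np.ndarray,      # [N] surgical phase labels
    subset_ids: np.ndarray,  # [N] subset id per frame (e.g. surgical video id)
) -> float:
    """Aggregate a transferability metric by its minimum over data subsets,
    so a model ranks highly only if it transfers well everywhere."""
    return min(
        score_fn(features[subset_ids == s], labels[subset_ids == s])
        for s in np.unique(subset_ids)
    )

# Hypothetical selection loop: embed the target data once per candidate model,
# then keep the candidate with the best worst-case (subset-min) score:
# best = max(candidates, key=lambda m: subset_min_score(logme, m.embed(X), y, vids))
```

Pairing this worst-case aggregate with a small held-out validation set is one way to recover discriminative power when candidate models score within noise of each other, as the ablation studies suggest.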