🤖 AI Summary
This study addresses the challenge of developing a generalizable, clinically deployable foundation model for ultrasound imaging, which is hindered by substantial variability in anatomy and acquisition protocols. The authors propose a unified multi-head multi-task learning (MH-MTL) framework that handles 27 subtasks spanning segmentation, classification, detection, and regression within a single shared network. The model combines an ImageNet-pretrained EfficientNet-B4 backbone with a feature pyramid network (FPN) for multi-scale feature extraction, and is trained with a composite loss, task-adaptive learning rate scaling, and a cosine annealing schedule. A task routing mechanism sends high-level semantic features to global tasks and spatially detailed FPN features to dense prediction tasks. The work serves as the official baseline for the FM_UIA 2026 challenge benchmark; validation results demonstrate the feasibility and robustness of the unified design and provide a strong, extensible starting point for future research.
📝 Abstract
Ultrasound (US) imaging exhibits substantial heterogeneity across anatomical structures and acquisition protocols, posing significant challenges to the development of generalizable analysis models. Most existing methods are task-specific, limiting their suitability as clinically deployable foundation models. To address this limitation, the Foundation Model Challenge for Ultrasound Image Analysis (FM_UIA 2026) introduces a large-scale multi-task benchmark comprising 27 subtasks across segmentation, classification, detection, and regression. In this paper, we present the official baseline for FM_UIA 2026 based on a unified Multi-Head Multi-Task Learning (MH-MTL) framework that supports all tasks within a single shared network. The model employs an ImageNet-pretrained EfficientNet-B4 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) to capture multi-scale contextual information. A task-specific routing strategy enables global tasks to leverage high-level semantic features, while dense prediction tasks exploit spatially detailed FPN representations. Training incorporates a composite loss with task-adaptive learning rate scaling and a cosine annealing schedule. Validation results demonstrate the feasibility and robustness of this unified design, establishing a strong and extensible baseline for ultrasound foundation model research. The code and dataset are publicly available on GitHub: https://github.com/lijiake2408/Foundation-Model-Challenge-for-Ultrasound-Image-Analysis
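
To make the routing and scheduling ideas concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that classification and regression count as "global" tasks and that segmentation and detection count as "dense" tasks; the abstract does not say how detection is grouped. It also assumes that task-adaptive scaling is a fixed per-task multiplier applied to a standard cosine-annealed learning rate. All names (`route_features`, `cosine_annealed_lr`, `task_lr`, `task_scale`) are hypothetical.

```python
import math

# Assumed grouping of the four task families named in the abstract.
GLOBAL_TASKS = {"classification", "regression"}
DENSE_TASKS = {"segmentation", "detection"}


def route_features(task_type, backbone_top, fpn_maps):
    """Pick the feature source for a task head.

    Global tasks receive the high-level backbone output; dense prediction
    tasks receive the multi-scale FPN feature maps.
    """
    if task_type in GLOBAL_TASKS:
        return backbone_top
    if task_type in DENSE_TASKS:
        return fpn_maps
    raise ValueError(f"unknown task type: {task_type!r}")


def cosine_annealed_lr(base_lr, step, total_steps, min_lr=0.0):
    """Standard cosine annealing from base_lr (step 0) to min_lr (last step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def task_lr(base_lr, task_scale, step, total_steps):
    """Assumed form of task-adaptive scaling: a per-task multiplier
    applied on top of the shared cosine schedule."""
    return task_scale * cosine_annealed_lr(base_lr, step, total_steps)
```

For example, `route_features("segmentation", top, pyramid)` returns `pyramid` unchanged, and `task_lr(1e-3, 2.0, step, total_steps)` follows the same cosine curve as the shared schedule at twice the magnitude. In a real training loop, such multipliers would typically be expressed as separate optimizer parameter groups, one per task head.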