A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation

📅 2025-12-02
🤖 AI Summary
Current test-time adaptation (TTA) evaluation for medical image segmentation suffers from narrow modality coverage, limited task diversity, and inconsistent evaluation protocols. To address these limitations, we introduce MedSeg-TTA, the first large-scale, cross-modal TTA benchmark, encompassing seven distinct medical imaging modalities and standardizing data preprocessing, backbone architectures, and evaluation protocols. The benchmark integrates twenty representative TTA methods spanning four paradigms (input transformation, feature alignment, output regularization, and prior estimation), enabling the first standardized, multi-modal, multi-paradigm comparison. Extensive experiments reveal no universally optimal method: input-level approaches exhibit superior robustness under mild distribution shifts, while feature- and output-level methods achieve better boundary-aware metrics; several methods degrade significantly under severe inter-center and inter-device domain shifts. This work establishes a systematic, cross-modal foundation for rigorously assessing the reliability and applicability of TTA techniques in medical imaging.

📝 Abstract
Test-time adaptation (TTA) is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities (MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray) under fully unified data preprocessing, backbone configuration, and test-time protocols. The benchmark covers four major adaptation paradigms: input-level transformation, feature-level alignment, output-level regularization, and prior estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best under all conditions. Input-level methods are more stable under mild appearance shifts, and feature-level and output-level methods offer greater advantages on boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source code and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.
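To make the output-level regularization paradigm concrete: a widely used representative is entropy minimization at test time (Tent-style), where the model's predictions on an unlabeled test image are sharpened by descending the gradient of prediction entropy. The sketch below is illustrative only, not the paper's implementation; the function name `tta_entropy_step` and the toy logits are assumptions, and it uses an analytic gradient on raw logits rather than updating normalization parameters as full TTA methods do.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    """Shannon entropy (in nats) of probability vectors on the last axis."""
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def tta_entropy_step(logits, lr=0.5):
    """One gradient-descent step on prediction entropy H(softmax(z)).

    Uses the analytic gradient dH/dz_k = -p_k * (log p_k + H),
    which follows from the softmax Jacobian.
    """
    p = softmax(logits)
    h = entropy(p)[..., None]
    grad = -p * (np.log(p + 1e-12) + h)
    return logits - lr * grad

# Toy per-pixel logits from a hypothetical 3-class segmentation head.
z = np.array([1.0, 0.5, -0.2])
z_adapted = tta_entropy_step(z)

# One step should sharpen the prediction, i.e. reduce its entropy.
assert entropy(softmax(z_adapted)) < entropy(softmax(z))
```

In practice such methods restrict the update to a small, stable subset of parameters (e.g. normalization affine parameters) rather than the logits themselves; the benchmark's results suggest how well this family transfers across modalities compared with input- and feature-level alternatives.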
Problem

Research questions and friction points this paper is trying to address.

Evaluates test-time adaptation (TTA) methods for medical image segmentation across diverse imaging modalities.
Compares adaptation paradigms to assess their reliability and applicability under domain shift.
Highlights performance variability and the need for principled method selection in clinical deployment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark covering twenty adaptation methods across seven modalities
Systematic comparison of four adaptation paradigms under unified protocols
Standardized datasets and validated implementations to support robust clinical deployment
Wenjing Yu
Hangzhou Dianzi University, Hangzhou, 310018, China
Shuo Jiang
Hangzhou Dianzi University, Hangzhou, 310018, China
Yifei Chen
Tsinghua University, Beijing, 100084, China
Shuo Chang
Hangzhou Dianzi University, Hangzhou, 310018, China
Yuanhan Wang
Hangzhou Dianzi University, Hangzhou, 310018, China
Beining Wu
Hangzhou Dianzi University, Hangzhou, 310018, China
Jie Dong
Hangzhou Dianzi University, Hangzhou, 310018, China
Mingxuan Liu
Tsinghua University, Beijing, 100084, China
Shenghao Zhu
Hangzhou Dianzi University, Hangzhou, 310018, China
Feiwei Qin
Prof., College of Computer Science, Hangzhou Dianzi University
Artificial Intelligence, Computer-Aided Design, Computer Vision, Medical Image Analysis
Changmiao Wang
Shenzhen Research Institute of Big Data, Shenzhen, 518172, China
Qiyuan Tian
Tsinghua University; Stanford University; Massachusetts General Hospital; Harvard Medical School
MRI, Diffusion MRI, Neuroimaging, Deep Learning