A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation

📅 2025-12-02
🤖 AI Summary
Current test-time adaptation (TTA) evaluation for medical image segmentation suffers from narrow modality coverage, limited task diversity, and inconsistent evaluation protocols. To address these limitations, we introduce MedSeg-TTA, the first large-scale, cross-modal TTA benchmark, encompassing seven distinct medical imaging modalities and standardizing data preprocessing, backbone architectures, and evaluation protocols. The benchmark integrates twenty representative TTA methods spanning four paradigms (input transformation, feature alignment, output regularization, and prior estimation), enabling the first standardized, multi-modal, multi-paradigm comparison. Extensive experiments reveal no universally optimal method: input-level approaches exhibit superior robustness under mild distribution shifts, while feature- and output-level methods achieve better boundary-aware metrics; several methods degrade significantly under severe inter-center and inter-device domain shifts. This work establishes a systematic, cross-modal foundation for rigorously assessing the reliability and applicability of TTA techniques in medical imaging.

📝 Abstract
Test-time adaptation (TTA) is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities (MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray) under fully unified data preprocessing, backbone configuration, and test-time protocols. The benchmark covers four major adaptation paradigms: input-level transformation, feature-level alignment, output-level regularization, and prior estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best under all conditions. Input-level methods are more stable under mild appearance shifts, and feature-level and output-level methods offer greater advantages on boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source code and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.
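To make the output-level regularization paradigm concrete: a widely used representative is entropy minimization at test time (Tent-style), where the model's predictions on an unlabeled test image are sharpened by descending the gradient of prediction entropy. The sketch below is illustrative only, not the paper's implementation; the function name `tta_entropy_step` and the toy logits are assumptions, and it uses an analytic gradient on raw logits rather than updating normalization parameters as full TTA methods do.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    """Shannon entropy (in nats) of probability vectors on the last axis."""
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def tta_entropy_step(logits, lr=0.5):
    """One gradient-descent step on prediction entropy H(softmax(z)).

    Uses the analytic gradient dH/dz_k = -p_k * (log p_k + H),
    which follows from the softmax Jacobian.
    """
    p = softmax(logits)
    h = entropy(p)[..., None]
    grad = -p * (np.log(p + 1e-12) + h)
    return logits - lr * grad

# Toy per-pixel logits from a hypothetical 3-class segmentation head.
z = np.array([1.0, 0.5, -0.2])
z_adapted = tta_entropy_step(z)

# One step should sharpen the prediction, i.e. reduce its entropy.
assert entropy(softmax(z_adapted)) < entropy(softmax(z))
```

In practice such methods restrict the update to a small, stable subset of parameters (e.g. normalization affine parameters) rather than the logits themselves; the benchmark's results suggest how well this family transfers across modalities compared with input- and feature-level alternatives.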
Problem

Research questions and friction points this paper is trying to address.

Evaluates test-time adaptation (TTA) methods for medical image segmentation across diverse imaging modalities.
Compares adaptation paradigms to assess their reliability and applicability under domain shift.
Highlights performance variability and the need for principled method selection in clinical deployment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark covering twenty adaptation methods across seven modalities
Systematic comparison of four adaptation paradigms under unified protocols
Standardized datasets and validated implementations to support robust clinical deployment
Wenjing Yu
Hangzhou Dianzi University, Hangzhou, 310018, China
Shuo Jiang
Hangzhou Dianzi University, Hangzhou, 310018, China
Yifei Chen
Tsinghua University, Beijing, 100084, China
Shuo Chang
Hangzhou Dianzi University, Hangzhou, 310018, China
Yuanhan Wang
Hangzhou Dianzi University, Hangzhou, 310018, China
Beining Wu
Hangzhou Dianzi University, Hangzhou, 310018, China
Jie Dong
Hangzhou Dianzi University, Hangzhou, 310018, China
Mingxuan Liu
Tsinghua University, Beijing, 100084, China
Shenghao Zhu
Hangzhou Dianzi University, Hangzhou, 310018, China
Feiwei Qin
Prof., College of Computer Science, Hangzhou Dianzi University
Artificial Intelligence, Computer-Aided Design, Computer Vision, Medical Image Analysis
Changmiao Wang
Shenzhen Research Institute of Big Data, Shenzhen, 518172, China
Qiyuan Tian
Tsinghua University; Stanford University; Massachusetts General Hospital; Harvard Medical School
MRI, Diffusion MRI, Neuroimaging, Deep Learning