🤖 AI Summary
Medical vision-language models face a fundamental trade-off between generalizability and domain specificity: pretrained general models exhibit strong robustness but lack modality-specific knowledge, whereas fine-tuned expert models achieve high in-distribution performance yet suffer from poor out-of-distribution generalization. Existing static model merging techniques, designed for natural images, demonstrate unstable performance across diverse medical imaging modalities. To address this, we propose T³, the first test-time adaptive model merging framework that requires no backpropagation. T³ dynamically computes sample- or batch-level interpolation weights via the Jensen–Shannon divergence to personalize the fusion of general and expert models. We further introduce output distribution alignment and a batch-wise variant (T³_B) to enhance efficiency. Evaluated across four major medical imaging modalities, T³ achieves state-of-the-art Top-1 accuracy, significantly reduces prediction error, and maintains low inference latency, enabling efficient clinical deployment for multimodal medical AI.
📝 Abstract
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific knowledge, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T³), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen–Shannon divergence between the two models' output distributions. T³ dynamically preserves local precision when the models agree and defers to generalist robustness under drift. To overcome the inference cost of sample-wise merging, we further propose a batch-wise extension, T³_B, which computes a single merging coefficient across a batch of samples, dramatically reducing this computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruption settings across four modalities. Empirically, T³ sets a new state of the art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive deployment of medical vision-language models (MVLMs) in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
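The divergence-guided merging idea described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the specific mapping from the Jensen–Shannon divergence to an interpolation coefficient (`alpha = 1 - JSD / log 2`, so high disagreement shifts weight toward the generalist) and the function names are assumptions made for the example.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between distributions p and q (per row)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)  # bounded by log(2)

def merge_outputs(logits_general, logits_expert):
    """Per-sample fusion of generalist and expert predictions.

    Hypothetical coefficient rule for illustration: when the two models
    agree (low JSD), trust the expert; as they diverge, lean on the
    generalist's robustness. A batch-wise T3_B-style variant would
    average the JSD over the batch to get a single coefficient.
    """
    p = softmax(logits_general)
    q = softmax(logits_expert)
    jsd = js_divergence(p, q)                 # shape: (batch,)
    alpha = 1.0 - jsd / np.log(2.0)          # weight on the expert, in (0, 1]
    merged = alpha[..., None] * q + (1.0 - alpha[..., None]) * p
    return merged, alpha
```

When the two models produce identical logits the divergence vanishes, so the expert's prediction is kept unchanged; as their output distributions drift apart, the merged distribution moves toward the generalist's.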