🤖 AI Summary
Medical vision-language models face a fundamental trade-off between generalizability and domain specificity: pretrained general models exhibit strong robustness but lack modality-specific knowledge, whereas fine-tuned expert models achieve high in-distribution performance yet suffer from poor out-of-distribution generalization. Existing static model merging techniques, designed for natural images, demonstrate unstable performance across diverse medical imaging modalities. To address this, we propose T³, the first test-time adaptive model merging framework that requires no backpropagation. T³ dynamically computes sample- or batch-level interpolation weights via the Jensen–Shannon divergence to personalize the fusion of general and expert models. We further introduce output distribution alignment and a batch-wise variant (T³_B) to enhance efficiency. Evaluated across four major medical imaging modalities, T³ achieves state-of-the-art Top-1 accuracy, significantly reduces prediction error, and maintains low inference latency, enabling efficient clinical deployment for multimodal medical AI.
📝 Abstract
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific knowledge, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T³), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen–Shannon divergence between the two models' output distributions. T³ dynamically preserves local precision when the models agree and defers to generalist robustness under drift. To overcome the inference cost of sample-wise merging, we further propose a batch-wise extension, T³_B, which computes a single merging coefficient across a batch of samples, dramatically reducing this computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruption settings across four modalities. Empirically, T³ sets a new state of the art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive deployment of medical vision-language models (MVLMs) in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
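The divergence-guided merging idea described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the specific mapping from the Jensen–Shannon divergence to an interpolation coefficient (`alpha = 1 - JSD / log 2`, so high disagreement shifts weight toward the generalist) and the function names are assumptions made for the example.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between distributions p and q (per row)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)  # bounded by log(2)

def merge_outputs(logits_general, logits_expert):
    """Per-sample fusion of generalist and expert predictions.

    Hypothetical coefficient rule for illustration: when the two models
    agree (low JSD), trust the expert; as they diverge, lean on the
    generalist's robustness. A batch-wise T3_B-style variant would
    average the JSD over the batch to get a single coefficient.
    """
    p = softmax(logits_general)
    q = softmax(logits_expert)
    jsd = js_divergence(p, q)                 # shape: (batch,)
    alpha = 1.0 - jsd / np.log(2.0)          # weight on the expert, in (0, 1]
    merged = alpha[..., None] * q + (1.0 - alpha[..., None]) * p
    return merged, alpha
```

When the two models produce identical logits the divergence vanishes, so the expert's prediction is kept unchanged; as their output distributions drift apart, the merged distribution moves toward the generalist's.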