🤖 AI Summary
Medical vision-language models (MVLMs) are susceptible to distribution shifts introduced by real-world clinical workflows, such as image acquisition, reconstruction, display, and transmission, which degrade their reliability. This work proposes CoDA, a framework that formulates the first clinically grounded chain of distribution-shift attacks by composing multiple plausible image-processing stages. CoDA induces significant model failures while preserving visual plausibility, revealing that joint multi-stage perturbations are substantially more damaging than single-stage ones. To mitigate this vulnerability, the authors introduce a lightweight teacher-guided token-space adaptation strategy that incorporates masked structural-similarity constraints and patch-level alignment. This approach improves the zero-shot robustness of CLIP-style MVLMs under CoDA-induced perturbations across diverse modalities, including brain MRI, chest X-ray, and abdominal CT.
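The chained-shift idea above can be illustrated with a minimal toy sketch: compose simplified stand-ins for the three pipeline stages (acquisition shading, display remapping, export quantization) and accept the chained output only if a masked similarity score against the clean image stays above a threshold. All stage functions, parameters, and the similarity proxy here are illustrative assumptions, not the paper's actual CoDA implementation or its SSIM constraint.

```python
import numpy as np

def acquisition_shading(img, strength=0.15):
    """Smooth multiplicative gain field, a toy stand-in for acquisition shading."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    gain = 1.0 + strength * np.sin(2 * np.pi * xx / w) * np.cos(2 * np.pi * yy / h)
    return np.clip(img * gain, 0.0, 1.0)

def display_remap(img, gamma=1.3):
    """Gamma curve, a toy stand-in for reconstruction/display remapping."""
    return np.clip(img, 0.0, 1.0) ** gamma

def export_quantize(img, levels=32):
    """Coarse intensity quantization, a toy stand-in for delivery/export loss."""
    return np.round(img * (levels - 1)) / (levels - 1)

def masked_similarity(clean, shifted, mask):
    """Normalized correlation inside the mask (simplified proxy for masked SSIM)."""
    a, b = clean[mask].astype(float), shifted[mask].astype(float)
    a, b = a - a.mean(), b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8
    return float((a * b).sum() / denom)

def chain_shift(img, mask, stages, tau=0.7):
    """Apply stages in sequence; keep the result only if it stays plausible."""
    out = img.copy()
    for stage in stages:
        out = stage(out)
    return out if masked_similarity(img, out, mask) >= tau else img

rng = np.random.default_rng(0)
img = rng.random((64, 64))                  # toy single-channel image in [0, 1]
mask = np.ones_like(img, dtype=bool)        # anatomy mask; all-ones for the toy case
shifted = chain_shift(img, mask, [acquisition_shading, display_remap, export_quantize])
```

In CoDA the stage compositions and parameters are jointly optimized to maximize model failure; this sketch only shows the fixed forward composition with the plausibility check.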
📝 Abstract
Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployed settings.
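The repair strategy in the abstract aligns a student's patch tokens with a frozen teacher's tokens. A minimal toy sketch of that idea: fit a linear token-space adapter by gradient descent so the adapted corrupted-view tokens match the teacher's clean-view tokens patch by patch. The linear adapter, loss, dimensions, and training loop are all illustrative assumptions; the paper's actual method operates on CLIP-style token features and is not specified here.

```python
import numpy as np

def patch_alignment_loss(student, teacher):
    """Mean squared error between corresponding patch tokens, shape (N, D)."""
    return float(np.mean((student - teacher) ** 2))

def adapt_tokens(student, teacher, steps=200, lr=0.1):
    """Fit a linear adapter W (D x D) so student @ W.T matches the frozen teacher."""
    n, d = student.shape
    W = np.eye(d)
    for _ in range(steps):
        adapted = student @ W.T
        grad = 2.0 / n * (adapted - teacher).T @ student  # dL/dW for the MSE loss
        W -= lr * grad
    return W

rng = np.random.default_rng(1)
teacher = rng.standard_normal((49, 16))              # e.g., 7x7 patch grid, 16-dim tokens
shift = np.eye(16) + 0.1 * rng.standard_normal((16, 16))
student = teacher @ shift.T                          # tokens of the shifted/corrupted view
W = adapt_tokens(student, teacher)
before = patch_alignment_loss(student, teacher)
after = patch_alignment_loss(student @ W.T, teacher)
```

Because the corruption here is an invertible linear shift of the token space, the adapter can recover it almost exactly; real CoDA degradations are nonlinear, which is why the paper pairs adaptation with patch-level alignment rather than a single global map.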