CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical vision-language models (MVLMs) are susceptible to distribution shifts introduced during real-world clinical workflows—such as image acquisition, reconstruction, display, and transmission—leading to degraded reliability. This work proposes CoDA, a novel framework that formulates the first clinically grounded chain of distribution shift attacks by composing multiple plausible image processing stages. CoDA induces significant model failure while preserving visual plausibility, revealing that joint multi-stage perturbations are substantially more detrimental than single-stage ones. To mitigate this vulnerability, the authors introduce a lightweight teacher-guided token-space adaptation strategy, incorporating masked structural similarity constraints and patch-level alignment. This approach effectively enhances the zero-shot robustness of CLIP-style MVLMs under CoDA-induced perturbations across diverse modalities, including brain MRI, chest X-ray, and abdominal CT.
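The summary describes composing several plausible image-processing stages (acquisition shading, display remapping, delivery degradation) while a masked structural-similarity constraint keeps the result visually plausible. A minimal NumPy sketch of that idea follows; the stage functions, parameter names, and the simplified global (non-windowed) SSIM are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def masked_ssim(x, y, mask, c1=0.01**2, c2=0.03**2):
    # Simplified global SSIM computed only over the anatomy mask
    xm, ym = x[mask], y[mask]
    mx, my = xm.mean(), ym.mean()
    vx, vy = xm.var(), ym.var()
    cov = ((xm - mx) * (ym - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def shading(img, strength):
    # Acquisition-like multiplicative bias field (linear gradient)
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w] / max(h, w)
    field = 1.0 + strength * (xx + yy - 1.0)
    return np.clip(img * field, 0.0, 1.0)

def remap(img, gamma):
    # Reconstruction/display intensity remapping (gamma curve)
    return np.clip(img, 0.0, 1.0) ** gamma

def export_noise(img, sigma, rng):
    # Delivery/export degradation modeled as additive Gaussian noise
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def chain_attack(img, mask, params, rng, ssim_floor=0.85):
    # Compose the three stages, then enforce the plausibility constraint:
    # reject compositions whose masked SSIM drops below the floor
    out = export_noise(remap(shading(img, params["shade"]), params["gamma"]),
                       params["sigma"], rng)
    return out if masked_ssim(img, out, mask) >= ssim_floor else img
```

The paper's contribution is jointly optimizing the stage composition and parameters to maximize model failure; this sketch only evaluates one fixed composition against the constraint.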

📝 Abstract
Medical vision-language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.
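The repair strategy in the abstract is teacher-guided token-space adaptation with patch-level alignment: patch tokens from shifted inputs are pulled toward the teacher's tokens from clean inputs. As a toy stand-in for the paper's learned adapter, the sketch below fits a closed-form linear (ridge-regression) map between student and teacher patch tokens; the function names, shapes, and the cosine alignment metric are all hypothetical:

```python
import numpy as np

def fit_token_adapter(T_student, T_teacher, lam=1e-3):
    # Closed-form ridge regression mapping shifted-input patch tokens
    # toward the teacher's clean-input tokens. T_* shape: (n_patches, d).
    d = T_student.shape[1]
    A = T_student.T @ T_student + lam * np.eye(d)
    B = T_student.T @ T_teacher
    return np.linalg.solve(A, B)  # (d, d) adapter weights

def patch_alignment_error(T_a, T_b):
    # Patch-level alignment: mean cosine distance between paired tokens
    na = T_a / np.linalg.norm(T_a, axis=1, keepdims=True)
    nb = T_b / np.linalg.norm(T_b, axis=1, keepdims=True)
    return 1.0 - (na * nb).sum(axis=1).mean()
```

Applying the fitted adapter (`T_shifted @ W`) should reduce the patch-level alignment error to the teacher; the actual method is a lightweight learned adaptation, not this one-shot least-squares fit.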
Problem

Research questions and friction points this paper is trying to address.

medical vision-language models
clinical workflow robustness
distribution shift
image degradation
model reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Distribution Attacks
Medical Vision-Language Models
Post-Hoc Token-Space Repair
Clinical Plausibility
Multimodal Robustness
Xiang Chen
Fangfang Yang (PhD, University of California, Riverside; Electrical Engineering)
Chunlei Meng (Fudan University; Embodied AI, Multimodal, Multi-agent)
Chengyin Hu
Ang Li
Yiwei Wei
Jiahuan Long
Jiujiang Guo