🤖 AI Summary
This study investigates whether vision-language models (VLMs) fine-tuned on medical data possess genuine clinical reasoning capabilities or merely rely on superficial visual cues. We systematically evaluate open-source VLMs (LLaVA, LLaVA-Med, Gemma, and MedGemma) across four medical imaging tasks of increasing difficulty and reveal, for the first time, a sharp performance drop in complex scenarios. To probe the models' underlying knowledge, we introduce a diagnosis pipeline grounded in image descriptions, leveraging GPT-5.1 to generate structured clinical reports from model-produced descriptions, and we analyze visual encoder embeddings to pinpoint failure modes. Our findings indicate that medical fine-tuning does not consistently improve performance; instead, model outputs are highly sensitive to prompting, accuracy approaches random levels as task complexity increases, and overall behavior remains fragile and unreliable.
📝 Abstract
Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline in which models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and flawed downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.