Test-Time-Scaling for Zero-Shot Diagnosis with Visual-Language Reasoning

📅 2025-06-11
🤖 AI Summary
To address the scarcity of supervised fine-tuning data and the high cost of annotation in medical image diagnosis, this paper proposes a zero-shot framework that uses test-time scaling to make large language models (LLMs) more reliable at visual question answering and diagnostic reasoning, without any supervised fine-tuning. A vision-language model (VLM) first generates multiple descriptions or interpretations of the image's visual features from an unbiased prompt; an LLM then reasons over these descriptions, and a test-time scaling strategy consolidates its multiple candidate outputs into a single final diagnosis. Evaluated on datasets spanning radiology, ophthalmology, and histopathology, the test-time scaling strategy improves diagnostic accuracy for both the proposed method and baseline methods, and the empirical analysis indicates that unbiased first-stage prompting improves the reliability of LLM-generated diagnoses.

📝 Abstract
As a cornerstone of patient care, clinical decision-making significantly influences patient outcomes and can be enhanced by large language models (LLMs). Although LLMs have demonstrated remarkable performance, their application to visual question answering in medical imaging, particularly for reasoning-based diagnosis, remains largely unexplored. Furthermore, supervised fine-tuning for reasoning tasks is largely impractical due to limited data availability and high annotation costs. In this work, we introduce a zero-shot framework for reliable medical image diagnosis that enhances the reasoning capabilities of LLMs in clinical settings through test-time scaling. Given a medical image and a corresponding textual prompt, a vision-language model generates multiple descriptions or interpretations of the visual features. These interpretations are then fed to an LLM, where a test-time scaling strategy consolidates multiple candidate outputs into a reliable final diagnosis. We evaluate our approach across various medical imaging modalities, including radiology, ophthalmology, and histopathology, and demonstrate that the proposed test-time scaling strategy enhances diagnostic accuracy for both our method and baseline methods. Additionally, we provide an empirical analysis showing that the proposed approach, which allows unbiased prompting in the first stage, improves the reliability of LLM-generated diagnoses and enhances classification accuracy.
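The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how candidate outputs are consolidated, so majority voting is used here purely as an assumed stand-in. The names `diagnose`, `majority_vote`, `fake_vlm`, and `fake_llm`, and the parameters `n_descriptions` and `n_candidates`, are hypothetical, and the two stub functions take the place of real model calls.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(candidates: List[str]) -> Tuple[str, float]:
    """Return the most frequent label and its vote share.

    ASSUMPTION: majority voting is a stand-in; the source does not
    specify the consolidation rule used by its test-time scaling strategy.
    """
    if not candidates:
        raise ValueError("need at least one candidate diagnosis")
    # Normalize case and whitespace so trivially different strings agree.
    normalized = [c.strip().lower() for c in candidates]
    label, count = Counter(normalized).most_common(1)[0]
    return label, count / len(normalized)


def diagnose(
    image: str,
    prompt: str,
    describe: Callable[[str, str, int], str],
    reason: Callable[[str, int], str],
    n_descriptions: int = 3,
    n_candidates: int = 5,
) -> Tuple[str, float]:
    """Stage 1: sample several descriptions; stage 2: sample and consolidate diagnoses."""
    # Stage 1: the vision-language model describes the visual features several times.
    descriptions = [describe(image, prompt, i) for i in range(n_descriptions)]
    context = "\n".join(descriptions)
    # Stage 2: the LLM proposes several candidate diagnoses from the descriptions.
    candidates = [reason(context, j) for j in range(n_candidates)]
    return majority_vote(candidates)


# Hypothetical stubs standing in for real VLM and LLM calls.
def fake_vlm(image: str, prompt: str, seed: int) -> str:
    return f"description {seed}: opacity in the right lower lobe"


def fake_llm(context: str, seed: int) -> str:
    answers = ["pneumonia", "Pneumonia", "normal", "pneumonia ", "effusion"]
    return answers[seed % len(answers)]
```

Calling `diagnose("xray.png", "Describe the visual findings.", fake_vlm, fake_llm)` with these stubs consolidates five candidate answers into one label plus a vote share, which could serve as a rough agreement signal. In a real system the stubs would be replaced by sampled model calls, and the voting rule by whatever consolidation strategy the paper actually uses.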
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs for medical image diagnosis with zero-shot learning
Overcoming limited data via test-time scaling in clinical reasoning
Improving diagnostic accuracy across diverse medical imaging modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot framework for medical image diagnosis
Test-time scaling consolidates multiple outputs
Vision-language model enhances LLM reasoning