🤖 AI Summary
Current medical multimodal models suffer from hallucination and logical inconsistency in clinical vision–language reasoning, severely undermining diagnostic reliability. To address this, we propose a trustworthy, logic-driven multimodal diagnostic framework that introduces a novel collaborative mechanism between a Logic Tree Generator and a Reasoning Controller. This mechanism decomposes diagnosis into verifiable premise chains, jointly modeling vision–language alignment and formal logical constraints. Built upon the LLaVA architecture, our framework integrates cross-modal projection, stepwise reasoning control, and structured reasoning tracing. Evaluated on benchmarks including MedXpertQA, it achieves significant improvements in diagnostic accuracy and reasoning interpretability while retaining competitive performance on pure-text tasks. Our core contribution is the first integration of verifiable logical structure into the multimodal medical reasoning pipeline, establishing a new paradigm for clinically trustworthy AI.
📝 Abstract
Despite the rapid growth of large language models (LLMs) and vision-language models (VLMs) in medicine, simply integrating clinical text and medical imaging does not guarantee reliable reasoning. Existing multimodal models often produce hallucinations or inconsistent chains of thought, limiting clinical trust. We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. The system comprises an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles the stepwise premises into verifiable conclusions. Evaluations on MedXpertQA and other benchmarks show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive in text-only settings. These results suggest a promising step toward trustworthy multimodal medical AI.
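To make the four-stage pipeline described above concrete, the following is a minimal, illustrative sketch of how a reasoning controller could decompose a diagnostic query into steps and how a logic tree generator could assemble the resulting premises into a verifiable conclusion. All class names, method signatures, and scores here are hypothetical placeholders, not the paper's published interface; in the actual system these stages would be driven by the LLaVA backbone and its cross-modal projection.

```python
# Hypothetical sketch of: reasoning control -> premise generation -> logic-tree assembly.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Premise:
    step: int          # index of the reasoning step that produced this premise
    statement: str     # natural-language premise, e.g. an imaging finding
    support: float     # placeholder confidence that the premise is grounded in the inputs

@dataclass
class LogicNode:
    premise: Premise
    children: List["LogicNode"] = field(default_factory=list)

class ReasoningController:
    """Decomposes a diagnostic query into ordered reasoning steps."""
    def plan(self, question: str) -> List[str]:
        # In the real framework this decomposition would come from the model;
        # here we return a fixed, illustrative plan.
        return [
            f"Identify salient image findings relevant to: {question}",
            "Relate the findings to the clinical history in the text",
            "Derive the most consistent diagnosis from the accumulated premises",
        ]

class LogicTreeGenerator:
    """Assembles stepwise premises into a verifiable conclusion tree."""
    def build(self, premises: List[Premise]) -> LogicNode:
        root = LogicNode(premise=premises[-1])                  # conclusion at the root
        root.children = [LogicNode(p) for p in premises[:-1]]   # supporting premises
        return root

    def verify(self, root: LogicNode, threshold: float = 0.5) -> bool:
        # Accept the conclusion only if every supporting premise is
        # sufficiently grounded; otherwise flag the chain for review.
        return all(child.premise.support >= threshold for child in root.children)

if __name__ == "__main__":
    controller = ReasoningController()
    generator = LogicTreeGenerator()
    steps = controller.plan("chest X-ray with acute dyspnea")
    # Pretend each step yielded one grounded premise (support scores are dummies).
    premises = [Premise(i, s, support=0.8) for i, s in enumerate(steps)]
    tree = generator.build(premises)
    print("conclusion:", tree.premise.statement)
    print("verified:", generator.verify(tree))
```

The point of the sketch is the structural contract: the controller emits an ordered plan, each step contributes a checkable premise, and the conclusion is accepted only when its supporting premises pass verification, which is what makes the reasoning trace inspectable.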