🤖 AI Summary
Current medical multimodal models suffer from hallucination and logical inconsistency in clinical vision–language reasoning, severely undermining diagnostic reliability. To address this, we propose a trustworthy, logic-driven multimodal diagnostic framework that introduces a novel collaborative mechanism between a Logic Tree Generator and a Reasoning Controller. This mechanism decomposes diagnosis into verifiable premise chains, jointly modeling vision–language alignment and formal logical constraints. Built upon the LLaVA architecture, our framework integrates cross-modal projection, stepwise reasoning control, and structured reasoning tracing. Evaluated on benchmarks including MedXpertQA, it achieves significant improvements in diagnostic accuracy and reasoning interpretability while retaining competitive performance on pure-text tasks. Our core contribution is the first integration of verifiable logical structure into the multimodal medical reasoning pipeline, establishing a new paradigm for clinically trustworthy AI.
📝 Abstract
Despite the rapid growth of large language models (LLMs) and vision-language models (VLMs) in medicine, simply integrating clinical text and medical imaging does not guarantee reliable reasoning. Existing multimodal models often produce hallucinations or inconsistent chains of thought, limiting clinical trust. We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. The system comprises an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles the stepwise premises into verifiable conclusions. Evaluations on MedXpertQA and other benchmarks show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive in text-only settings. These results suggest a promising step toward trustworthy multimodal medical AI.
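To make the four-stage pipeline described above concrete, the following is a minimal, illustrative sketch of how a reasoning controller could decompose a diagnostic query into steps and how a logic tree generator could assemble the resulting premises into a verifiable conclusion. All class names, method signatures, and scores here are hypothetical placeholders, not the paper's published interface; in the actual system these stages would be driven by the LLaVA backbone and its cross-modal projection.

```python
# Hypothetical sketch of: reasoning control -> premise generation -> logic-tree assembly.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Premise:
    step: int          # index of the reasoning step that produced this premise
    statement: str     # natural-language premise, e.g. an imaging finding
    support: float     # placeholder confidence that the premise is grounded in the inputs

@dataclass
class LogicNode:
    premise: Premise
    children: List["LogicNode"] = field(default_factory=list)

class ReasoningController:
    """Decomposes a diagnostic query into ordered reasoning steps."""
    def plan(self, question: str) -> List[str]:
        # In the real framework this decomposition would come from the model;
        # here we return a fixed, illustrative plan.
        return [
            f"Identify salient image findings relevant to: {question}",
            "Relate the findings to the clinical history in the text",
            "Derive the most consistent diagnosis from the accumulated premises",
        ]

class LogicTreeGenerator:
    """Assembles stepwise premises into a verifiable conclusion tree."""
    def build(self, premises: List[Premise]) -> LogicNode:
        root = LogicNode(premise=premises[-1])                  # conclusion at the root
        root.children = [LogicNode(p) for p in premises[:-1]]   # supporting premises
        return root

    def verify(self, root: LogicNode, threshold: float = 0.5) -> bool:
        # Accept the conclusion only if every supporting premise is
        # sufficiently grounded; otherwise flag the chain for review.
        return all(child.premise.support >= threshold for child in root.children)

if __name__ == "__main__":
    controller = ReasoningController()
    generator = LogicTreeGenerator()
    steps = controller.plan("chest X-ray with acute dyspnea")
    # Pretend each step yielded one grounded premise (support scores are dummies).
    premises = [Premise(i, s, support=0.8) for i, s in enumerate(steps)]
    tree = generator.build(premises)
    print("conclusion:", tree.premise.statement)
    print("verified:", generator.verify(tree))
```

The point of the sketch is the structural contract: the controller emits an ordered plan, each step contributes a checkable premise, and the conclusion is accepted only when its supporting premises pass verification, which is what makes the reasoning trace inspectable.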