VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

📅 2026-05-27
🤖 AI Summary
This work addresses four challenges in latent reasoning for medical multimodal large language models (MLLMs): modality collapse, insufficient visual supervision, train-inference mismatch, and lack of interpretability. The proposed VITAL framework supervises latent reasoning with dual visual-semantic signals: an auxiliary text decoder reconstructs the reasoning chain from latent states, while a visual projector regresses region-of-interest (ROI) features from a frozen, independent medical vision encoder. Both modules are discarded at inference, so they add no overhead, yet they can be re-attached post hoc to provide textual and visual explanations of the reasoning process. Trained on a 61K dataset spanning 9 imaging modalities, VITAL achieves state-of-the-art results on seven medical visual question answering benchmarks, outperforming the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, and is competitive with trillion-parameter proprietary models.
📝 Abstract
Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.
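The dual supervision described in the abstract can be pictured as a training objective with two auxiliary terms added to the usual answer loss. The sketch below is illustrative only: the abstract does not give the loss formulation, so the use of token-level cross-entropy for the text decoder, mean squared error for the ROI regression, the weights `lambda_text` and `lambda_vis`, and all function names are assumptions, not the authors' implementation. It uses plain Python lists in place of tensors to stay self-contained.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy.

    logits: list of per-token score lists (one list of vocabulary scores per token).
    target_ids: list of target token indices, same length as logits.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)


def mse(pred, target):
    """Mean squared error over all elements of two equally shaped 2-D lists."""
    count = sum(len(row) for row in pred)
    squared = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return squared / count


def dual_supervision_loss(answer_loss, text_logits, chain_ids,
                          projected_feats, roi_feats,
                          lambda_text=1.0, lambda_vis=1.0):
    """Combine the answer loss with two auxiliary terms (assumed form).

    text_logits / chain_ids: output of a hypothetical auxiliary text decoder
        and the reference reasoning-chain tokens it should reconstruct.
    projected_feats / roi_feats: output of a hypothetical visual projector
        and the ROI features from a frozen vision encoder (the targets).
    lambda_text, lambda_vis: hypothetical weighting hyperparameters.
    """
    text_term = cross_entropy(text_logits, chain_ids)
    visual_term = mse(projected_feats, roi_feats)
    total = answer_loss + lambda_text * text_term + lambda_vis * visual_term
    return {"total": total, "text": text_term, "visual": visual_term}


# Toy example: two reasoning tokens over a 4-word vocabulary, one 2-D ROI feature.
losses = dual_supervision_loss(
    answer_loss=0.5,
    text_logits=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    chain_ids=[1, 3],
    projected_feats=[[1.0, 2.0]],
    roi_feats=[[1.0, 4.0]],
)
```

Because both auxiliary terms appear only in the training objective, dropping the text decoder and visual projector at inference leaves the answer path unchanged, which is consistent with the zero-overhead claim in the abstract.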
Problem

Research questions and friction points this paper is trying to address.

latent reasoning
modality collapse
visual supervision
train-inference mismatch
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent reasoning
visual-semantic dual supervision
medical MLLMs
interpretability
modality alignment