AI Summary
Visual question answering (VQA) over multi-parametric 3D brain MRI remains challenging due to the complexity of aligning multiple interrelated 3D modalities and the scarcity of expert-annotated data.
Method: We propose mpLLM, a hierarchical mixture-of-experts multimodal large language model for VQA over multi-parametric 3D MRI, trained without image-report pretraining. It integrates modality-level and token-level expert routing for fine-grained cross-modal alignment across sequences (e.g., T1, T2, FLAIR). We introduce a prompt-conditioned multimodal fusion architecture, alongside the first neuroradiologist-validated 3D multi-parametric MRI VQA dataset. To address annotation scarcity, we devise a clinical-knowledge-guided synthetic VQA generation protocol.
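The two-level routing described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the gating shapes, the softmax gates, and the per-modality expert dictionaries are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_moe(prompt_emb, modality_tokens, gate_mod, experts):
    """Prompt-conditioned two-level routing (illustrative sketch only).

    prompt_emb:      (d,) embedding of the question prompt
    modality_tokens: list of (n_tokens, d) arrays, one per MRI sequence
    gate_mod:        (d, n_modalities) modality-level gating weights
    experts:         per-modality dict with a token-level "gate" (d, n_exp)
                     and a list of (d, d) projection "experts"
    """
    # modality-level gates: one weight per sequence, conditioned on the prompt
    mod_gates = softmax(prompt_emb @ gate_mod)
    fused = []
    for m, toks in enumerate(modality_tokens):
        # token-level gates: each token mixes the expert projections
        tok_gates = softmax(toks @ experts[m]["gate"])        # (n_tokens, n_exp)
        outs = np.stack([toks @ E for E in experts[m]["experts"]], axis=1)
        mixed = (tok_gates[..., None] * outs).sum(axis=1)     # (n_tokens, d)
        fused.append(mod_gates[m] * mixed)                    # scale by modality gate
    return np.concatenate(fused, axis=0)

# toy example: 3 sequences (T1, T2, FLAIR), 4 tokens each, d = 8, 2 experts
d, n_exp = 8, 2
tokens = [rng.normal(size=(4, d)) for _ in range(3)]
gate_mod = rng.normal(size=(d, 3))
experts = [{"gate": rng.normal(size=(d, n_exp)),
            "experts": [rng.normal(size=(d, d)) for _ in range(n_exp)]}
           for _ in range(3)]
out = hierarchical_moe(rng.normal(size=d), tokens, gate_mod, experts)
print(out.shape)  # fused token sequence across all three modalities
```

The prompt only enters at the modality level here; in practice conditioning could also reach the token-level gates, but the sketch keeps the hierarchy explicit.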
Contribution/Results: mpLLM achieves an average +5.3% improvement over state-of-the-art medical vision-language models across multiple benchmarks. Ablation studies confirm that hierarchical routing and prompt-driven fusion are critical. Code will be open-sourced; the dataset is forthcoming.
Abstract
We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts (MoE) architecture for visual question answering over multi-parametric 3D brain MRI (mpMRI). mpLLM routes across modality-level and token-level projection experts to fuse multiple interrelated 3D modalities, enabling efficient training without image-report pretraining. To address limited image-text paired supervision, mpLLM integrates a synthetic visual question answering (VQA) protocol that generates medically relevant VQA pairs from segmentation annotations, and we collaborate with medical experts for clinical validation. mpLLM outperforms strong medical VLM baselines by 5.3% on average across multiple mpMRI datasets. Our study features three main contributions: (1) the first clinically validated VQA dataset for 3D brain mpMRI, (2) a novel multimodal LLM that handles multiple interrelated 3D modalities, and (3) strong empirical results that demonstrate the medical utility of our methodology. Ablations highlight the importance of modality-level and token-level experts and prompt-conditioned routing. We have included our source code in the supplementary materials and will release our dataset upon publication.
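To make the synthetic VQA idea concrete, here is a minimal sketch of turning a binary segmentation mask into templated question-answer pairs. The question templates, the voxel size, and the first-axis laterality convention are illustrative assumptions, not the paper's actual protocol.

```python
import numpy as np

def mask_to_vqa(mask, voxel_mm3=8.0, label="enhancing tumor"):
    """Generate templated QA pairs from a binary 3D mask (sketch).

    Assumptions (not from the paper): 2 mm isotropic voxels by default,
    and axis 0 of the volume runs left-to-right.
    """
    qa = [(f"Is {label} present?", "yes" if mask.any() else "no")]
    if mask.any():
        vol_ml = mask.sum() * voxel_mm3 / 1000.0        # voxel count -> millilitres
        qa.append((f"What is the approximate volume of the {label}?",
                   f"{vol_ml:.1f} mL"))
        centroid = np.argwhere(mask)[:, 0].mean()       # lesion centre on axis 0
        side = "left" if centroid < mask.shape[0] / 2 else "right"
        qa.append((f"Which hemisphere contains the {label}?", side))
    return qa

# toy 10x10x10 volume with a small lesion in the upper half of axis 0
mask = np.zeros((10, 10, 10), dtype=bool)
mask[6:9, 4:6, 4:6] = True
pairs = mask_to_vqa(mask)
```

A real protocol would draw on richer clinical knowledge (multiple tumor subregions, anatomical atlases, radiology phrasing), but the structure is the same: derive attributes from annotations, then instantiate question templates.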