Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing medical vision-language models that process 3D brain MRI as stacks of 2D slices, failing to preserve the full spatial context essential for neuroradiological interpretation. To overcome this, the authors propose a staged vision-language framework tailored to 3D brain tumor MRI. The approach first inflates a pretrained 2D medical encoder into a 3D Vision Transformer, then applies a three-stage alignment strategy (contrastive learning, supervised projector warm-up, and LoRA-based fine-tuning) to jointly optimize the encoder with a causal language model. This is the first method to combine inflated 3D visual encoding with staged alignment specifically for neuroradiology, where lesion laterality, infiltration patterns, and anatomical localization are critical. Evaluated on 468 cases, the model achieves a Clinical Pathology F1 of 0.951, substantially surpassing the 2D baseline's 0.413, while maintaining 100% specificity on healthy samples.
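The "inflation" step described above follows the general I3D-style recipe: a pretrained 2D convolutional kernel is repeated along a new depth axis and rescaled so that activations keep the same magnitude. A minimal PyTorch sketch of this idea, applied to a ViT patch-embedding convolution (function name, depth, and layer choice are illustrative assumptions, not details from the paper):

```python
import torch
import torch.nn as nn

def inflate_patch_embed(conv2d: nn.Conv2d, depth: int = 4) -> nn.Conv3d:
    """Inflate a 2D patch-embedding convolution into 3D (I3D-style):
    the 2D kernel is tiled along the new depth axis and divided by
    `depth`, so a volume of identical slices yields the same output
    as the original 2D layer on a single slice."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(depth, *conv2d.kernel_size),
        stride=(depth, *conv2d.stride),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w2d = conv2d.weight                               # (out, in, kH, kW)
        w3d = w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
        conv3d.weight.copy_(w3d)                          # (out, in, D, kH, kW)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

The division by `depth` is the key design choice: it preserves the pretrained feature statistics at initialization, so the inflated 3D encoder starts from a sensible point rather than random weights.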

📝 Abstract
Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed \textbf{Brain3D}, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, \textbf{Brain3D} is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports\footnote{Our code is publicly available for transparency and reproducibility.}
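The third stage, LoRA-based linguistic specialization, adapts the frozen language model through low-rank updates rather than full fine-tuning. A minimal sketch of the standard LoRA mechanism (class name, rank, and scaling are illustrative assumptions; the paper's exact adapter placement and hyperparameters are not specified here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter around a frozen linear layer:
    y = W x + (alpha / r) * B(A(x)), with only A and B trainable.
    B is zero-initialized, so the adapted layer starts out
    identical to the pretrained one."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze pretrained weights
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)          # adapter delta starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.B(self.A(x))
```

Because only the small `A` and `B` matrices receive gradients, this stage can reshape the model's output style (captions to structured reports) while leaving the pretrained language weights untouched.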
Problem

Research questions and friction points this paper is trying to address.

3D brain MRI
vision-language models
radiology report generation
spatial context
neuroradiology
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inflated Vision Transformer
3D Medical Vision-Language Model
Staged Alignment
Neuroradiology Report Generation
LoRA-based Adaptation