Exploring the Design Space of 3D MLLMs for CT Report Generation

📅 2025-06-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses automatic radiology report generation (RRG) from 3D CT volumes. We systematically investigate the design space of 3D multimodal large language models (MLLMs), focusing on four key components: 3D vision encoders (ViTs), cross-modal projectors, large language model (LLM) architectures, and fine-tuning strategies. We propose two knowledge-based report augmentation methods and identify three critical insights: (1) RRG performance is largely independent of LLM parameter count under the same training protocol; (2) matching the 3D input volume size to the ViT's pretraining scale matters more than simply increasing volume size; and (3) incorporating segmentation masks alongside the CT volume improves generation quality. Evaluated on the AMOS-MM dataset (1,687 cases), our best configuration achieves up to a 10% improvement in GREEN score and secures second place in the MICCAI 2024 AMOS-MM Challenge. All code is publicly released.

📝 Abstract
Multimodal Large Language Models (MLLMs) have emerged as a promising way to automate Radiology Report Generation (RRG). In this work, we systematically investigate the design space of 3D MLLMs, including visual input representation, projectors, Large Language Models (LLMs), and fine-tuning techniques for 3D CT report generation. We also introduce two knowledge-based report augmentation methods that improve performance on the GREEN score by up to 10%, achieving 2nd place in the MICCAI 2024 AMOS-MM challenge. Our results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely independent of the size of the LLM under the same training protocol. We also show that a larger volume size does not always improve performance if the original ViT was pre-trained on a smaller volume size. Lastly, we show that using a segmentation mask along with the CT volume improves performance. The code is publicly available at https://github.com/bowang-lab/AMOS-MM-Solution
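The design space the abstract describes (3D vision encoder → cross-modal projector → LLM) can be sketched as a minimal pipeline. The sketch below is illustrative only: the volume, patch, and hidden dimensions are assumptions, not the paper's actual configurations, and the "encoder" is reduced to a linear patch embedding standing in for a pretrained 3D ViT.

```python
import numpy as np

# Hypothetical dimensions; the paper's actual configs may differ.
VOLUME = (32, 256, 256)   # D, H, W of a CT volume
PATCH = (4, 16, 16)       # 3D patch size -> 8 * 16 * 16 = 2048 patches
D_VIT, D_LLM = 768, 4096  # encoder and LLM hidden sizes (assumed)

rng = np.random.default_rng(0)

def patchify_3d(vol, patch):
    """Split a 3D volume into flattened non-overlapping 3D patches."""
    d, h, w = vol.shape
    pd, ph, pw = patch
    patches = vol.reshape(d // pd, pd, h // ph, ph, w // pw, pw)
    patches = patches.transpose(0, 2, 4, 1, 3, 5)  # group the 3 patch axes last
    return patches.reshape(-1, pd * ph * pw)       # (num_patches, patch_voxels)

def encode(patches, w_embed):
    """Stand-in for a 3D ViT: linear patch embedding only."""
    return patches @ w_embed

def project(tokens, w_proj):
    """Cross-modal projector: map visual tokens into the LLM embedding space."""
    return tokens @ w_proj

vol = rng.standard_normal(VOLUME).astype(np.float32)
w_embed = rng.standard_normal((int(np.prod(PATCH)), D_VIT)).astype(np.float32) * 0.01
w_proj = rng.standard_normal((D_VIT, D_LLM)).astype(np.float32) * 0.01

vis_tokens = project(encode(patchify_3d(vol, PATCH), w_embed), w_proj)
text_tokens = rng.standard_normal((16, D_LLM)).astype(np.float32)  # prompt embeddings
# Visual tokens are prepended to the text prompt before feeding the LLM.
llm_input = np.concatenate([vis_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (2048 visual + 16 text tokens, D_LLM)
```

The patch count illustrates why the abstract's second finding matters: enlarging the input volume changes the number (and statistics) of patch tokens the ViT sees, so a mismatch with the encoder's pretraining scale can hurt rather than help.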
Problem

Research questions and friction points this paper is trying to address.

Investigates the design space of 3D MLLMs for CT report generation
Evaluates the impact of LLM size and input volume size on performance
Proposes knowledge-based methods to enhance report accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically explores the 3D MLLM design space
Introduces knowledge-based report augmentation methods
Uses segmentation masks alongside the CT volume
Mohammed Baharoon
Harvard Medical School
Computer Vision · Multimodal Learning · Unsupervised Learning · Foundation Models
Jun Ma
Vector Institute for Artificial Intelligence, Toronto, Canada; AI Hub, University Health Network, Toronto, Canada
Congyu Fang
Vector Institute for Artificial Intelligence, Toronto, Canada; Peter Munk Cardiac Centre, University Health Network, Toronto, Canada; Department of Computer Science, University of Toronto, Toronto, Canada
Augustin Toma
Vector Institute for Artificial Intelligence, Toronto, Canada; Medical Biophysics, University of Toronto, Toronto, Canada
Bo Wang
Vector Institute for Artificial Intelligence, Toronto, Canada; Peter Munk Cardiac Centre, University Health Network, Toronto, Canada; Department of Computer Science, University of Toronto, Toronto, Canada; Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada; AI Hub, University Health Network, Toronto, Canada