Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

📅 2026-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of insufficient pretraining and limited task adaptability in existing 3D multimodal large models due to the scarcity of 3D medical imaging data. To overcome this, the authors present the first complete transfer of a well-pretrained 2D multimodal large language model to 3D CT analysis. They propose a text-guided hierarchical mixture-of-experts framework (TGH-MoE) combined with a two-stage training strategy to enable task-adaptive feature extraction. The approach significantly outperforms current 3D medical multimodal large models on both medical report generation (MRG) and medical visual question answering (MVQA) tasks, demonstrating the effectiveness of cross-dimensional transfer from 2D to 3D and highlighting the innovation of the TGH-MoE mechanism.
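The summary above describes reusing all pre-trained parameters of a 2D MLLM for 3D CT volumes. The paper's exact adaptation mechanism is not detailed here, but a minimal illustrative sketch of the general idea is to run each axial slice through the same pretrained 2D encoder and pool the slice features, so the 3D input is handled without new pretrained weights. All names and shapes below (`encode_2d`, a 16x16 slice size, 32-d features) are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_2d(slice_2d, W):
    """Stand-in for a pretrained 2D vision encoder: flatten + linear projection."""
    return np.tanh(slice_2d.reshape(-1) @ W)

# Hypothetical shapes: a CT volume of 8 slices, each 16x16, projected to 32-d features.
depth, h, w, d_feat = 8, 16, 16, 32
volume = rng.standard_normal((depth, h, w))
W = rng.standard_normal((h * w, d_feat)) / np.sqrt(h * w)

# Reuse the same 2D encoder weights for every slice, then pool across depth,
# so the 3D volume yields one feature vector with no new pretrained parameters.
slice_feats = np.stack([encode_2d(s, W) for s in volume])  # (depth, d_feat)
volume_feat = slice_feats.mean(axis=0)                     # (d_feat,)
print(volume_feat.shape)  # (32,)
```

Mean pooling is just the simplest aggregator for the sketch; any depth-wise fusion (attention, learned tokens) fits the same slice-wise reuse pattern.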

📝 Abstract
3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from an insufficiently pretrained vision encoder and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained with 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.
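The abstract's core mechanism is a mixture-of-experts whose routing is conditioned on the text prompt, so different tasks (MRG vs. MVQA) select different expert mixtures. The paper's hierarchical design is not spelled out here, so the sketch below shows only the flat text-guided gating idea; every name and dimension (`text_guided_moe`, 4 experts, 32-d image features, 16-d prompt embeddings) is an assumption for illustration, not the TGH-MoE implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions: 4 experts, 32-d image features, 16-d text-prompt embedding.
n_experts, d_img, d_txt = 4, 32, 16
experts = [rng.standard_normal((d_img, d_img)) / np.sqrt(d_img) for _ in range(n_experts)]
W_gate = rng.standard_normal((d_txt, n_experts)) / np.sqrt(d_txt)

def text_guided_moe(img_feat, txt_emb):
    """Route image features through experts, weighted by a gate computed from
    the text-prompt embedding rather than from the image itself, so the task
    implied by the prompt controls which experts dominate."""
    gate = softmax(txt_emb @ W_gate)                  # (n_experts,), sums to 1
    outs = np.stack([img_feat @ E for E in experts])  # (n_experts, d_img)
    return gate @ outs                                # (d_img,)

img_feat = rng.standard_normal(d_img)
mrg_prompt = rng.standard_normal(d_txt)  # stands in for an embedded report-generation prompt
out = text_guided_moe(img_feat, mrg_prompt)
print(out.shape)  # (32,)
```

Because the gate depends only on the prompt, two different prompts over the same volume can yield different expert mixtures, which is the task-adaptive behavior the abstract describes.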
Problem

Research questions and friction points this paper is trying to address.

3D medical image analysis
multimodal large language models
medical report generation
medical visual question answering
vision encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D medical image analysis
multimodal large language model
Text-Guided Hierarchical MoE
two-stage training
medical report generation
Yang Yu
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
Dunyuan Xu
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Yaoqian Li
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Xiaomeng Li
Assistant Professor, The Hong Kong University of Science and Technology
Medical Image Analysis, AI in Healthcare, Deep Learning
Jinpeng Li
The Chinese University of Hong Kong
Deep Learning, Medical Image Analysis, Pedestrian Detection
Pheng-Ann Heng
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Institute of Medical Intelligence and XR, The Chinese University of Hong Kong, Hong Kong, China