🤖 AI Summary
This study addresses the challenge of insufficient pretraining and limited task adaptability in existing 3D multimodal large language models, which stems from the scarcity of 3D medical imaging data. To overcome this, the authors present the first complete transfer of a well-pretrained 2D multimodal large language model to 3D CT analysis, reusing all of its pretrained parameters. They propose a text-guided hierarchical mixture-of-experts framework (TGH-MoE), combined with a two-stage training strategy, to enable task-adaptive feature extraction. The approach significantly outperforms current 3D medical multimodal large models on both medical report generation (MRG) and medical visual question answering (MVQA), demonstrating the effectiveness of cross-dimensional transfer from 2D to 3D and the value of the TGH-MoE mechanism.
📝 Abstract
3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. They therefore have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from insufficiently pretrained vision encoders and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, well trained on 2D natural images, to support 3D medical volumetric inputs while reusing all of its pretrained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which distinguishes tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs on both MRG and MVQA tasks. Our code will be released once this paper is accepted.
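To make the text-guided routing idea concrete, here is a minimal sketch of how a gate conditioned on the text prompt could select experts for visual features. This is an illustrative assumption, not the paper's actual TGH-MoE implementation; all names, dimensions, and the top-k routing scheme are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

# Hypothetical sizes: 8-dim features, 4 experts, route to the top 2.
dim, num_experts, top_k = 8, 4, 2
experts = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(num_experts)]
gate_w = rng.standard_normal((dim, num_experts)) * 0.1

def text_guided_moe(vis_tokens, text_emb):
    """Route visual tokens through experts chosen by the text prompt.

    vis_tokens: (N, dim) visual token features
    text_emb:   (dim,) pooled text-prompt embedding
    """
    probs = softmax(text_emb @ gate_w)        # expert scores from the prompt
    top = np.argsort(probs)[-top_k:]          # indices of the top-k experts
    w = probs[top] / probs[top].sum()         # renormalize over selected experts
    out = np.zeros_like(vis_tokens)
    for wi, ei in zip(w, top):
        out += wi * (vis_tokens @ experts[ei])
    return out

x = rng.standard_normal((16, dim))  # 16 visual tokens
t = rng.standard_normal(dim)        # one prompt embedding
y = text_guided_moe(x, t)
print(y.shape)  # (16, 8)
```

Because the gate depends only on the prompt embedding, an MRG-style prompt and an MVQA-style prompt would activate different expert mixtures over the same image features, which is the task-adaptive behavior the abstract describes.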