Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

📅 2026-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of insufficient pretraining and limited task adaptability in existing 3D multimodal large models due to the scarcity of 3D medical imaging data. To overcome this, the authors present the first complete transfer of a well-pretrained 2D multimodal large language model to 3D CT analysis. They propose a text-guided hierarchical mixture-of-experts framework (TGH-MoE) combined with a two-stage training strategy to enable task-adaptive feature extraction. The approach significantly outperforms current 3D medical multimodal large models on both medical report generation (MRG) and medical visual question answering (MVQA) tasks, demonstrating the effectiveness of cross-dimensional transfer from 2D to 3D and highlighting the innovation of the TGH-MoE mechanism.
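The summary above describes reusing all pre-trained parameters of a 2D MLLM for 3D CT volumes. The paper's exact adaptation mechanism is not detailed here, but a minimal illustrative sketch of the general idea is to run each axial slice through the same pretrained 2D encoder and pool the slice features, so the 3D input is handled without new pretrained weights. All names and shapes below (`encode_2d`, a 16x16 slice size, 32-d features) are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_2d(slice_2d, W):
    """Stand-in for a pretrained 2D vision encoder: flatten + linear projection."""
    return np.tanh(slice_2d.reshape(-1) @ W)

# Hypothetical shapes: a CT volume of 8 slices, each 16x16, projected to 32-d features.
depth, h, w, d_feat = 8, 16, 16, 32
volume = rng.standard_normal((depth, h, w))
W = rng.standard_normal((h * w, d_feat)) / np.sqrt(h * w)

# Reuse the same 2D encoder weights for every slice, then pool across depth,
# so the 3D volume yields one feature vector with no new pretrained parameters.
slice_feats = np.stack([encode_2d(s, W) for s in volume])  # (depth, d_feat)
volume_feat = slice_feats.mean(axis=0)                     # (d_feat,)
print(volume_feat.shape)  # (32,)
```

Mean pooling is just the simplest aggregator for the sketch; any depth-wise fusion (attention, learned tokens) fits the same slice-wise reuse pattern.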

📝 Abstract
3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which serve as two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from an insufficiently pretrained vision encoder and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained with 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs in both MRG and MVQA tasks. Our code will be released once this paper is accepted.
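The abstract's core mechanism is a mixture-of-experts whose routing is conditioned on the text prompt, so different tasks (MRG vs. MVQA) select different expert mixtures. The paper's hierarchical design is not spelled out here, so the sketch below shows only the flat text-guided gating idea; every name and dimension (`text_guided_moe`, 4 experts, 32-d image features, 16-d prompt embeddings) is an assumption for illustration, not the TGH-MoE implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions: 4 experts, 32-d image features, 16-d text-prompt embedding.
n_experts, d_img, d_txt = 4, 32, 16
experts = [rng.standard_normal((d_img, d_img)) / np.sqrt(d_img) for _ in range(n_experts)]
W_gate = rng.standard_normal((d_txt, n_experts)) / np.sqrt(d_txt)

def text_guided_moe(img_feat, txt_emb):
    """Route image features through experts, weighted by a gate computed from
    the text-prompt embedding rather than from the image itself, so the task
    implied by the prompt controls which experts dominate."""
    gate = softmax(txt_emb @ W_gate)                  # (n_experts,), sums to 1
    outs = np.stack([img_feat @ E for E in experts])  # (n_experts, d_img)
    return gate @ outs                                # (d_img,)

img_feat = rng.standard_normal(d_img)
mrg_prompt = rng.standard_normal(d_txt)  # stands in for an embedded report-generation prompt
out = text_guided_moe(img_feat, mrg_prompt)
print(out.shape)  # (32,)
```

Because the gate depends only on the prompt, two different prompts over the same volume can yield different expert mixtures, which is the task-adaptive behavior the abstract describes.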
Problem

Research questions and friction points this paper is trying to address.

3D medical image analysis
multimodal large language models
medical report generation
medical visual question answering
vision encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D medical image analysis
multimodal large language model
Text-Guided Hierarchical MoE
two-stage training
medical report generation
Yang Yu
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
Dunyuan Xu
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Yaoqian Li
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
Xiaomeng Li
Assistant Professor, The Hong Kong University of Science and Technology
Medical Image Analysis, AI in Healthcare, Deep Learning
Jinpeng Li
The Chinese University of Hong Kong
Deep Learning, Medical Image Analysis, Pedestrian Detection
Pheng-Ann Heng
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China; Institute of Medical Intelligence and XR, The Chinese University of Hong Kong, Hong Kong, China