HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical vision-language models (Med-VLMs) struggle to jointly optimize multimodal understanding and generation capabilities. To address this, we propose HealthGPT, an end-to-end medical multimodal large language model that unifies comprehension and generation within a single autoregressive paradigm. Methodologically, we introduce Heterogeneous Low-Rank Adaptation (H-LoRA) to adapt task-specific comprehension and generation knowledge to a pre-trained LLM, pair it with a hierarchical visual perception approach, and design a three-stage learning strategy for cross-modal alignment and joint knowledge transfer. HealthGPT is built on open-source LLMs and trained on VL-Health, a high-quality, self-curated medical multimodal dataset covering both comprehension and generation tasks. Experiments demonstrate strong performance and scalability on unified medical visual tasks, and the code and model weights are publicly released.

📝 Abstract
We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively train HealthGPT, we devise VL-Health, a comprehensive medical domain-specific comprehension and generation dataset. Experimental results demonstrate the exceptional performance and scalability of HealthGPT on unified medical visual tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
Problem

Research questions and friction points this paper is trying to address.

Integrates medical visual comprehension and generation
Adapts heterogeneous knowledge to pre-trained LLMs
Develops a medical domain-specific dataset VL-Health
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous low-rank adaptation
Hierarchical visual perception
Three-stage learning strategy
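The H-LoRA idea above can be illustrated with a toy sketch: a frozen pretrained weight matrix plus separate low-rank adapters for the comprehension and generation tasks, selected at forward time. This is a minimal numpy illustration under assumed shapes and names (`adapters`, `forward`, `alpha` are illustrative, not the paper's actual implementation, which routes and merges adapter experts in a more involved way).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size, low-rank dimension

# Frozen pretrained weight (stands in for an LLM projection matrix).
W = rng.standard_normal((d, d))

# One low-rank adapter pair (A, B) per task, in the spirit of H-LoRA's
# heterogeneous comprehension/generation plugins (hypothetical structure).
# A is zero-initialized so each adapter starts as a no-op, as in LoRA.
adapters = {
    task: (np.zeros((r, d)), rng.standard_normal((d, r)) * 0.01)
    for task in ("comprehension", "generation")
}

def forward(x, task, alpha=1.0):
    """y = x W^T + alpha * (x A^T) B^T using the selected task's adapter."""
    A, B = adapters[task]
    return x @ W.T + alpha * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
y = forward(x, "comprehension")  # identical to x @ W.T at initialization
```

Only the small `A`/`B` matrices would be trained per task; the shared `W` stays frozen, which is what makes adapting one backbone to heterogeneous tasks cheap.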
Authors
Tianwei Lin, Zhejiang University
Wenqiao Zhang, Zhejiang University
Sijing Li, Zhejiang University
Yuqian Yuan, Zhejiang University
Binhe Yu, University of Electronic Science and Technology of China
Haoyuan Li, Alibaba
Wanggui He, Alibaba Group
Hao Jiang, Alibaba
Mengze Li, The Hong Kong University of Science and Technology
Xiaohui Song, Zhejiang University
Siliang Tang, Zhejiang University
Jun Xiao, Zhejiang University
Hui Lin, Zhejiang University
Yueting Zhuang, Zhejiang University
Beng Chin Ooi, National University of Singapore