🤖 AI Summary
Medical large vision-language models (Med-LVLMs) struggle to jointly support multimodal comprehension and generation. HealthGPT addresses this by unifying medical visual comprehension and generation within a single autoregressive paradigm built on pre-trained open-source large language models (LLMs). Its core technique, Heterogeneous Low-Rank Adaptation (H-LoRA), progressively adapts heterogeneous comprehension and generation knowledge to the base LLM, and is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. The model is trained on VL-Health, a self-curated medical domain-specific comprehension and generation dataset, and demonstrates strong performance and scalability on unified medical visual tasks. The code is publicly released.
📝 Abstract
We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To train HealthGPT effectively, we devise VL-Health, a comprehensive medical domain-specific comprehension and generation dataset. Experimental results demonstrate the exceptional performance and scalability of HealthGPT on unified medical visual tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
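To make the H-LoRA idea more concrete, here is a minimal numpy sketch of task-specific low-rank adaptation. All names and the exact routing scheme are assumptions for illustration, not the paper's formulation: a frozen base weight `W` is augmented with a separate low-rank `(A, B)` pair per task family (comprehension vs. generation), so each forward pass adds only the adapter matching the current task.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2

# Frozen pre-trained weight of one linear layer
W = rng.standard_normal((d_in, d_out))

# One low-rank (A, B) adapter pair per task family (illustrative split;
# the actual H-LoRA structure may differ)
adapters = {
    task: (
        rng.standard_normal((d_in, rank)) * 0.01,  # A: d_in x r
        np.zeros((rank, d_out)),                   # B: r x d_out, zero-init
    )
    for task in ("comprehension", "generation")
}

def forward(x, task):
    """Base path plus the task-specific low-rank update: x @ (W + A @ B)."""
    A, B = adapters[task]
    return x @ W + x @ A @ B

x = rng.standard_normal((1, d_in))
# With B zero-initialized, every adapter starts as a no-op update,
# so the model initially matches the frozen base layer:
assert np.allclose(forward(x, "comprehension"), x @ W)
```

Zero-initializing `B` is the standard LoRA trick: training starts from the unmodified pre-trained behavior, and only the small `A`/`B` matrices are updated per task, which is what lets heterogeneous capabilities share one frozen backbone.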