Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Clinical decision-making is often inefficient and prone to missed diagnoses because heterogeneous medical data (text, 2D/3D imaging, and video) are hard to integrate. Existing medical vision-language models (VLMs) suffer from opaque pipelines, scarce high-quality data, and architectures that do not extend easily across modalities. To address these limitations, this work presents Hulu-Med, a transparent medical VLM that unifies understanding across these modalities. It pairs a unified patch-based vision encoder with an LLM decoder and is trained progressively on 16.7M samples, scaling from 2D to 3D and then video. A medical-aware token reduction mechanism keeps training efficient: the 7B to 32B parameter variants require only 4,000 to 40,000 GPU-hours. Across 30 benchmarks, Hulu-Med achieves state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems, and the complete pipeline is open-sourced.

📝 Abstract
Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that a high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released at https://github.com/ZJUI-AI4H/Hulu-Med.
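To put the reported training budget in perspective, the sketch below converts GPU-hours into approximate wall-clock time. The cluster size of 64 GPUs and the assumption of perfect linear scaling are illustrative choices, not figures from the paper; only the 4,000 and 40,000 GPU-hour bounds come from the abstract.

```python
def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Idealized wall-clock days, assuming perfect scaling across GPUs."""
    return gpu_hours / num_gpus / 24.0


NUM_GPUS = 64  # hypothetical cluster size, not stated in the source

# Lower and upper bounds of the training budget quoted in the abstract.
for label, gpu_hours in [("lower bound", 4_000), ("upper bound", 40_000)]:
    days = wall_clock_days(gpu_hours, NUM_GPUS)
    print(f"{label}: {gpu_hours} GPU-hours is about {days:.1f} days on {NUM_GPUS} GPUs")
```

Real training runs rarely scale perfectly, so these numbers should be read as an optimistic floor on elapsed time.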
Problem

Research questions and friction points this paper is trying to address.

Integrating diverse medical data modalities causes diagnostic inefficiencies
Medical vision-language models suffer from opaque pipelines and data scarcity
Existing architectures lack flexibility for holistic clinical decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified patch-based vision encoder with LLM decoder
Progressive training scaling from 2D to 3D and video
Medical-aware token reduction for efficient training
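As a rough illustration of why token reduction matters when one patch-based encoder handles 2D, 3D, and video inputs, the sketch below counts patch tokens per input. It is not the authors' implementation: the 16-pixel patch size, the input shapes, the treatment of a 3D volume or video as a stack of 2D slices or frames, and the uniform `keep_ratio` are all assumptions made for this example. The source does not describe how the medical-aware reduction selects tokens.

```python
import math


def patch_tokens(height: int, width: int, slices: int = 1, patch: int = 16) -> int:
    """Count patch tokens when each slice/frame is cut into patch x patch tiles."""
    per_slice = math.ceil(height / patch) * math.ceil(width / patch)
    return slices * per_slice


def reduced_tokens(total: int, keep_ratio: float) -> int:
    """Tokens left after keeping a fraction of them (stand-in for any reduction scheme)."""
    return math.ceil(total * keep_ratio)


# Hypothetical input shapes, chosen only to show the difference in scale.
image_2d = patch_tokens(512, 512)               # one 2D image
volume_3d = patch_tokens(512, 512, slices=64)   # 3D scan treated as 64 slices
video = patch_tokens(224, 224, slices=120)      # video treated as 120 frames

print("2D image tokens:", image_2d)
print("3D volume tokens:", volume_3d)
print("video tokens:", video)
print("3D volume after keeping 25%:", reduced_tokens(volume_3d, 0.25))
```

Under these assumptions a single volume produces dozens of times more tokens than a single image, which is the kind of growth a token reduction step is meant to contain.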
Authors

Songtao Jiang
Zhejiang University
Vision-Language Models, AI for Bioinformatics and Medical

Yuan Wang
Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310016, Zhejiang, China.

Sibo Song
Alibaba
computer vision, deep learning, multimodal learning

Tianxiang Hu
Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310016, Zhejiang, China.

Chenyi Zhou
Zhejiang University
artificial intelligence

Bin Pu
The Hong Kong University of Science and Technology | HNU | NTU
Computer vision, Medical image analysis, Ultrasound image processing, AI4Science

Yan Zhang
College of Computer Science and Technology, Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Hangzhou 310027, Zhejiang, China.

Zhibo Yang
Alibaba Inc, Hangzhou 310023, China.

Yang Feng
Angelalign Technology Inc., Shanghai 200082, China.

Joey Tianyi Zhou
A*STAR and NUS
Efficient AI, Robust & Safe AI

Jin Hao
Assistant Professor, Shanghai Jiao Tong University
Stem Cell Biology, Neuroscience, Brain organoids

Zijian Chen
Shanghai Jiao Tong University | Shanghai AI Laboratory
Image/Video Quality Assessment, Large Multi-modal Models

Ruijia Wu
School of Artificial Intelligence, Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200030, China.

Tao Tang
China Mobile Group Zhejiang Company Limited, Hangzhou 310016, Zhejiang, China.

Junhui Lv
Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Haining 314400, Zhejiang, China.

Hongxia Xu
Zhejiang University
AI4Science, Nanomedicine, Medical imaging

Hongwei Wang
College of Computer Science and Technology, Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Hangzhou 310027, Zhejiang, China.

Jun Xiao
College of Computer Science and Technology, Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Hangzhou 310027, Zhejiang, China.

Bin Feng
Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310016, Zhejiang, China.

Fudong Zhu
Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310016, Zhejiang, China.

Kenli Li
Cheung Kong Professor, Hunan University
High-performance Computing, Parallel and Distributed Processing, AI and Big Data

Weidi Xie
Shanghai Jiao Tong University | VGG, University of Oxford
Computer Vision, AI for Healthcare, AI for Science

Jimeng Sun
Professor at University of Illinois Urbana-Champaign
AI for healthcare, Machine learning for healthcare, deep learning for healthcare

Jian Wu
College of Computer Science and Technology, Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Hangzhou 310027, Zhejiang, China.

Zuozhu Liu
Assistant Professor, Zhejiang University/University of Illinois Urbana-Champaign
deep learning, vision-language models, medical AI