🤖 AI Summary
Clinical decision-making is often inefficient and prone to missed diagnoses due to challenges in fusing heterogeneous multimodal medical data—such as text, 2D/3D imaging, and video. Existing medical vision-language models (VLMs) suffer from architectural opacity, scarcity of high-quality annotations, and poor scalability across modalities. To address these limitations, we propose the first transparent, unified, full-modality medical VLM framework. Our approach introduces a medical-aware token compression mechanism and a progressive multi-scale patch encoder, enabling synergistic learning across 2D → 3D → video modalities. We employ end-to-end alignment training with efficient token reduction. Evaluated on 30 cross-modal medical benchmarks, our method achieves state-of-the-art performance. Models ranging from 7B to 32B parameters require only 4K–40K GPU-hours for training—matching or surpassing closed-source systems in accuracy while significantly enhancing clinical interpretability and deployment flexibility.
📝 Abstract
Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all of these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million samples to scale from 2D images to 3D volumes and video comprehension. A medical-aware token-reduction mechanism keeps training efficient, requiring only 4,000 to 40,000 GPU-hours for the 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks demonstrates state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems on tasks spanning visual question answering, medical report generation, and complex reasoning in multilingual and rare-disease scenarios. By open-sourcing our complete pipeline, we show that high-performance medical VLMs can be built transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released at [https://github.com/ZJUI-AI4H/Hulu-Med](https://github.com/ZJUI-AI4H/Hulu-Med).
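The abstract does not specify how the medical-aware token reduction works internally. As a minimal, purely illustrative sketch of the general idea of visual token reduction (not the paper's actual mechanism), one can score patch tokens by a saliency proxy, keep the top-scoring ones, and fuse the remainder into a single summary token so the LLM decoder sees a much shorter visual sequence:

```python
import numpy as np

def reduce_tokens(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Toy token reduction: retain the `keep` highest-saliency patch
    tokens (saliency approximated here by L2 norm) and fuse the
    discarded tokens into one mean token. Hypothetical illustration;
    Hulu-Med's medical-aware criterion is not described in this text."""
    saliency = np.linalg.norm(tokens, axis=1)   # (N,) per-token score
    order = np.argsort(-saliency)               # indices, descending score
    kept = tokens[order[:keep]]                 # (keep, D) salient tokens
    rest = tokens[order[keep:]]                 # (N - keep, D) the rest
    if rest.size:
        # Fuse low-saliency tokens into a single mean "summary" token
        kept = np.vstack([kept, rest.mean(axis=0, keepdims=True)])
    return kept

# e.g. 196 patch tokens of dim 8 -> 33 tokens (32 kept + 1 fused)
out = reduce_tokens(np.random.default_rng(0).normal(size=(196, 8)), keep=32)
print(out.shape)  # (33, 8)
```

The appeal of such pruning-plus-merging schemes is that sequence length, and hence attention cost, drops sharply while a residual summary token preserves some signal from the discarded patches.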