InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of general-purpose multimodal large language models (MLLMs) in healthcare, namely domain-specific knowledge gaps, inefficient knowledge distillation, and the high computational cost of continual pretraining, this paper introduces InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, a pair of specialized medical MLLMs. Methodologically, the authors propose a five-dimensional medical data quality evaluation framework, adopt a low-to-high-resolution progressive image training strategy coupled with multimodal sequence packing, and design a three-stage supervised fine-tuning pipeline to efficiently inject medical knowledge. The result is a significant improvement in medical visual question answering and diagnostic performance: InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B across multiple benchmarks, while the 4B variant surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating that domain specialization and computational efficiency can be jointly optimized within a lightweight architecture.
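The summary names multimodal sequence packing as one of the efficiency levers. The paper does not publish its packing code, so the snippet below is only a minimal sketch of the general idea: concatenating variable-length image-plus-text samples into fixed-length training sequences so that little compute is wasted on padding. The greedy first-fit strategy, the `Sample` fields, and the `max_len` value are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative only: the paper does not publish its packing code, so the field
# names, token counts, and greedy first-fit strategy below are assumptions.

@dataclass
class Sample:
    sample_id: str
    num_tokens: int  # text tokens plus image-patch tokens after the vision encoder

@dataclass
class PackedSequence:
    samples: list = field(default_factory=list)
    total_tokens: int = 0

def pack_sequences(samples, max_len=8192):
    """Greedily pack variable-length multimodal samples into sequences of at
    most max_len tokens so that little compute is spent on padding."""
    bins = []
    # Longest-first placement tends to leave less unusable space in each sequence.
    for s in sorted(samples, key=lambda x: x.num_tokens, reverse=True):
        for b in bins:
            if b.total_tokens + s.num_tokens <= max_len:
                b.samples.append(s)
                b.total_tokens += s.num_tokens
                break
        else:  # no existing sequence has room: start a new one
            bins.append(PackedSequence(samples=[s], total_tokens=s.num_tokens))
    return bins

if __name__ == "__main__":
    data = [Sample(f"ex{i}", n) for i, n in enumerate([3000, 1200, 5000, 700, 2500, 4100])]
    for seq in pack_sequences(data, max_len=8192):
        print([s.sample_id for s in seq.samples], seq.total_tokens)
```

In a real trainer the packed samples would also need a block-diagonal attention mask and reset position ids so they cannot attend to one another; the sketch only covers the grouping step.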

📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable potential across many domains, yet their application in medicine is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in fields such as radiology and pharmacology. Additionally, the computational cost of continual pretraining on large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combine high-quality general-purpose and medical multimodal data and propose a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ a low-to-high image-resolution training strategy together with multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. The InfiMed-Foundation-4B model is available at https://huggingface.co/InfiX-ai/InfiMed-Foundation-4B.
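The abstract's low-to-high image-resolution strategy amounts to a resolution curriculum over pre-training. The sketch below shows one way such a schedule could be expressed; the stage boundaries and pixel sizes are assumptions for illustration and are not taken from the paper.

```python
# A minimal sketch of a low-to-high resolution pre-training curriculum.
# The stage boundaries and resolutions below are illustrative assumptions;
# the paper describes the idea but these exact numbers are not from it.

RESOLUTION_SCHEDULE = [
    # (fraction of total pre-training steps completed, image size fed to the vision tower)
    (0.0, 256),   # early steps: small images, cheap forward/backward passes
    (0.5, 448),   # mid training: medium resolution
    (0.8, 768),   # final steps: high resolution for fine-grained findings (e.g. radiology)
]

def image_size_for_step(step: int, total_steps: int) -> int:
    """Return the training image resolution for the current optimizer step."""
    progress = step / max(total_steps, 1)
    size = RESOLUTION_SCHEDULE[0][1]
    for threshold, resolution in RESOLUTION_SCHEDULE:
        if progress >= threshold:
            size = resolution
    return size

# Example: a 100k-step run switches from 256px to 448px at step 50k and to 768px at step 80k.
assert image_size_for_step(10_000, 100_000) == 256
assert image_size_for_step(60_000, 100_000) == 448
assert image_size_for_step(90_000, 100_000) == 768
```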
Problem

Research questions and friction points this paper is trying to address.

Addressing medical MLLMs' knowledge gaps and hallucinatory responses
Overcoming computational inefficiency in medical data pretraining
Enhancing domain-specific knowledge extraction for medical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compute-efficient pre-training with multimodal sequence packing
Multi-stage supervised fine-tuning for medical tasks
Five-dimensional quality assessment for medical datasets (see the sketch after this list)
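Below is a minimal sketch of how a five-dimensional quality assessment could be turned into a data-curation filter, assuming a weighted-mean aggregation. The dimension names, equal weights, and the 0.7 cutoff are hypothetical; the paper defines its own dimensions and scoring procedure.

```python
# Hypothetical sketch of turning a five-dimensional quality assessment into a
# curation filter. The dimension names, equal weights, and 0.7 cutoff are
# assumptions; the paper defines its own dimensions and scoring procedure.

DIMENSIONS = (
    "relevance",         # is the sample medically relevant?
    "accuracy",          # is the answer or caption factually correct?
    "image_text_align",  # does the text actually describe the image?
    "completeness",      # is there enough context for the sample to be usable?
    "safety",            # free of protected health information and harmful content?
)

def quality_score(scores, weights=None):
    """Weighted mean of per-dimension scores, each expected to lie in [0, 1]."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def curate(dataset, threshold=0.7):
    """Keep only samples whose aggregate quality score clears the threshold."""
    return [ex for ex in dataset if quality_score(ex["scores"]) >= threshold]

examples = [
    {"id": "a", "scores": {"relevance": 0.9, "accuracy": 0.8, "image_text_align": 0.9,
                           "completeness": 0.7, "safety": 1.0}},
    {"id": "b", "scores": {"relevance": 0.4, "accuracy": 0.5, "image_text_align": 0.3,
                           "completeness": 0.6, "safety": 1.0}},
]
print([ex["id"] for ex in curate(examples)])  # -> ['a']
```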
Authors
Guanghao Zhu
The Hong Kong Polytechnic University
Zhitian Hou
Sun Yat-sen University
Zeyu Liu
The Hong Kong Polytechnic University
Zhijie Sang
Microsoft
NLP
Congkai Xie
Reallm Labs
Hongxia Yang
Professor, HK Polytechnic University
Machine Learning, Generative AI, Cognitive Intelligence, Statistical Modeling