InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of general-purpose multimodal large language models (MLLMs) in healthcare, namely domain-specific knowledge gaps, inefficient knowledge distillation, and the high computational cost of continual pretraining, this paper introduces InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, a pair of specialized medical MLLMs. Methodologically, the authors propose a five-dimensional medical data quality evaluation framework, adopt a low-to-high-resolution progressive image training strategy coupled with multimodal sequence packing, and design a three-stage supervised fine-tuning pipeline to efficiently inject medical knowledge. The result is a significant improvement in medical visual question answering and diagnostic performance: InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B across multiple benchmarks, while the 4B variant surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating that domain specialization and computational efficiency can be jointly optimized within a lightweight architecture.
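The summary names multimodal sequence packing as one of the efficiency levers. The paper does not publish its packing code, so the snippet below is only a minimal sketch of the general idea: concatenating variable-length image-plus-text samples into fixed-length training sequences so that little compute is wasted on padding. The greedy first-fit strategy, the `Sample` fields, and the `max_len` value are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative only: the paper does not publish its packing code, so the field
# names, token counts, and greedy first-fit strategy below are assumptions.

@dataclass
class Sample:
    sample_id: str
    num_tokens: int  # text tokens plus image-patch tokens after the vision encoder

@dataclass
class PackedSequence:
    samples: list = field(default_factory=list)
    total_tokens: int = 0

def pack_sequences(samples, max_len=8192):
    """Greedily pack variable-length multimodal samples into sequences of at
    most max_len tokens so that little compute is spent on padding."""
    bins = []
    # Longest-first placement tends to leave less unusable space in each sequence.
    for s in sorted(samples, key=lambda x: x.num_tokens, reverse=True):
        for b in bins:
            if b.total_tokens + s.num_tokens <= max_len:
                b.samples.append(s)
                b.total_tokens += s.num_tokens
                break
        else:  # no existing sequence has room: start a new one
            bins.append(PackedSequence(samples=[s], total_tokens=s.num_tokens))
    return bins

if __name__ == "__main__":
    data = [Sample(f"ex{i}", n) for i, n in enumerate([3000, 1200, 5000, 700, 2500, 4100])]
    for seq in pack_sequences(data, max_len=8192):
        print([s.sample_id for s in seq.samples], seq.total_tokens)
```

In a real trainer the packed samples would also need a block-diagonal attention mask and reset position ids so they cannot attend to one another; the sketch only covers the grouping step.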

📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable potential across many domains, yet their application in medicine is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in fields such as radiology and pharmacology. Additionally, the computational cost of continual pretraining on large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combine high-quality general-purpose and medical multimodal data and propose a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ a low-to-high image-resolution training strategy together with multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. The InfiMed-Foundation-4B model is available at https://huggingface.co/InfiX-ai/InfiMed-Foundation-4B.
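The abstract's low-to-high image-resolution strategy amounts to a resolution curriculum over pre-training. The sketch below shows one way such a schedule could be expressed; the stage boundaries and pixel sizes are assumptions for illustration and are not taken from the paper.

```python
# A minimal sketch of a low-to-high resolution pre-training curriculum.
# The stage boundaries and resolutions below are illustrative assumptions;
# the paper describes the idea but these exact numbers are not from it.

RESOLUTION_SCHEDULE = [
    # (fraction of total pre-training steps completed, image size fed to the vision tower)
    (0.0, 256),   # early steps: small images, cheap forward/backward passes
    (0.5, 448),   # mid training: medium resolution
    (0.8, 768),   # final steps: high resolution for fine-grained findings (e.g. radiology)
]

def image_size_for_step(step: int, total_steps: int) -> int:
    """Return the training image resolution for the current optimizer step."""
    progress = step / max(total_steps, 1)
    size = RESOLUTION_SCHEDULE[0][1]
    for threshold, resolution in RESOLUTION_SCHEDULE:
        if progress >= threshold:
            size = resolution
    return size

# Example: a 100k-step run switches from 256px to 448px at step 50k and to 768px at step 80k.
assert image_size_for_step(10_000, 100_000) == 256
assert image_size_for_step(60_000, 100_000) == 448
assert image_size_for_step(90_000, 100_000) == 768
```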
Problem

Research questions and friction points this paper is trying to address.

Addressing medical MLLMs' knowledge gaps and hallucinatory responses
Overcoming computational inefficiency in medical data pretraining
Enhancing domain-specific knowledge extraction for medical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compute-efficient pre-training with multimodal sequence packing
Multi-stage supervised fine-tuning for medical tasks
Five-dimensional quality assessment for medical datasets (see the sketch after this list)
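Below is a minimal sketch of how a five-dimensional quality assessment could be turned into a data-curation filter, assuming a weighted-mean aggregation. The dimension names, equal weights, and the 0.7 cutoff are hypothetical; the paper defines its own dimensions and scoring procedure.

```python
# Hypothetical sketch of turning a five-dimensional quality assessment into a
# curation filter. The dimension names, equal weights, and 0.7 cutoff are
# assumptions; the paper defines its own dimensions and scoring procedure.

DIMENSIONS = (
    "relevance",         # is the sample medically relevant?
    "accuracy",          # is the answer or caption factually correct?
    "image_text_align",  # does the text actually describe the image?
    "completeness",      # is there enough context for the sample to be usable?
    "safety",            # free of protected health information and harmful content?
)

def quality_score(scores, weights=None):
    """Weighted mean of per-dimension scores, each expected to lie in [0, 1]."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total

def curate(dataset, threshold=0.7):
    """Keep only samples whose aggregate quality score clears the threshold."""
    return [ex for ex in dataset if quality_score(ex["scores"]) >= threshold]

examples = [
    {"id": "a", "scores": {"relevance": 0.9, "accuracy": 0.8, "image_text_align": 0.9,
                           "completeness": 0.7, "safety": 1.0}},
    {"id": "b", "scores": {"relevance": 0.4, "accuracy": 0.5, "image_text_align": 0.3,
                           "completeness": 0.6, "safety": 1.0}},
]
print([ex["id"] for ex in curate(examples)])  # -> ['a']
```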
Authors
Guanghao Zhu
The Hong Kong Polytechnic University
Zhitian Hou
Sun Yat-sen University
Zeyu Liu
The Hong Kong Polytechnic University
Zhijie Sang
Microsoft
NLP
Congkai Xie
Reallm Labs
Hongxia Yang
Professor, HK Polytechnic University
Machine Learning, Generative AI, Cognitive Intelligence, Statistical Modeling