🤖 AI Summary
To address the limitations of general-purpose multimodal large language models (MLLMs) in healthcare—including domain-specific knowledge gaps, inefficient knowledge distillation, and high computational costs for continual pretraining—this paper introduces InfiMed, a series of specialized medical MLLMs. Methodologically, the authors propose a five-dimensional medical data quality evaluation framework; adopt a low-to-high-resolution progressive image training strategy coupled with multimodal sequence packing; and design a three-stage supervised fine-tuning pipeline to efficiently inject medical knowledge. The contributions include significant improvements in medical visual question answering and diagnostic performance: InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B across multiple benchmarks, while the 4B variant surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating that domain specialization and computational efficiency can be jointly optimized within a lightweight architecture.
📝 Abstract
Multimodal large language models (MLLMs) have shown remarkable potential in various domains, yet their application in the medical field is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in radiology and pharmacology. Additionally, the computational cost of continual pretraining with large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combine high-quality general-purpose and medical multimodal data and propose a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ a low-to-high image resolution strategy and multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. The InfiMed-Foundation-4B model is available at [InfiMed-Foundation-4B](https://huggingface.co/InfiX-ai/InfiMed-Foundation-4B).
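The abstract does not detail how multimodal sequence packing is implemented. As a rough, hedged illustration of the general idea — concatenating variable-length training samples into fixed-length contexts so fewer tokens are wasted on padding — here is a minimal greedy first-fit sketch; the function name, lengths, and `max_len` are illustrative assumptions, not the paper's actual method:

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing: place each sequence (by token
    count) into the first bin with room, opening a new bin when none fits.
    Each bin corresponds to one packed training context of at most max_len
    tokens. Returns a list of [used_tokens, [sequence indices]] bins."""
    bins = []
    for idx, n in sorted(enumerate(lengths), key=lambda p: -p[1]):
        if n > max_len:
            raise ValueError(f"sequence {idx} ({n} tokens) exceeds max_len")
        for b in bins:
            if b[0] + n <= max_len:
                b[0] += n
                b[1].append(idx)
                break
        else:
            bins.append([n, [idx]])
    return bins

if __name__ == "__main__":
    # Hypothetical per-sample token counts for seven multimodal examples.
    lengths = [900, 300, 650, 120, 80, 1500, 400]
    packs = pack_sequences(lengths, max_len=2048)
    # Every pack fits the context window; every sample appears exactly once.
    assert all(used <= 2048 for used, _ in packs)
    assert sorted(i for _, ids in packs for i in ids) == list(range(len(lengths)))
    print(len(packs), "packed contexts instead of", len(lengths))  # → 2 instead of 7
```

In a real training pipeline, attention masks are also adjusted so packed samples cannot attend to one another; that detail is omitted here for brevity.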