Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical multimodal large language models (MLLMs) suffer from three key limitations: narrow medical knowledge coverage (overemphasizing imaging), high hallucination rates (stemming from insufficient data curation), and weak complex reasoning. To address these, we propose Lingshu, a general-purpose foundation model for unified multimodal medical understanding and reasoning. The approach combines cross-modal medical data curation and synthesis with a multi-stage progressive training paradigm, and additionally explores reinforcement learning with verifiable rewards (RLVR) to strengthen medical reasoning. We concurrently release MedEvalKit, a unified evaluation framework. Empirical results show that Lingshu consistently outperforms existing open-source medical MLLMs across multimodal question answering, text-based question answering, and medical report generation, reducing hallucination rates while improving clinical consistency and reasoning reliability.
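
For intuition, here is a minimal Python sketch of the kind of verifiable reward that RLVR relies on for multiple-choice medical QA. The `Answer:` tag format, the `verifiable_reward` name, and the binary 0/1 scoring are illustrative assumptions, not the paper's published recipe.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 iff the model's final answer matches the gold label."""
    # Assumes responses end with a tag like "Answer: B"; the tag format
    # and the binary reward are assumptions made for illustration.
    match = re.search(r"Answer:\s*([A-E])", response, flags=re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).upper() == gold_answer.upper() else 0.0

# The reward is checked programmatically against ground truth, so no learned
# reward model is needed and fluent-but-wrong free text earns nothing.
assert verifiable_reward("The CT shows a fracture. Answer: B", "B") == 1.0
assert verifiable_reward("Answer: C", "B") == 0.0
```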

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, and (3) a lack of reasoning capabilities tailored to complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. In addition, we preliminarily explore the potential of applying the reinforcement learning with verifiable rewards (RLVR) paradigm to enhance Lingshu's medical reasoning ability. We also develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks: multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms existing open-source multimodal models on most tasks ...
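
MedEvalKit's actual interface is not reproduced here; the sketch below (all names hypothetical) shows the general shape of a unified evaluation harness: every model is run over the same benchmark registry, with one fixed metric per benchmark, so comparisons are standardized.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    prompt: str     # question text (plus an image path for multimodal QA)
    reference: str  # gold answer or reference report

def evaluate(model: Callable[[str], str],
             benchmark: list[Example],
             metric: Callable[[str, str], float]) -> float:
    """Score one model on one benchmark with one metric, identically for all models."""
    scores = [metric(model(ex.prompt), ex.reference) for ex in benchmark]
    return sum(scores) / len(scores)

# A unified kit is, at heart, a registry of (dataset, metric) pairs so every
# model sees the same prompts and the same scoring rule.
REGISTRY: dict[str, tuple[list[Example], Callable[[str, str], float]]] = {
    "demo_vqa": ([Example("Q: Is the X-ray normal? Answer yes or no.", "yes")],
                 lambda pred, ref: float(ref in pred.lower())),
}

data, metric = REGISTRY["demo_vqa"]
print("demo_vqa", evaluate(lambda p: "yes", data, metric))  # demo_vqa 1.0
```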
Problem

Research questions and friction points this paper is trying to address.

Limited medical knowledge coverage beyond imaging
High susceptibility to hallucinations due to suboptimal data curation
Lack of reasoning for complex medical scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive medical data curation from multimodal sources
Multi-stage training for embedding medical expertise (see the configuration sketch after this list)
Reinforcement learning with verifiable rewards for reasoning
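
To make "multi-stage progressive training" concrete, here is a minimal sketch of a staged curriculum configuration. The stage names, data mixes, and per-stage unfreezing schedule are assumptions for illustration, not Lingshu's published training recipe.

```python
# Hypothetical three-stage curriculum: earlier stages align modalities and
# inject knowledge; later stages unfreeze more modules for task-solving SFT.
STAGES = [
    {"name": "vision_text_alignment", "data": ["medical_captions"],
     "unfreeze": ["projector"]},
    {"name": "knowledge_injection",   "data": ["medical_text", "general_vqa"],
     "unfreeze": ["projector", "llm"]},
    {"name": "instruction_tuning",    "data": ["medical_vqa", "report_gen"],
     "unfreeze": ["projector", "llm", "vision_encoder"]},
]

def run_curriculum(stages: list[dict]) -> None:
    for stage in stages:
        # train_one_stage(model, stage["data"], unfreeze=stage["unfreeze"])
        print(f"stage={stage['name']!r} data={stage['data']} "
              f"unfreeze={stage['unfreeze']}")

run_curriculum(STAGES)
```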