Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
The strong heterogeneity and format inconsistency across medical multimodal data, spanning 2D images, 3D volumetric scans, and video sequences, severely hinder the development of unified medical multimodal large language models (MLLMs). To address this, we propose Fleming-VL, a unified end-to-end framework for medical visual understanding across these heterogeneous modalities. Our method combines long-context pretraining on both natural and medical-specific data, supervised fine-tuning (SFT) that supplements scarce medical data such as holistic video and underrepresented 2D modalities, and group relative policy optimization (GRPO). We also extend existing evaluation frameworks with 3D volumetric and video understanding benchmarks. Trained jointly on natural images and domain-specific medical data, Fleming-VL achieves state-of-the-art performance on medical visual question answering, video question answering, and 3D medical image understanding. We publicly release Fleming-VL to promote reproducibility and transparency in medical AI research.
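The summary names GRPO without detail; its core idea is to sample a group of responses per prompt, score each with a reward, and standardize the rewards against the group so no learned value function is needed. A minimal sketch under that reading, with illustrative names (grpo_advantages, the 0/1 correctness reward) that are not from the paper:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt.

    rewards holds one scalar per sampled response in the group;
    each advantage is the reward standardized against the group
    mean and standard deviation, so no critic network is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate answers to one medical VQA prompt,
# rewarded 1.0 when the final answer matches the reference.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct answers get positive advantage
```

In standard GRPO these advantages then weight a clipped policy-gradient update on each response's tokens.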

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature -- encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL in multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding. We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.
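The abstract does not spell out how 2D, 3D, and video inputs are unified, but a common data-centric approach is to render every modality as a short frame sequence that a frame-based vision encoder can consume: 3D scans contribute sampled slices, videos contribute sampled frames. A minimal sketch of that assumption, where to_frames, the modality tags, and max_frames are hypothetical names for illustration:

```python
import numpy as np

def to_frames(sample: np.ndarray, modality: str, max_frames: int = 16) -> np.ndarray:
    """Normalize heterogeneous medical inputs to a (T, H, W, C) frame stack.

    modality: "2d"    -> single image (H, W, C), wrapped as T=1
              "3d"    -> volumetric scan (D, H, W), sampled axial slices
              "video" -> clip (T, H, W, C), uniformly subsampled
    """
    if modality == "2d":
        return sample[None, ...]                          # (1, H, W, C)
    if modality == "3d":
        depth = sample.shape[0]
        idx = np.linspace(0, depth - 1, min(depth, max_frames)).astype(int)
        slices = sample[idx]                              # (T, H, W)
        return np.repeat(slices[..., None], 3, axis=-1)   # grayscale -> 3-channel
    if modality == "video":
        t = sample.shape[0]
        idx = np.linspace(0, t - 1, min(t, max_frames)).astype(int)
        return sample[idx]
    raise ValueError(f"unknown modality: {modality}")

# Example: a 64-slice CT volume becomes 16 pseudo-RGB frames.
ct = np.random.rand(64, 256, 256).astype(np.float32)
print(to_frames(ct, "3d").shape)  # (16, 256, 256, 3)
```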
Problem

Research questions and friction points this paper is trying to address.

Integrating heterogeneous medical data modalities into a single model
Bridging domain gaps and format inconsistencies across 2D, 3D, and video medical data
Developing a unified framework for medical visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales up pretraining with long-context data from natural and medical-specific domains
Complements fine-tuning with rare medical data, including holistic video and underrepresented 2D modalities such as ultrasound and dermoscopy
Extends existing evaluation frameworks with 3D volumetric and video understanding benchmarks (see the sketch after this list)
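Extending the evaluation side is mostly plumbing: each 3D or video benchmark item passes through the same frame preprocessing and the generated answer is scored, for example by exact match. A minimal sketch; model.generate and the item schema ("frames", "question", "answer") are assumptions for illustration, not the paper's interface:

```python
def exact_match_accuracy(model, benchmark) -> float:
    """Score a multimodal QA benchmark by case-insensitive exact match.

    benchmark yields dicts with "frames" (a preprocessed frame stack),
    "question", and a reference "answer"; model.generate is a
    hypothetical inference call returning the model's answer string.
    """
    correct, total = 0, 0
    for item in benchmark:
        pred = model.generate(item["frames"], item["question"])
        correct += int(pred.strip().lower() == item["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)
```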
🔎 Similar Papers
No similar papers found.