Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

📅 2024-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing biomedical multimodal large language models (MLLMs) lack pixel-level fine-grained understanding and interactive capabilities, limiting their clinical utility. To address this, we propose MedPLIB—a fully end-to-end model enabling arbitrary pixel-level interactions (e.g., points, bounding boxes, free-form masks) with precise spatial grounding, while jointly modeling visual question answering (VQA) and pixel-level localization. Our method introduces a novel Mixture-of-Experts (MoE)-based multi-stage training paradigm that decouples visual-language comprehension from pixel localization tasks. Furthermore, we construct MeCoVQA—the first eight-modality, complex medical VQA dataset featuring dense pixel annotations. Extensive experiments demonstrate that MedPLIB achieves state-of-the-art performance across diverse biomedical vision-language benchmarks. Notably, in zero-shot pixel localization, it attains a +19.7 mDice improvement over the best-performing small model and +15.6 over the strongest large model.

📝 Abstract
In recent years, Multimodal Large Language Models (MLLMs) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and their flexibility of use. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while keeping the computational cost at inference equivalent to that of a single expert model. To advance research on biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB achieves state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.
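The abstract's claim that inference cost stays equivalent to a single expert follows from sparse (top-1) MoE routing: a router scores the experts per input and only the highest-scoring expert is executed. The sketch below is not the authors' code; the expert functions and routing weights are hypothetical stand-ins that only illustrate the dispatch mechanism.

```python
# Minimal top-1 MoE routing sketch (illustrative; NOT the MedPLIB implementation).
# A router scores two pre-trained experts and dispatches each input to exactly
# one of them, so per-input compute stays close to a single expert model.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def vision_language_expert(token):
    # hypothetical stand-in for the visual-language expert
    return [2.0 * t for t in token]

def pixel_grounding_expert(token):
    # hypothetical stand-in for the pixel-grounding expert
    return [t + 1.0 for t in token]

EXPERTS = [vision_language_expert, pixel_grounding_expert]

def router_logits(token, weights):
    # one logit per expert: dot product of the input with learned routing weights
    return [sum(w * t for w, t in zip(ws, token)) for ws in weights]

def moe_forward(token, weights):
    # top-1 routing: only the highest-probability expert runs for this input
    probs = softmax(router_logits(token, weights))
    k = max(range(len(probs)), key=lambda i: probs[i])
    return EXPERTS[k](token), k

# Toy routing weights: the first row favors this input, so expert 0 is chosen.
W = [[1.0, 1.0], [-1.0, -1.0]]
out, chosen = moe_forward([0.5, 0.5], W)
```

In the paper's multi-stage scheme, each expert would first be trained separately on its task before the router and experts are jointly fine-tuned; the dispatch step above is what keeps inference cost at roughly one expert's forward pass.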
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Biomedical Field
Pixel-level Operation Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

MedPLIB
Multi-modal Large Language Model
Medical Image Understanding
Xiaoshuang Huang
Baidu Inc, China Agricultural University
Lingdong Shen
Institute of Automation, Chinese Academy of Sciences
Jia Liu
Baidu Inc
Fangxin Shang
Baidu Inc
Hongxiang Li
Peking University
Haifeng Huang
Iowa State University
Computer Vision, Multi-modal Learning
Yehui Yang
Baidu, Bytedance
Computer Vision, Multimodal machine learning related applications