Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical multimodal large language models (MLLMs) suffer from loose image–task alignment and limited generalization in multi-task learning due to coarse-grained data construction. To address this, we propose an image-centric multi-annotation paradigm and introduce IMAX, the first X-ray image-centric multi-task dataset, covering seven clinical tasks, with each image associated with an average of 4.10 tasks and 7.46 training entries, enabling fine-grained, dense image–task alignment. Compared to a conventional decentralized multi-annotation dataset (DMAX), IMAX improves average multi-task performance by 3.20%–21.05% across seven open-source medical MLLMs, significantly boosting cross-task generalization and clinical capability in multidimensional radiographic interpretation. We further uncover statistical correlations between optimization dynamics and multi-task performance, and leverage them in an optimized DMAX-based training strategy for settings where high-quality IMAX data is hard to obtain.

📝 Abstract
The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. Specifically, IMAX features the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-task learning in medical foundation models
Addressing decentralized image-task alignment in medical data
Improving multi-dimensional image interpretation for clinical needs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Image-centric multi-annotation X-ray dataset (IMAX)
High-quality data curation with 354K entries
Dense annotation averaging 4.10 tasks per image
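The image-centric construction described above can be sketched as regrouping task-level training entries by image, so that each image carries all of its task annotations instead of being scattered across task-specific datasets. The record format and field names below are illustrative assumptions, not the paper's actual schema:

```python
from collections import defaultdict

def build_image_centric_records(entries):
    """Regroup flat task-level entries into image-centric records.

    Each entry is assumed to look like
    {"image_id": ..., "task": ..., "text": ...} (hypothetical format).
    """
    records = defaultdict(lambda: {"tasks": set(), "entries": []})
    for entry in entries:
        record = records[entry["image_id"]]
        record["tasks"].add(entry["task"])   # distinct tasks per image
        record["entries"].append(entry)      # all training entries per image
    return dict(records)

# Toy example: three entries over two images (task names are illustrative)
entries = [
    {"image_id": "xray_001", "task": "report_generation", "text": "..."},
    {"image_id": "xray_001", "task": "vqa", "text": "..."},
    {"image_id": "xray_002", "task": "diagnosis", "text": "..."},
]
records = build_image_centric_records(entries)

# The paper's per-image statistics (4.10 tasks, 7.46 entries) correspond
# to averages of this kind over the full dataset.
avg_tasks = sum(len(r["tasks"]) for r in records.values()) / len(records)
avg_entries = sum(len(r["entries"]) for r in records.values()) / len(records)
```

Under this grouping, the dense image–task alignment reduces to maximizing how many of the seven tasks each image is annotated for, rather than simply adding more images.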
Xun Zhu
Department of Electronic Engineering, Tsinghua University, Beijing, China
Fanbin Mo
School of Artificial Intelligence, BUPT, Beijing, China
Zheng Zhang
Department of Electronic Engineering, Tsinghua University, Beijing, China
Jiaxi Wang
Tsinghua University
machine learning
Yiming Shi
University of Electronic Science and Technology of China
Efficient AI, Parameter-Efficient Fine-Tuning, Diffusion, Multimodal
Ming Wu
School of Artificial Intelligence, BUPT, Beijing, China
Chuang Zhang
Tsinghua University
Autonomous Driving, Intelligent Connected Vehicle
Miao Li
Department of Electronic Engineering, Tsinghua University, Beijing, China
Ji Wu
Tsinghua University
Artificial Intelligence, smart healthcare, machine learning, pattern recognition, speech recognition