🤖 AI Summary
Existing medical multimodal large language models (MLLMs) suffer from loose image–task alignment and limited generalization in multi-task learning due to coarse-grained data construction. To address this, we propose an image-centric multi-annotation paradigm and introduce IMAX—the first X-ray image-centric multi-task dataset—covering seven clinical tasks, with each image associated on average with 4.10 tasks and 7.46 training entries, enabling fine-grained, dense image–task alignment. Compared with the general decentralized multi-annotation X-ray dataset (DMAX), IMAX improves average multi-task performance by 3.20%–21.05% across seven open-source medical MLLMs, significantly boosting cross-task generalization and clinical capability in multi-dimensional radiographic interpretation. We further uncover statistical correlations between optimization dynamics and multi-task performance and, building on IMAX's construction principles, propose an optimized DMAX-based training strategy for settings where high-quality image-centric data is hard to obtain.
📝 Abstract
The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) at the data construction level. Specifically, IMAX features the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant average multi-task performance gains, ranging from 3.20% to 21.05%, across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in the statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the difficulty of obtaining high-quality IMAX data in practical scenarios.
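The contrast between decentralized annotation and image-centric dense annotation can be sketched as a simple regrouping operation. The record fields (`image`, `task`, `text`) and the sample contents below are illustrative placeholders, not the actual IMAX/DMAX format; the sketch only shows how flat per-task entries are grouped by image and how per-image statistics analogous to the reported 4.10 tasks and 7.46 entries per image would be computed:

```python
from collections import defaultdict

# Hypothetical flat, decentralized (DMAX-style) entries: one task annotation each.
entries = [
    {"image": "xray_001.png", "task": "report_generation", "text": "..."},
    {"image": "xray_001.png", "task": "disease_classification", "text": "..."},
    {"image": "xray_001.png", "task": "vqa", "text": "..."},
    {"image": "xray_002.png", "task": "vqa", "text": "..."},
    {"image": "xray_002.png", "task": "vqa", "text": "..."},
]

def to_image_centric(entries):
    """Group flat entries into image-centric records: image -> task -> annotations."""
    records = defaultdict(lambda: defaultdict(list))
    for e in entries:
        records[e["image"]][e["task"]].append(e["text"])
    return records

records = to_image_centric(entries)

# Per-image density statistics (analogous to tasks/image and entries/image).
avg_tasks = sum(len(tasks) for tasks in records.values()) / len(records)
avg_entries = sum(
    sum(len(anns) for anns in tasks.values()) for tasks in records.values()
) / len(records)
print(avg_tasks, avg_entries)  # 2.0 2.5 for this toy sample
```

Grouping by image rather than by task is what lets a single X-ray contribute supervision to several tasks at once, which is the alignment property the dataset is built around.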