RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical vision-language models struggle to jointly generate diagnostic text and pixel-level segmentation masks, limiting their clinical utility. To address this, we propose a unified vision-language framework capable of simultaneous diagnostic report generation and multi-target segmentation. Our approach introduces a task-alignment mechanism and a hierarchical multimodal fusion strategy to enable end-to-end, synergistic reasoning across abnormality detection, textual description, and fine-grained mask prediction. To support training and evaluation, we introduce RadDiagSeg-D, a novel radiology dataset featuring image-text pairs with pixel-level annotations across multiple organs and pathologies. Extensive experiments demonstrate that our method significantly outperforms existing baselines on joint text generation and segmentation tasks, establishing a new benchmark for multimodal joint output in medical imaging.
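
The page does not include an implementation, but a common pattern for this kind of joint text-and-mask generation is to have the language model emit a special segmentation token per finding and route that token's hidden state to a mask decoder as a prompt. The sketch below illustrates that pattern only; all names (JointTextMaskHead, seg_positions, mask_decoder) are hypothetical and are not claimed to match RadDiagSeg-M's actual architecture.

```python
# Hedged sketch of joint text-and-mask decoding via <SEG> tokens.
# Everything here is illustrative, not the paper's published interface.
import torch
import torch.nn as nn

class JointTextMaskHead(nn.Module):
    """Projects the hidden state at each special <SEG> token into a prompt
    embedding for a mask decoder, so a single forward pass can yield both a
    textual report and one mask per referenced finding."""

    def __init__(self, hidden_dim: int, mask_dim: int = 256):
        super().__init__()
        self.seg_proj = nn.Linear(hidden_dim, mask_dim)

    def forward(self, hidden_states: torch.Tensor, seg_positions: torch.Tensor,
                image_features: torch.Tensor, mask_decoder):
        # hidden_states: (seq_len, hidden_dim) output of the language model
        # seg_positions: indices of the <SEG> tokens, one per segmentation target
        prompts = self.seg_proj(hidden_states[seg_positions])  # (n_targets, mask_dim)
        # mask_decoder is any callable mapping (image_features, prompt) -> mask;
        # it stands in for a promptable mask decoder and is an assumption of this sketch
        return [mask_decoder(image_features, p) for p in prompts]
```

Emitting one prompt embedding per `<SEG>` token is what makes the segmentation "multi-target": the report can reference several findings, and each reference yields its own mask.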

📝 Abstract
Most current medical vision-language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major barrier to clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified, hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is designed specifically to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. We then leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need for richer contextual information in assistive diagnosis. Finally, we benchmark RadDiagSeg-M and show strong performance across all components of the multi-target text-and-mask generation task, establishing a robust and competitive baseline.
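
As a rough illustration of what a unified, hierarchical detection-diagnosis-segmentation sample might look like, here is a hypothetical record layout; the field names are assumptions made for this sketch, not RadDiagSeg-D's published schema.

```python
# Hypothetical record layout for a hierarchical detection -> diagnosis ->
# multi-target segmentation sample. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class RadDiagSegRecord:
    image_path: str        # path to the radiology image (e.g., a CT slice)
    modality: str          # imaging modality, e.g., "CT" or "MRI"
    is_abnormal: bool      # level 1: abnormality detection label
    diagnosis_text: str    # level 2: free-text diagnostic description
    # level 3: one mask per segmentation target, keyed by target name
    mask_paths: dict = field(default_factory=dict)

# Illustrative instance (all values are made up for the example):
record = RadDiagSegRecord(
    image_path="images/case_0001.nii.gz",
    modality="CT",
    is_abnormal=True,
    diagnosis_text="Hypodense hepatic lesion suspicious for metastasis.",
    mask_paths={"liver": "masks/case_0001_liver.nii.gz",
                "lesion": "masks/case_0001_lesion.nii.gz"},
)
```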
Problem

Research questions and friction points this paper is trying to address.

Generating diagnostic text and segmentation masks simultaneously
Overcoming limitations of current medical vision-language models
Enabling joint abnormality detection, diagnosis, and flexible segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines diagnostic text with pixel-level segmentation masks
Leverages unified hierarchical dataset for multi-target segmentation
Generates descriptive text and segmentation masks simultaneously
👥 Authors

Chengrun Li · University of Zurich, Switzerland
Corentin Royer · UZH · Large Language Models
Haozhe Luo · University of Bern (ARTORG) · Medical Image Analysis, Computer Vision
Bastian Wittmann · University of Zurich, Switzerland
Xia Li · ETH Zurich, Switzerland
Ibrahim Hamamci · University of Zurich, Switzerland
Sezgin Er · University of Zurich, Switzerland
Anjany Sekuboyina · CEO (Bonescreen) & Affiliated Researcher (UZH) · Computer Vision, Biomedical Imaging, Machine Learning
Bjoern Menze · Universität Zürich · Biomedical Image Analysis, Medical Image Analysis, Medical Image Computing, Machine Learning