RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current medical vision-language models struggle to jointly generate diagnostic text and pixel-level segmentation masks, limiting their clinical utility. To address this, we propose a unified vision-language framework capable of simultaneous diagnostic report generation and multi-target segmentation. Our approach introduces a task-alignment mechanism and a hierarchical multimodal fusion strategy to enable end-to-end, synergistic reasoning across abnormality detection, textual description, and fine-grained mask prediction. To support training and evaluation, we introduce RadDiagSeg-D, a novel radiology dataset featuring image-text pairs with pixel-level annotations across multiple organs and pathologies. Extensive experiments demonstrate that our method significantly outperforms existing baselines on joint text generation and segmentation tasks, establishing a new benchmark for multimodal joint output in medical imaging.
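
The page does not include an implementation, but a common pattern for this kind of joint text-and-mask generation is to have the language model emit a special segmentation token per finding and route that token's hidden state to a mask decoder as a prompt. The sketch below illustrates that pattern only; all names (JointTextMaskHead, seg_positions, mask_decoder) are hypothetical and are not claimed to match RadDiagSeg-M's actual architecture.

```python
# Hedged sketch of joint text-and-mask decoding via <SEG> tokens.
# Everything here is illustrative, not the paper's published interface.
import torch
import torch.nn as nn

class JointTextMaskHead(nn.Module):
    """Projects the hidden state at each special <SEG> token into a prompt
    embedding for a mask decoder, so a single forward pass can yield both a
    textual report and one mask per referenced finding."""

    def __init__(self, hidden_dim: int, mask_dim: int = 256):
        super().__init__()
        self.seg_proj = nn.Linear(hidden_dim, mask_dim)

    def forward(self, hidden_states: torch.Tensor, seg_positions: torch.Tensor,
                image_features: torch.Tensor, mask_decoder):
        # hidden_states: (seq_len, hidden_dim) output of the language model
        # seg_positions: indices of the <SEG> tokens, one per segmentation target
        prompts = self.seg_proj(hidden_states[seg_positions])  # (n_targets, mask_dim)
        # mask_decoder is any callable mapping (image_features, prompt) -> mask;
        # it stands in for a promptable mask decoder and is an assumption of this sketch
        return [mask_decoder(image_features, p) for p in prompts]
```

Emitting one prompt embedding per `<SEG>` token is what makes the segmentation "multi-target": the report can reference several findings, and each reference yields its own mask.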

📝 Abstract
Most current medical vision-language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major barrier to clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified, hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is designed specifically to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. We then leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need for richer contextual information in assistive diagnosis. Finally, we benchmark RadDiagSeg-M and show strong performance across all components of the multi-target text-and-mask generation task, establishing a robust and competitive baseline.
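
As a rough illustration of what a unified, hierarchical detection-diagnosis-segmentation sample might look like, here is a hypothetical record layout; the field names are assumptions made for this sketch, not RadDiagSeg-D's published schema.

```python
# Hypothetical record layout for a hierarchical detection -> diagnosis ->
# multi-target segmentation sample. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class RadDiagSegRecord:
    image_path: str        # path to the radiology image (e.g., a CT slice)
    modality: str          # imaging modality, e.g., "CT" or "MRI"
    is_abnormal: bool      # level 1: abnormality detection label
    diagnosis_text: str    # level 2: free-text diagnostic description
    # level 3: one mask per segmentation target, keyed by target name
    mask_paths: dict = field(default_factory=dict)

# Illustrative instance (all values are made up for the example):
record = RadDiagSegRecord(
    image_path="images/case_0001.nii.gz",
    modality="CT",
    is_abnormal=True,
    diagnosis_text="Hypodense hepatic lesion suspicious for metastasis.",
    mask_paths={"liver": "masks/case_0001_liver.nii.gz",
                "lesion": "masks/case_0001_lesion.nii.gz"},
)
```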
Problem

Research questions and friction points this paper is trying to address.

Generating diagnostic text and segmentation masks simultaneously
Overcoming limitations of current medical vision-language models
Enabling joint abnormality detection, diagnosis, and flexible segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines diagnostic text with pixel-level segmentation masks
Leverages unified hierarchical dataset for multi-target segmentation
Generates descriptive text and segmentation masks simultaneously
👥 Authors

Chengrun Li · University of Zurich, Switzerland
Corentin Royer · UZH · Large Language Models
Haozhe Luo · University of Bern (ARTORG) · Medical Image Analysis, Computer Vision
Bastian Wittmann · University of Zurich, Switzerland
Xia Li · ETH Zurich, Switzerland
Ibrahim Hamamci · University of Zurich, Switzerland
Sezgin Er · University of Zurich, Switzerland
Anjany Sekuboyina · CEO (Bonescreen) & Affiliated Researcher (UZH) · Computer Vision, Biomedical Imaging, Machine Learning
Bjoern Menze · Universität Zürich · Biomedical Image Analysis, Medical Image Analysis, Medical Image Computing, Machine Learning