Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

๐Ÿ“… 2025-11-19
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current chest X-ray (CXR) lesion segmentation models suffer from scarce pixel-level annotations and heavy reliance on lengthy, expert-written textual reports, limiting clinical applicability. To address this, we propose an instruction-guided lesion segmentation paradigm and introduce MIMIC-ILSโ€”the first large-scale instruction-driven CXR dataset, comprising 1.1 million instruction-mask pairs that support multi-lesion pixel-level localization and interpretable human-AI interaction. We design a novel vision-languageๅๅŒ framework leveraging cross-modal alignment and a fully automated multimodal generation pipeline to construct high-quality instruction-answer pairs. Our model, ROSALIA, achieves significant improvements over prior methods: +8.2% Dice score in lesion segmentation and markedly enhanced accuracy in natural-language response generation. This work establishes the first instruction-driven benchmark for medical image segmentation and advances the development of clinically deployable, user-friendly AI diagnostic systems.

Technology Category

Application Category

๐Ÿ“ Abstract
The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.
Problem

Research questions and friction points this paper is trying to address.

Overcoming limited lesion labels and complex text inputs in chest X-ray segmentation
Enabling lesion segmentation through simple user instructions rather than expert descriptions
Creating large-scale automated dataset for multi-type chest X-ray lesion analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-guided segmentation using simple user inputs
Automated pipeline generates large-scale dataset from reports
Vision-language model segments lesions and provides explanations
๐Ÿ”Ž Similar Papers
No similar papers found.
G
Geon Choi
KAIST
H
Hangyul Yoon
KAIST
H
Hyunju Shin
Samsung Medical Center
S
Sang Hoon Seo
Samsung Medical Center
Eunho Yang
Eunho Yang
KAIST
Machine LearningStatistics
Edward Choi
Edward Choi
KAIST
Machine LearningArtificial IntelligenceHealthcare