🤖 AI Summary
To address the scarcity of large-scale, volume-level paired multimodal datasets comprising MRI volumes and corresponding clinical text annotations—limiting vision-language co-modeling in brain tumor analysis—this work introduces TextBraTS, the first publicly available, large-scale, volume-aligned multimodal brain tumor dataset. We further propose a text-guided 3D segmentation framework featuring a novel serialized cross-modal cross-attention mechanism that enables fine-grained alignment between BERT-encoded clinical text and hierarchical 3D U-Net features. Systematic evaluation of templated prompting strategies and multiple fusion schemes demonstrates consistent performance gains, yielding significant Dice score improvements of +2.1%–3.7% for whole tumor, tumor core, and enhancing tumor segmentation. The dataset, source code, and pre-trained models are fully open-sourced, establishing a foundational infrastructure and methodological paradigm for medical multimodal research.
📝 Abstract
Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field of brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. This limitation has hindered the exploration of multimodal approaches that leverage both imaging and textual data. To bridge this critical gap, we introduce the TextBraTS dataset, the first publicly available volume-level multimodal dataset that contains paired MRI volumes and rich textual annotations, derived from the widely adopted BraTS2020 benchmark. Building upon this novel dataset, we propose a novel baseline framework and sequential cross-attention method for text-guided volumetric medical image segmentation. Through extensive experiments with various text-image fusion strategies and templated text formulations, our approach demonstrates significant improvements in brain tumor segmentation accuracy, offering valuable insights into effective multimodal integration techniques. Our dataset, implementation code, and pre-trained models are publicly available at https://github.com/Jupitern52/TextBraTS.