🤖 AI Summary
This paper addresses zero-shot generalization in free-text-driven 3D medical image segmentation across imaging modalities (CT, MRI, PET), anatomical and pathological categories, and unseen datasets. Methodologically, it introduces a multi-stage vision-language fusion decoder that aligns textual and visual representations at multiple feature scales, coupled with large-scale 3D vision-language pretraining that ties clinical text descriptions to voxel-level imaging features. Key contributions include flexible prompt support, from single anatomical terms to full clinical sentences, and precise 3D instance segmentation of anatomies and lesions. Experiments demonstrate state-of-the-art zero-shot segmentation on unseen datasets, robust cross-modality transfer, resilience to linguistic variation in clinical descriptions, and direct applicability to real-world clinical text inputs.
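As a rough illustration of what "multi-stage vision-language fusion" in a 3D decoder could look like, the PyTorch sketch below implements one decoder stage that upsamples voxel features, merges an encoder skip connection, and cross-attends to embedded text tokens (voxel features as queries, prompt tokens as keys/values). This is not the authors' implementation; all class names, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not VoxTell's actual code): one text-conditioned decoder stage.
import torch
import torch.nn as nn

class TextFusionStage(nn.Module):
    """Upsample, fuse the skip connection, then attend to text tokens."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int, text_dim: int):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        # Cross-attention: voxel features query the embedded prompt tokens.
        self.attn = nn.MultiheadAttention(out_ch, num_heads=4,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x, skip, text_tokens):
        x = self.up(x)                                  # (B, C, D, H, W)
        x = torch.relu(self.conv(torch.cat([x, skip], dim=1)))
        B, C, D, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)                # (B, D*H*W, C)
        fused, _ = self.attn(q, text_tokens, text_tokens)
        q = self.norm(q + fused)                        # residual fusion
        return q.transpose(1, 2).reshape(B, C, D, H, W)

# Toy usage with made-up shapes: one stage of a text-conditioned 3D decoder.
stage = TextFusionStage(in_ch=64, skip_ch=32, out_ch=32, text_dim=256)
x = torch.randn(1, 64, 4, 8, 8)       # coarse decoder features
skip = torch.randn(1, 32, 8, 16, 16)  # encoder skip connection
text = torch.randn(1, 12, 256)        # embedded prompt tokens
out = stage(x, skip, text)            # (1, 32, 8, 16, 16)
```

Applying such a stage at every decoder resolution is one plausible reading of "aligning textual and visual features at multiple scales": each scale gets its own opportunity to condition voxel features on the prompt.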
📝 Abstract
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
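To make the text-prompting workflow concrete, here is a hypothetical inference sketch. `model`, its call signature, and the threshold are illustrative placeholders, not VoxTell's real interface; see the linked repository for the actual API.

```python
# Hypothetical usage sketch of a text-prompted 3D segmenter (placeholder API).
import torch

def segment(model, volume: torch.Tensor, prompt: str) -> torch.Tensor:
    """Return a binary 3D mask for the structure described by `prompt`.

    volume: (1, 1, D, H, W) normalized CT/MRI/PET tensor.
    """
    with torch.no_grad():
        logits = model(volume, prompt)   # text is encoded inside the model
        return logits.sigmoid() > 0.5    # (1, 1, D, H, W) boolean mask

# Prompts may range from single terms to full clinical sentences, e.g.:
#   segment(model, ct, "liver")
#   segment(model, ct, "hypodense lesion in the left hepatic lobe")
```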