VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

📅 2025-11-14
🤖 AI Summary
This paper addresses free-text-driven 3D medical image segmentation and its zero-shot generalization across modalities (CT, MRI, PET), anatomical and pathological classes, and unseen datasets. Methodologically, it introduces multi-stage vision-language fusion across decoder layers, which aligns textual and visual features at multiple scales, and trains the model on more than 62K volumes spanning over 1K anatomical and pathological classes. Its key contributions include flexible prompt support, from single words to full clinical sentences, and accurate instance-specific segmentation from real-world text. Experiments show state-of-the-art zero-shot performance on unseen datasets, strong cross-modality transfer, and robustness to linguistic variation and clinical language.

📝 Abstract
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
Problem

Research questions and friction points this paper is trying to address.

Segmenting 3D medical images using free-text descriptions as prompts
Achieving zero-shot generalization across CT, MRI, and PET modalities
Aligning clinical language with visual features for accurate volumetric segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language model for medical image segmentation
Multi-stage fusion aligns text and visual features
Zero-shot generalization across modalities and classes
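
To make the idea of multi-stage fusion more concrete, the following is a minimal NumPy sketch, not the VoxTell implementation. It assumes a single text embedding, a list of decoder feature maps ordered from coarse to fine, and one linear projection per scale; all names, shapes, and the choice of a summed dot-product similarity are hypothetical and chosen only for illustration.

```python
import numpy as np

def text_conditioned_logits(text_emb, feature_map, proj):
    """Per-voxel similarity between a projected text embedding and one feature map.

    Hypothetical shapes: text_emb (T,), feature_map (C, D, H, W), proj (C, T).
    """
    query = proj @ text_emb  # project text into this scale's channel space -> (C,)
    return np.einsum("c,cdhw->dhw", query, feature_map)

def upsample_nearest(volume, factor):
    """Nearest-neighbour upsampling of a 3D volume by an integer factor."""
    for axis in range(3):
        volume = np.repeat(volume, factor, axis=axis)
    return volume

def multi_scale_fusion(text_emb, feature_maps, projections):
    """Sum text-conditioned logits from every scale at the finest resolution.

    Assumes each coarser map is an integer factor smaller than the finest one,
    with the same factor along all three spatial axes.
    """
    finest_shape = feature_maps[-1].shape[1:]
    fused = np.zeros(finest_shape)
    for fmap, proj in zip(feature_maps, projections):
        logits = text_conditioned_logits(text_emb, fmap, proj)
        factor = finest_shape[0] // logits.shape[0]
        fused += upsample_nearest(logits, factor)
    return fused

# Toy inputs standing in for a text encoder output and decoder features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=16)
feature_maps = [
    rng.normal(size=(32, 4, 4, 4)),    # coarse scale
    rng.normal(size=(16, 8, 8, 8)),    # middle scale
    rng.normal(size=(8, 16, 16, 16)),  # fine scale
]
projections = [rng.normal(size=(fm.shape[0], 16)) for fm in feature_maps]

fused = multi_scale_fusion(text_emb, feature_maps, projections)
mask = fused > 0  # equivalent to thresholding sigmoid(fused) at 0.5
print(fused.shape, mask.dtype)
```

In this sketch the text prompt influences the prediction at every scale rather than only at the final layer, which is the general intuition behind fusing language and vision across decoder stages; the actual fusion mechanism used in the paper may differ.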
Authors
Maximilian Rokuss
German Cancer Research Center (DKFZ), University of Heidelberg
Computer Vision, Deep Learning, Medical Image Computing
Moritz Langenberg
German Cancer Research Center, Division of Medical Image Computing, Germany; Faculty of Mathematics and Computer Science and Medical Faculty - Heidelberg University; HIDSS4Health, Heidelberg
Yannick Kirchhoff
PhD Student, DKFZ
Computer Vision, Deep Learning, Medical Image Computing
Fabian Isensee
HIP Applied Computer Vision Lab, Division of Medical Image Computing, German Cancer Research Center
Computer Vision, Deep Learning, Segmentation, Medical Image Computing
Benjamin Hamm
PhD Student @ German Cancer Research Center (DKFZ)
Computer Vision, Deep Learning, Security, Medical Imaging
Constantin Ulrich
German Cancer Research Center (DKFZ)
Medical Image Computing, Medical physics, Computer Vision
S. Regnery
Department of Radiation Oncology, Heidelberg University Hospital, Germany
L. Bauer
Department of Radiation Oncology, Heidelberg University Hospital, Germany
E. Katsigiannopulos
Department of Radiation Oncology, Heidelberg University Hospital, Germany
T. Norajitra
German Cancer Research Center, Division of Medical Image Computing, Germany; Pattern Analysis and Learning Group, Heidelberg University Hospital
K. Maier-Hein
German Cancer Research Center, Division of Medical Image Computing, Germany; Faculty of Mathematics and Computer Science and Medical Faculty - Heidelberg University; Helmholtz Imaging; HIDSS4Health, Heidelberg; Pattern Analysis and Learning Group, Heidelberg University Hospital