🤖 AI Summary
This paper addresses zero-shot generalization in free-text-driven 3D medical image segmentation across imaging modalities (CT, MRI, PET), anatomical and pathological categories, and unseen datasets. Methodologically, it introduces a multi-stage vision-language fusion decoder that aligns textual and visual representations at multiple feature scales, coupled with large-scale 3D vision-language pretraining that ties clinical text descriptions to voxel-level imaging features. Key contributions include flexible prompt support, from single anatomical terms to full clinical sentences, and precise 3D instance segmentation of anatomies and lesions. Experiments demonstrate state-of-the-art zero-shot segmentation on unseen datasets, robust cross-modality transfer, resilience to linguistic variation in clinical descriptions, and direct applicability to real-world clinical text inputs.
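As a rough illustration of what "multi-stage vision-language fusion" in a 3D decoder could look like, the PyTorch sketch below implements one decoder stage that upsamples voxel features, merges an encoder skip connection, and cross-attends to embedded text tokens (voxel features as queries, prompt tokens as keys/values). This is not the authors' implementation; all class names, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not VoxTell's actual code): one text-conditioned decoder stage.
import torch
import torch.nn as nn

class TextFusionStage(nn.Module):
    """Upsample, fuse the skip connection, then attend to text tokens."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int, text_dim: int):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Conv3d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        # Cross-attention: voxel features query the embedded prompt tokens.
        self.attn = nn.MultiheadAttention(out_ch, num_heads=4,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x, skip, text_tokens):
        x = self.up(x)                                  # (B, C, D, H, W)
        x = torch.relu(self.conv(torch.cat([x, skip], dim=1)))
        B, C, D, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)                # (B, D*H*W, C)
        fused, _ = self.attn(q, text_tokens, text_tokens)
        q = self.norm(q + fused)                        # residual fusion
        return q.transpose(1, 2).reshape(B, C, D, H, W)

# Toy usage with made-up shapes: one stage of a text-conditioned 3D decoder.
stage = TextFusionStage(in_ch=64, skip_ch=32, out_ch=32, text_dim=256)
x = torch.randn(1, 64, 4, 8, 8)       # coarse decoder features
skip = torch.randn(1, 32, 8, 16, 16)  # encoder skip connection
text = torch.randn(1, 12, 256)        # embedded prompt tokens
out = stage(x, skip, text)            # (1, 32, 8, 16, 16)
```

Applying such a stage at every decoder resolution is one plausible reading of "aligning textual and visual features at multiple scales": each scale gets its own opportunity to condition voxel features on the prompt.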
📝 Abstract
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
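To make the text-prompting workflow concrete, here is a hypothetical inference sketch. `model`, its call signature, and the threshold are illustrative placeholders, not VoxTell's real interface; see the linked repository for the actual API.

```python
# Hypothetical usage sketch of a text-prompted 3D segmenter (placeholder API).
import torch

def segment(model, volume: torch.Tensor, prompt: str) -> torch.Tensor:
    """Return a binary 3D mask for the structure described by `prompt`.

    volume: (1, 1, D, H, W) normalized CT/MRI/PET tensor.
    """
    with torch.no_grad():
        logits = model(volume, prompt)   # text is encoded inside the model
        return logits.sigmoid() > 0.5    # (1, 1, D, H, W) boolean mask

# Prompts may range from single terms to full clinical sentences, e.g.:
#   segment(model, ct, "liver")
#   segment(model, ct, "hypodense lesion in the left hepatic lobe")
```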