Unified Supervision For Vision-Language Modeling in 3D Computed Tomography

πŸ“… 2025-09-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address insufficient diagnostic accuracy in medical CT imaging and the scarcity and heterogeneity of publicly available 3D annotations, this paper proposes Uniferum, a vision-language pretraining framework that integrates multi-source heterogeneous supervision: classification labels, organ-level segmentation masks, and 3D volume-report alignment. Uniferum jointly optimizes multi-label classification, mask-guided segmentation, and cross-modal contrastive alignment to substantially improve generalization. On the CT-RATE benchmark it achieves a 7% AUROC improvement over CLIP-based and conventional multi-label CNN baselines, and it demonstrates strong zero-shot transfer to the RAD-CHEST and INSPECT datasets. By unifying heterogeneous annotations with anatomical structural priors, the work points toward reliable, generalizable zero-shot diagnosis in clinical radiology.
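The joint objective described above can be sketched as a weighted sum of the three supervision signals. The function names, loss weights, and toy NumPy formulation below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def bce_loss(logits, targets):
    # Multi-label classification head: mean binary cross-entropy over labels.
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(np.mean(-(targets * np.log(p + 1e-8)
                           + (1 - targets) * np.log(1 - p + 1e-8))))

def dice_loss(pred, mask, eps=1e-6):
    # Organ-level segmentation supervision: soft Dice between a predicted
    # probability map and a binary mask.
    inter = np.sum(pred * mask)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(mask) + eps))

def info_nce(img_emb, txt_emb, tau=0.07):
    # CLIP-style contrastive alignment: the i-th volume embedding should
    # match the i-th report embedding within the batch.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def unified_loss(cls_logits, cls_targets, seg_pred, seg_mask,
                 img_emb, txt_emb, w_cls=1.0, w_seg=1.0, w_con=1.0):
    # Weighted sum over heterogeneous supervision; a sample from a dataset
    # that lacks one signal would simply contribute a zero term for it.
    return (w_cls * bce_loss(cls_logits, cls_targets)
            + w_seg * dice_loss(seg_pred, seg_mask)
            + w_con * info_nce(img_emb, txt_emb))
```

The per-signal weighting is what lets datasets with distinct annotation formats (labels only, masks only, or paired reports) be pooled into one training run.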

πŸ“ Abstract
General-purpose vision-language models (VLMs) have emerged as promising tools in radiology, offering zero-shot capabilities that mitigate the need for large labeled datasets. However, in high-stakes domains like diagnostic radiology, these models often lack the discriminative precision required for reliable clinical use. This challenge is compounded by the scarcity and heterogeneity of publicly available volumetric CT datasets, which vary widely in annotation formats and granularity. To address these limitations, we introduce Uniferum, a volumetric VLM that unifies diverse supervision signals, encoded in classification labels and segmentation masks, into a single training framework. By harmonizing three public 3D CT datasets with distinct annotations, Uniferum achieves state-of-the-art performance, improving AUROC on the CT-RATE benchmark by 7% compared to CLIP-based and conventional multi-label convolutional models. The model demonstrates robust out-of-distribution generalization, with observed evidence of unexpected zero-shot performance on the RAD-CHEST and INSPECT datasets. Our results highlight the effectiveness of integrating heterogeneous annotations and body segmentation to enhance model performance, setting a new direction for clinically reliable, data-efficient VLMs in 3D medical imaging.
Problem

Research questions and friction points this paper is trying to address.

Lack of discriminative precision in radiology VLMs
Scarcity and heterogeneity of CT datasets
Integrating diverse supervision signals into training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified supervision integrating classification and segmentation
Harmonizing heterogeneous 3D CT datasets with distinct annotations
Volumetric VLM framework achieving robust out-of-distribution generalization
Hao-Chih Lee
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Zelong Liu
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Hamza Ahmed
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Spencer Kim
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Sean Huver
NVIDIA
Vishwesh Nath
NVIDIA
Zahi A. Fayad
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Timothy Deyer
East River Medical Imaging
Xueyan Mei
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai

Medical Image Analysis · Image Processing · Machine Learning