Unified Supervision For Vision-Language Modeling in 3D Computed Tomography

πŸ“… 2025-09-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address insufficient diagnostic accuracy in medical CT imaging and the scarcity and heterogeneity of publicly available 3D annotations, this paper proposes Uniferum, a vision-language pretraining framework that integrates multi-source heterogeneous supervision: classification labels, organ-level segmentation masks, and 3D volume-report alignment. Uniferum jointly optimizes multi-label classification, mask-guided segmentation, and cross-modal contrastive alignment to substantially improve generalization. On the CT-RATE benchmark it achieves a 7% AUROC improvement over CLIP-based and conventional multi-label CNN baselines, and it demonstrates strong zero-shot transfer to the RAD-CHEST and INSPECT datasets. By unifying heterogeneous annotations with anatomical structural priors, the work points toward reliable, generalizable zero-shot diagnosis in clinical radiology.
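The joint objective described above can be sketched as a weighted sum of the three supervision signals. The function names, loss weights, and toy NumPy formulation below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def bce_loss(logits, targets):
    # Multi-label classification head: mean binary cross-entropy over labels.
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(np.mean(-(targets * np.log(p + 1e-8)
                           + (1 - targets) * np.log(1 - p + 1e-8))))

def dice_loss(pred, mask, eps=1e-6):
    # Organ-level segmentation supervision: soft Dice between a predicted
    # probability map and a binary mask.
    inter = np.sum(pred * mask)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(mask) + eps))

def info_nce(img_emb, txt_emb, tau=0.07):
    # CLIP-style contrastive alignment: the i-th volume embedding should
    # match the i-th report embedding within the batch.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def unified_loss(cls_logits, cls_targets, seg_pred, seg_mask,
                 img_emb, txt_emb, w_cls=1.0, w_seg=1.0, w_con=1.0):
    # Weighted sum over heterogeneous supervision; a sample from a dataset
    # that lacks one signal would simply contribute a zero term for it.
    return (w_cls * bce_loss(cls_logits, cls_targets)
            + w_seg * dice_loss(seg_pred, seg_mask)
            + w_con * info_nce(img_emb, txt_emb))
```

The per-signal weighting is what lets datasets with distinct annotation formats (labels only, masks only, or paired reports) be pooled into one training run.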

πŸ“ Abstract
General-purpose vision-language models (VLMs) have emerged as promising tools in radiology, offering zero-shot capabilities that mitigate the need for large labeled datasets. However, in high-stakes domains like diagnostic radiology, these models often lack the discriminative precision required for reliable clinical use. This challenge is compounded by the scarcity and heterogeneity of publicly available volumetric CT datasets, which vary widely in annotation formats and granularity. To address these limitations, we introduce Uniferum, a volumetric VLM that unifies diverse supervision signals, encoded in classification labels and segmentation masks, into a single training framework. By harmonizing three public 3D CT datasets with distinct annotations, Uniferum achieves state-of-the-art performance, improving AUROC on the CT-RATE benchmark by 7% compared to CLIP-based and conventional multi-label convolutional models. The model demonstrates robust out-of-distribution generalization, with observed evidence of unexpected zero-shot performance on the RAD-CHEST and INSPECT datasets. Our results highlight the effectiveness of integrating heterogeneous annotations and body segmentation to enhance model performance, setting a new direction for clinically reliable, data-efficient VLMs in 3D medical imaging.
Problem

Research questions and friction points this paper is trying to address.

Lack of discriminative precision in radiology VLMs
Scarcity and heterogeneity of CT datasets
Integrating diverse supervision signals into training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified supervision integrating classification and segmentation
Harmonizing heterogeneous 3D CT datasets with distinct annotations
Volumetric VLM framework achieving robust out-of-distribution generalization
Hao-Chih Lee
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Zelong Liu
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Hamza Ahmed
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Spencer Kim
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Sean Huver
NVIDIA
Vishwesh Nath
NVIDIA
Zahi A. Fayad
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai
Timothy Deyer
East River Medical Imaging
Xueyan Mei
BioMedical Engineering and Imaging Institute, Icahn School of Medicine at Mount Sinai

Medical Image Analysis · Image Processing · Machine Learning