🤖 AI Summary
Existing medical foundation models typically rely on task-specific pretraining or resource-intensive fine-tuning, limiting their generalizability and plug-and-play applicability. To address this, we propose the first task-agnostic, general-purpose foundation model for 3D CT volumetric data. Our method adapts the ViT and DINOv2 architectures with depth-aware 3D patch embedding, voxel-level positional encoding, and a contrastive objective, enabling end-to-end self-supervised representation learning. Trained on 105,000 CT volumes, the model yields robust frozen feature representations and achieves state-of-the-art performance across diverse downstream tasks, including classification, segmentation, and detection, with only lightweight fine-tuning, significantly outperforming prior approaches. Crucially, the model and benchmark code are fully open-sourced, facilitating reproducibility and community advancement.
📝 Abstract
Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or rely on training resource-intensive decoders, while many existing encoders are pretrained with objectives biased toward specific tasks. This highlights the need for a strong, task-agnostic foundation model that requires minimal fine-tuning beyond feature extraction. In this work, we introduce a suite of task-agnostically pretrained CT foundation models (TAP-CT): a simple yet effective adaptation of Vision Transformers (ViTs) and DINOv2 to volumetric data, enabling scalable self-supervised pretraining directly on 3D CT volumes. Our approach incorporates targeted modifications to patch embeddings, positional encodings, and volumetric augmentations, making the model depth-aware while preserving the simplicity of the underlying architectures. We show that large-scale 3D pretraining on an extensive in-house CT dataset (105K volumes) yields stable, robust frozen representations that generalize strongly across downstream tasks. To promote transparency and reproducibility, and to establish a powerful, low-resource baseline for future research in medical imaging, we will release all pretrained models, experimental configurations, and downstream benchmark code at https://huggingface.co/fomofo/tap-ct-b-3d.
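The abstract does not spell out how the 2D ViT pipeline is extended to volumes, so the sketch below is only an illustration of the general idea: a depth-aware patch embedding splits a CT volume into non-overlapping 3D patches instead of 2D tiles, and each resulting token receives a positional encoding. The patch size `(4, 16, 16)`, the sinusoidal encoding over the flattened patch index, and all function names here are assumptions for illustration, not the paper's actual implementation (which would typically use a learned `Conv3d` projection).

```python
import numpy as np

def patchify_3d(volume, patch=(4, 16, 16)):
    """Split a CT volume of shape (D, H, W) into flattened,
    non-overlapping 3D patches (the depth-aware analogue of
    a ViT's 2D patch embedding input). Sizes are assumed."""
    D, H, W = volume.shape
    pd, ph, pw = patch
    assert D % pd == 0 and H % ph == 0 and W % pw == 0
    x = volume.reshape(D // pd, pd, H // ph, ph, W // pw, pw)
    x = x.transpose(0, 2, 4, 1, 3, 5)   # (nd, nh, nw, pd, ph, pw)
    return x.reshape(-1, pd * ph * pw)  # (num_patches, voxels_per_patch)

def positional_encoding(num_tokens, dim):
    """Fixed sinusoidal encoding over the flattened 3D patch index;
    a simplistic stand-in for the paper's positional scheme."""
    pos = np.arange(num_tokens)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((num_tokens, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy CT volume: 32 slices of 64x64 voxels.
vol = np.random.randn(32, 64, 64).astype(np.float32)
tokens = patchify_3d(vol)                              # (128, 1024)
embedded = tokens + positional_encoding(*tokens.shape)  # ready for a ViT encoder
```

In a real model the flattened patches would be linearly projected to the transformer's hidden dimension before adding positional information; the point here is only that depth becomes a first-class patch axis rather than a channel.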