🤖 AI Summary
Existing medical foundation models typically rely on task-specific pretraining or resource-intensive fine-tuning, limiting their generalizability and plug-and-play applicability. To address this, we propose the first task-agnostic, general-purpose foundation model for 3D CT volumetric data. Our method adapts the ViT and DINOv2 architectures with depth-aware 3D patch embedding, voxel-level positional encoding, and a contrastive objective, enabling end-to-end self-supervised representation learning. Trained on 105,000 CT volumes, the model yields robust frozen feature representations and achieves state-of-the-art performance across diverse downstream tasks, including classification, segmentation, and detection, with only lightweight fine-tuning, significantly outperforming prior approaches. Crucially, the model and benchmark code are fully open-sourced, facilitating reproducibility and community advancement.
📝 Abstract
Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or rely on training resource-intensive decoders, while many existing encoders are pretrained with objectives biased toward specific tasks. This highlights the need for a strong, task-agnostic foundation model that requires minimal fine-tuning beyond feature extraction. In this work, we introduce a suite of task-agnostically pretrained CT foundation models (TAP-CT): a simple yet effective adaptation of Vision Transformers (ViTs) and DINOv2 to volumetric data, enabling scalable self-supervised pretraining directly on 3D CT volumes. Our approach incorporates targeted modifications to patch embeddings, positional encodings, and volumetric augmentations, making the model depth-aware while preserving the simplicity of the underlying architectures. We show that large-scale 3D pretraining on an extensive in-house CT dataset (105K volumes) yields stable, robust frozen representations that generalize strongly across downstream tasks. To promote transparency and reproducibility, and to establish a powerful, low-resource baseline for future research in medical imaging, we will release all pretrained models, experimental configurations, and downstream benchmark code at https://huggingface.co/fomofo/tap-ct-b-3d.
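The abstract does not spell out how the 2D ViT pipeline is extended to volumes, so the sketch below is only an illustration of the general idea: a depth-aware patch embedding splits a CT volume into non-overlapping 3D patches instead of 2D tiles, and each resulting token receives a positional encoding. The patch size `(4, 16, 16)`, the sinusoidal encoding over the flattened patch index, and all function names here are assumptions for illustration, not the paper's actual implementation (which would typically use a learned `Conv3d` projection).

```python
import numpy as np

def patchify_3d(volume, patch=(4, 16, 16)):
    """Split a CT volume of shape (D, H, W) into flattened,
    non-overlapping 3D patches (the depth-aware analogue of
    a ViT's 2D patch embedding input). Sizes are assumed."""
    D, H, W = volume.shape
    pd, ph, pw = patch
    assert D % pd == 0 and H % ph == 0 and W % pw == 0
    x = volume.reshape(D // pd, pd, H // ph, ph, W // pw, pw)
    x = x.transpose(0, 2, 4, 1, 3, 5)   # (nd, nh, nw, pd, ph, pw)
    return x.reshape(-1, pd * ph * pw)  # (num_patches, voxels_per_patch)

def positional_encoding(num_tokens, dim):
    """Fixed sinusoidal encoding over the flattened 3D patch index;
    a simplistic stand-in for the paper's positional scheme."""
    pos = np.arange(num_tokens)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((num_tokens, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy CT volume: 32 slices of 64x64 voxels.
vol = np.random.randn(32, 64, 64).astype(np.float32)
tokens = patchify_3d(vol)                              # (128, 1024)
embedded = tokens + positional_encoding(*tokens.shape)  # ready for a ViT encoder
```

In a real model the flattened patches would be linearly projected to the transformer's hidden dimension before adding positional information; the point here is only that depth becomes a first-class patch axis rather than a channel.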