🤖 AI Summary
Problem: 3D medical image classification suffers from data-scarcity-induced distributional bias, low adaptation efficiency, and narrow task coverage. Method: We propose AnyMC3D, a unified framework that freezes a general-purpose 2D vision foundation model (e.g., a ViT) as a shared backbone and adds lightweight task-specific adapters (~1M parameters each), enabling multi-view fusion, pixel-level auxiliary supervision, and Grad-CAM-based explainable heatmap generation. Contribution/Results: We demonstrate, both theoretically and empirically and for the first time, that carefully adapted 2D foundation models can outperform dedicated 3D architectures. We introduce the first comprehensive 3D classification benchmark spanning 12 diverse tasks across pathologies, anatomies, and imaging modalities. AnyMC3D establishes a scalable, unified multi-task paradigm that eliminates the "one-task-one-model" approach: it achieves state-of-the-art performance on all 12 tasks, wins 1st place in the VLM3D challenge, accelerates inference by 3.2×, and reduces parameter count by 98%.
📝 Abstract
3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.
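The core design described above, a single frozen 2D backbone shared across tasks with a small trainable plugin per task, can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the tiny convolutional "backbone", the adapter layer sizes, and the 2-class head are placeholder assumptions standing in for a frozen 2D vision foundation model and the paper's ~1M-parameter adapters.

```python
import torch
import torch.nn as nn

# Placeholder "backbone": stands in for a frozen 2D vision foundation
# model (e.g., a ViT). Layer sizes are illustrative only.
backbone = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False  # backbone stays frozen and is shared by all tasks

# Lightweight task-specific adapter: the only trainable part per task.
# (The paper's adapters are ~1M parameters; this toy head is far smaller.)
adapter = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

# A batch of 2D views (e.g., slices) taken from a 3D volume.
x = torch.randn(4, 1, 32, 32)
logits = adapter(backbone(x))  # shape: (4, 2)

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters())
```

Scaling to a new task then means training only a new `adapter` against the same frozen `backbone`, which is what keeps the per-task cost at roughly 1M parameters instead of a full model.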