CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

📅 2026-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing fine-grained vision-language pretraining methods for 3D medical image understanding often suffer from textual embedding collapse, which makes them highly sensitive to prompt variations and limits their clinical reliability. This work proposes a cross-anatomy global-local contrastive learning framework. A global contrastive objective separates distinct anatomical categories in the latent space, counteracting the clustering induced by local alignment, and a clinically informed text augmentation strategy based on permutation invariance and partial completeness improves robustness to incomplete descriptions. On the CT-RATE and Rad-ChestCT benchmarks, the method consistently outperforms existing vision-language pretraining approaches in zero-shot abnormality detection, generalizes well across datasets, and shows reduced performance variance across prompt templates.
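To make the two augmentation ideas concrete, the following is a minimal sketch of what permutation invariance and partial completeness could look like for a list of anatomy-level finding sentences. It is not the authors' implementation: the function name `augment_findings`, the `keep_ratio` parameter, and the example sentences are all hypothetical, and the paper may select or order sentences differently.

```python
import random


def augment_findings(sentences, keep_ratio=0.7, rng=None):
    """Return a shuffled subset of finding sentences (illustrative only).

    Assumptions (not from the paper): findings arrive as a list of
    independent sentences, and a fixed fraction of them is kept.
    """
    rng = rng or random.Random()
    if not sentences:
        return []
    shuffled = list(sentences)  # copy so the caller's list is untouched
    # Permutation invariance: the order of findings carries no meaning.
    rng.shuffle(shuffled)
    # Partial completeness: a subset of findings is still a valid description.
    n_keep = max(1, round(keep_ratio * len(shuffled)))
    return shuffled[:n_keep]


# Hypothetical example sentences for one anatomical region.
findings = [
    "No pleural effusion.",
    "Mild bronchial wall thickening.",
    "A small nodule in the right upper lobe.",
    "No consolidation.",
]
augmented = augment_findings(findings, keep_ratio=0.5, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which is convenient when comparing prompt-template variance across runs.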
📝 Abstract
Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.
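The abstract does not give the exact form of the global contrastive objective, so the sketch below uses a standard InfoNCE-style loss as a stand-in: each anatomy's visual embedding is pulled toward its own text embedding, while the text embeddings of the other anatomies act as negatives. The function name `cross_anatomy_info_nce`, the temperature value, and the toy vectors are assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_anatomy_info_nce(visual, text, temperature=0.07):
    """Mean InfoNCE loss over anatomies (illustrative stand-in).

    Row i of `visual` is assumed to match row i of `text`; every other
    row of `text` is treated as a negative from a different anatomy.
    """
    visual = [_normalize(v) for v in visual]
    text = [_normalize(t) for t in text]
    losses = []
    for i, v in enumerate(visual):
        # Cosine similarity of anatomy i's visual embedding to every text embedding.
        logits = [sum(a * b for a, b in zip(v, t)) / temperature for t in text]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for three anatomies (hypothetical values).
visual_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
separated_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed_text = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

loss_separated = cross_anatomy_info_nce(visual_emb, separated_text)
loss_collapsed = cross_anatomy_info_nce(visual_emb, collapsed_text)
```

When all text embeddings coincide, every anatomy is equally likely and the loss equals the log of the number of anatomies, so minimizing a loss of this kind penalizes the collapsed configuration the abstract describes.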
Problem

Research questions and friction points this paper is trying to address.

representation collapse
textual embedding
anatomical structures
prompt sensitivity
3D medical image understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Anatomy Contrastive Learning
Global-Local Alignment
Text Embedding Collapse
Clinical-Aware Text Augmentation
3D Medical Vision-Language Pre-training