Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes NOVA, a non-contrastive learning framework for vision–language alignment that removes the large batch sizes, negative sampling, momentum encoders, and stop-gradient operations on which existing contrastive methods depend, requirements that hurt training efficiency and stability. NOVA directly predicts the embeddings of a frozen ClinicalBERT text encoder from augmented image views, substantially simplifying the training pipeline. To regularize the learned representation distribution, the method introduces Sketched Isotropic Gaussian Regularization (SIGReg), which adds only a single hyperparameter. Combined with a Vision Transformer trained from scratch on MIMIC-CXR, NOVA outperforms standard contrastive baselines on three zero-shot chest X-ray classification benchmarks while exhibiting markedly more stable training.
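Based only on the summary above, a minimal sketch of the predictive alignment step might look like the following, assuming a PyTorch setup in which a trainable ViT plus a small predictor head regress the frozen ClinicalBERT report embedding from each augmented view. The cosine-based loss and every name here (`predictive_alignment_loss`, `predictor`) are illustrative assumptions, not the paper's code.

```python
import torch.nn.functional as F

def predictive_alignment_loss(image_encoder, predictor, text_emb, views):
    """Align augmented image views to a frozen text embedding.

    image_encoder: trainable ViT mapping images -> features      (assumed)
    predictor:     head mapping features -> text embedding space (assumed)
    text_emb:      (B, D) frozen ClinicalBERT report embeddings
    views:         list of (B, C, H, W) augmented views of the same batch
    """
    text_emb = text_emb.detach()  # target comes from a frozen encoder
    loss = 0.0
    for v in views:
        pred = predictor(image_encoder(v))  # (B, D)
        # Negative cosine similarity: no negatives, no momentum encoder.
        loss = loss + (1.0 - F.cosine_similarity(pred, text_emb, dim=-1)).mean()
    return loss / len(views)
```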

📝 Abstract
Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
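The abstract names SIGReg but not its exact test statistic, so the sketch below is a hedged stand-in: it projects embeddings onto random unit directions (the "sketch") and matches the first three moments of each 1D projection to those of a standard normal, exploiting the fact that every 1D projection of an isotropic Gaussian is N(0, 1). The moment-matching penalty and the `num_directions` parameter are assumptions, not the paper's formulation.

```python
import torch

def sigreg(z, num_directions=64):
    """Penalize deviation of sketched 1D projections from N(0, 1).

    z: (B, D) batch of image-side embeddings.
    num_directions: number of random unit directions (the sketch size).
    """
    _, dim = z.shape
    dirs = torch.randn(dim, num_directions, device=z.device)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)  # unit-norm directions
    p = z @ dirs                                  # (B, num_directions)
    # Match mean, variance, and third moment of each projection to N(0, 1).
    m1 = p.mean(dim=0)                  # target 0
    m2 = p.var(dim=0, unbiased=False)   # target 1
    m3 = (p ** 3).mean(dim=0)           # target 0 (no skew)
    return (m1 ** 2 + (m2 - 1.0) ** 2 + m3 ** 2).mean()
```

Under these assumptions the full objective would be `predictive_alignment_loss(...) + lam * sigreg(z)`, with `lam` playing the role of the single hyperparameter the abstract refers to.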
Problem

Research questions and friction points this paper is trying to address.

contrastive learning
vision-language models
negative sampling
hyperparameter tuning
training stability
Innovation

Methods, ideas, or system contributions that make the work stand out; a zero-shot scoring sketch follows the tag list below.

non-contrastive learning
vision-language alignment
predictive embedding
isotropic Gaussian regularization
zero-shot classification
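As referenced above, a plausible zero-shot scoring sketch follows; `text_encoder` is a hypothetical wrapper that tokenizes class prompts and runs the frozen ClinicalBERT, and the prompt wording is illustrative rather than the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(image_encoder, predictor, text_encoder, images, prompts):
    """Return the index of the best-matching class prompt for each image."""
    class_emb = text_encoder(prompts)           # (K, D), frozen text side
    img_emb = predictor(image_encoder(images))  # (B, D), trained image side
    sims = F.cosine_similarity(
        img_emb.unsqueeze(1), class_emb.unsqueeze(0), dim=-1
    )                                           # (B, K) similarity matrix
    return sims.argmax(dim=1)

# e.g. prompts = ["no finding", "cardiomegaly", "pleural effusion"]
```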
Lukas Kuhn
Researcher, DKFZ
Machine Learning, Neuroscience

Giuseppe Serra
Goethe University Frankfurt, Frankfurt, Germany; German Cancer Research Center (DKFZ), Heidelberg, Germany; German Cancer Consortium (DKTK), Frankfurt, Germany

Florian Buettner
Frankfurt University / DKFZ