Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings

📅 2025-03-05

🤖 AI Summary
In few-shot vision-language alignment, contrastive learning is prone to overfitting and unstable training. This paper proposes a variance-aware loss scheduling method that uses the statistical variability (uncertainty) of the model's alignment predictions to adaptively weight the contrastive loss. Compared with a fixed-weight baseline and with adaptive schedules based on output entropy or cosine-similarity spread, variance-aware scheduling gives the best overall trade-off. Experiments on a Flickr8k subset show improved image-text retrieval accuracy, t-SNE visualizations show more distinct multimodal embeddings, and a stress test with noise-injected captions and images shows higher recall under random perturbations.

📝 Abstract
Training vision-language models for image-text alignment typically requires large datasets to achieve robust performance. In low-data scenarios, standard contrastive learning can struggle to align modalities effectively due to overfitting and unstable training dynamics. In this paper, we propose a variance-aware loss scheduling approach that dynamically adjusts the weighting of the contrastive loss based on the statistical variability (uncertainty) in the model's alignment predictions. Using a subset of the Flickr8k image-caption dataset to simulate limited data conditions, we demonstrate that our approach improves image-text retrieval accuracy compared to a fixed-weight baseline. We also compare against other adaptive weighting strategies (using output entropy and cosine similarity spread) and find that variance-aware scheduling provides the best overall trade-off. Qualitatively, our method yields more distinct multimodal embeddings as shown by t-SNE visualizations. Moreover, in a stress test with noise-injected captions and images, the variance-guided loss proves more robust, maintaining higher recall when random perturbations are introduced. These results highlight the benefit of adaptive loss weighting for multimodal alignment in low-data regimes.
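The abstract describes the scheduling idea only in words, so the following is a minimal sketch of what variance-based weighting of a contrastive loss could look like, not the authors' implementation. The function names, the temperature of 0.07, the choice of measuring variance over matched-pair similarities, and the rule `base_weight * (1 + alpha * variance)` (including its direction) are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is a matched pair."""
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i), sim

def variance_aware_weight(sim, base_weight=1.0, alpha=1.0):
    """ASSUMED rule (not from the paper): the weight grows with the variance
    of the matched-pair similarities found on the diagonal of `sim`."""
    matched = np.diag(sim)
    return base_weight * (1.0 + alpha * matched.var())

# Toy batch: 8 "images" and noisy copies standing in for their captions.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 16))
txt_emb = img_emb + 0.5 * rng.normal(size=(8, 16))

loss, sim = contrastive_loss(img_emb, txt_emb)
weight = variance_aware_weight(sim)
weighted_loss = weight * loss
print(f"loss={loss:.4f} weight={weight:.4f} weighted={weighted_loss:.4f}")
```

In a training loop, `weight` would be recomputed each step or epoch and would multiply the contrastive term before backpropagation. Whether higher variance should raise or lower the weight is a design choice the abstract does not settle, so the sign of `alpha` here is a placeholder.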

Problem

Research questions and friction points this paper is trying to address.

Standard contrastive learning overfits and trains unstably when image-text data is scarce
A fixed contrastive loss weight does not adapt to variability in the model's alignment predictions
Retrieval recall degrades when captions and images are perturbed with noise
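The abstract measures noise robustness through retrieval recall. As a rough illustration of that kind of stress test, the sketch below computes Recall@K for image-to-text retrieval and re-evaluates it as noise increases. It is an assumption-laden toy: it perturbs embeddings with Gaussian noise rather than the raw captions and images, and the helper names, noise levels, and embedding sizes are hypothetical.

```python
import numpy as np

def recall_at_k(img_emb, txt_emb, k=1):
    """Fraction of images whose matched caption (same row index) is among
    the k most cosine-similar captions."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T
    top_k = np.argsort(-sim, axis=1)[:, :k]      # best k captions per image
    targets = np.arange(sim.shape[0])[:, None]   # correct caption index
    return float((top_k == targets).any(axis=1).mean())

def add_gaussian_noise(emb, sigma, rng):
    """Stand-in perturbation: additive Gaussian noise on the embeddings."""
    return emb + sigma * rng.normal(size=emb.shape)

# Toy data: perfectly aligned embeddings, then increasingly noisy captions.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(50, 32))
txt_emb = img_emb.copy()

for sigma in (0.0, 0.5, 1.0, 2.0):
    noisy_txt = add_gaussian_noise(txt_emb, sigma, rng)
    r1 = recall_at_k(img_emb, noisy_txt, k=1)
    r5 = recall_at_k(img_emb, noisy_txt, k=5)
    print(f"sigma={sigma:.1f} R@1={r1:.2f} R@5={r5:.2f}")
```

Comparing two training schemes under this protocol would mean running the same loop on embeddings from each trained model and checking which recall curve falls more slowly as `sigma` grows.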

Innovation

Methods, ideas, or system contributions that make the work stand out.

Variance-aware loss scheduling for alignment
Dynamic contrastive loss weighting adjustment
Improved robustness in low-data scenarios