MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in multi-modal contrastive learning under long-tailed data distributions: preserving semantic structure in imbalanced data and controlling the attraction and repulsion forces between positive and negative pairs. The authors propose MM-TS, which dynamically schedules the temperature of the contrastive loss during training and adapts it to each sample's local distribution, assigning a higher temperature to samples from dense clusters. Temperature scheduling is also integrated into a max-margin framework, unifying the InfoNCE loss and the max-margin objective. The method is evaluated on four image- and video-language benchmarks (Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2), where the authors report improved performance and new state-of-the-art results.
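
For reference, the two objectives named in the summary are commonly written as below. These are the standard textbook forms for one retrieval direction (for example image-to-text), not necessarily the exact formulation used in the paper. The notation is an assumption chosen here: $s_{ij}$ is the similarity between sample $i$ in one modality and sample $j$ in the other, $\tau$ is the temperature, $m$ is the margin, and $N$ is the batch size.

```latex
\mathcal{L}_{\text{InfoNCE}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}
\qquad
\mathcal{L}_{\text{margin}}
  = \frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq i}
    \max\bigl(0,\; m + s_{ij} - s_{ii}\bigr)
```

A smaller $\tau$ sharpens the softmax and concentrates the repulsion on the hardest negatives, while a larger $m$ demands a wider gap between the positive and every negative. How MM-TS schedules and combines the two is not specified on this page.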

📝 Abstract
Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules (MM-TS), extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.
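
The ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The cosine schedule, the nearest-neighbour density proxy, the linear temperature rule, the margin placed on the positive logit, and all function names and default values are assumptions made for illustration only.

```python
import math


def cosine_temperature(step, period, tau_min=0.05, tau_max=0.5):
    """Cosine schedule: tau_max at step 0, tau_min at half a period.

    The schedule shape and the default bounds are assumptions.
    """
    phase = 2.0 * math.pi * step / period
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 + math.cos(phase))


def local_density(intra_sim_row, self_index, k=2):
    """Hypothetical density proxy: mean similarity to the k most similar other samples."""
    others = [s for j, s in enumerate(intra_sim_row) if j != self_index]
    top_k = sorted(others, reverse=True)[:k]
    return sum(top_k) / len(top_k)


def per_sample_temperature(base_tau, density, scale=0.5):
    """Raise the temperature for samples in dense regions (assumed linear rule)."""
    return base_tau * (1.0 + scale * max(density, 0.0))


def info_nce(sim_row, pos_index, tau, margin=0.0):
    """InfoNCE loss for one anchor, with an optional margin on the positive logit."""
    logits = [(s - margin if j == pos_index else s) / tau
              for j, s in enumerate(sim_row)]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[pos_index]


# Toy batch of 3 pairs: cross-modal similarities (rows: images, columns: captions)
cross_sim = [[0.9, 0.2, 0.1],
             [0.3, 0.8, 0.7],
             [0.2, 0.6, 0.85]]
# Toy intra-modal similarities; samples 1 and 2 lie close together (a dense region)
intra_sim = [[1.0, 0.1, 0.0],
             [0.1, 1.0, 0.9],
             [0.0, 0.9, 1.0]]

base_tau = cosine_temperature(step=25, period=100)
losses = []
for i, row in enumerate(cross_sim):
    density = local_density(intra_sim[i], i, k=1)
    tau_i = per_sample_temperature(base_tau, density)
    losses.append(info_nce(row, i, tau_i, margin=0.1))
print(sum(losses) / len(losses))
```

In this sketch, samples 1 and 2 receive a higher temperature than sample 0 because they have a close intra-modal neighbour, which softens the repulsion between them. Setting `margin=0.0` recovers plain temperature-scaled InfoNCE.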
Problem

Research questions and friction points this paper is trying to address.

contrastive learning
long-tail data
multi-modal learning
temperature scheduling
class imbalance
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal contrastive learning
temperature scheduling
long-tail distribution
max-margin framework
dynamic temperature adjustment