🤖 AI Summary
This work addresses a critical vulnerability in multimodal contrastive learning when extending beyond image-text pairs: the multiplicative interaction mechanism does not account for reliability differences between modalities, making it highly sensitive to unreliable or missing inputs and prone to subtle performance degradation. To mitigate this, the authors propose Gated Symile, which introduces an attention-based gating mechanism that dynamically modulates each modality's contribution and suppresses unreliable inputs at the candidate level. They further incorporate learnable neutral embedding directions and an explicit NULL option to handle weakly aligned or partially missing modalities. Extensive experiments show that Gated Symile consistently outperforms carefully tuned Symile and CLIP baselines on a synthetic benchmark and three real-world trimodal datasets, demonstrating robust contrastive learning beyond two modalities.
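The fragility described above can be illustrated with a minimal sketch, assuming a Symile-style multilinear inner product as the trimodal score (the function names and the toy data are illustrative, not the paper's implementation):

```python
import numpy as np

def trilinear_score(a, b, c):
    # Multilinear inner product: sum_d a_d * b_d * c_d.
    # Every modality's vector multiplies into every term, so a single
    # unreliable modality perturbs the entire score.
    return float(np.sum(a * b * c, axis=-1))

def pairwise_score(a, b):
    # CLIP-style pairwise inner product over two modalities only.
    return float(np.sum(a * b, axis=-1))

rng = np.random.default_rng(0)
d = 8
a = rng.normal(size=d)
b, c = a.copy(), a.copy()      # a well-aligned triple
noise = rng.normal(size=d)     # an unreliable third modality

aligned = trilinear_score(a, b, c)
corrupted = trilinear_score(a, b, noise)
# pairwise_score(a, b) is unaffected by the third modality,
# while the trilinear score is modulated by all three vectors.
```

This is why a pairwise CLIP baseline can be insensitive to a corrupted third modality while a multiplicative objective is not.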
📝 Abstract
Multimodal contrastive learning is increasingly enriched by going beyond image-text pairs. Among recent contrastive methods, Symile is a strong approach to this challenge because its multiplicative interaction objective captures higher-order cross-modal dependence. Yet we find that Symile treats all modalities symmetrically and does not explicitly model reliability differences, a limitation that becomes especially pronounced in trimodal multiplicative interactions. In practice, modalities beyond image-text pairs can be misaligned, weakly informative, or missing, and treating them uniformly can silently degrade performance. This fragility can hide inside the multiplicative interaction: Symile may still outperform pairwise CLIP even while a single unreliable modality silently corrupts the product terms. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and by adding an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that exposes this fragility and three real-world trimodal datasets where such failures can be masked by aggregate metrics, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning with imperfect inputs and more than two modalities.
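The two gating ingredients in the abstract, interpolation toward a learnable neutral direction and an explicit NULL option, can be sketched as follows. The scalar sigmoid gate, the function names, and the learned NULL logit are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def gated_embedding(z, neutral, gate_logit):
    # Sigmoid gate in [0, 1]: g -> 1 keeps the modality embedding,
    # g -> 0 collapses it onto the learnable neutral direction,
    # effectively suppressing an unreliable input.
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return g * z + (1.0 - g) * neutral

def scores_with_null(query, candidates, null_logit):
    # Per-candidate scores plus an explicit NULL option whose logit is
    # learned, letting retrieval abstain when no candidate aligns well.
    s = candidates @ query
    return np.concatenate([s, [null_logit]])
```

In a full model the gate logit would itself be produced per candidate by an attention module over the modalities; here it is taken as a given scalar to keep the sketch self-contained.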