🤖 AI Summary
This work addresses the challenge of learning unified cross-modal representations in image–text contrastive pretraining, where the two modalities often remain poorly aligned. The authors propose a lightweight fusion and multi-level alignment mechanism that operates only during training: it mines fine-grained image–text correspondences to enrich alignment supervision, and it adds a structured interaction module that mitigates the early saturation common in contrastive learning and improves training stability. Because this module is removed at inference time, the efficiency of the dual-encoder architecture is preserved. Experiments show that the proposed approach significantly outperforms strong baselines on image–text retrieval, classification, and multimodal benchmarks, effectively bridging the modality gap while maintaining both discriminative representation quality and inference efficiency.
📝 Abstract
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework that addresses this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer: it eliminates the modality gap and stabilizes training dynamics, preventing the early saturation often observed in aggressive contrastive learning.
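To make the "training-time fusion, inference-time dual encoder" idea concrete, here is a minimal NumPy sketch. It is an illustration only, not the paper's actual architecture: the linear encoders, the `fusion_score` head, and all weight names (`W_img`, `W_txt`, `W_fuse`) are hypothetical stand-ins. The key point it demonstrates is structural: the fusion head scores concatenated image-text pairs during training, while retrieval at inference uses only the two independent towers.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy linear encoder tower followed by L2 normalization."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def contrastive_logits(img_z, txt_z, temperature=0.07):
    """Standard dual-encoder similarity matrix used by the contrastive loss."""
    return img_z @ txt_z.T / temperature

def fusion_score(img_z, txt_z, W_fuse):
    """Training-only fusion head: scores concatenated (image, text) pairs.
    This head is discarded at inference, so deployment cost is unchanged."""
    pair = np.concatenate([img_z, txt_z], axis=1)
    return (pair @ W_fuse).squeeze(-1)

# Toy batch: 4 pairs, 8-dim inputs, 5-dim embeddings (all sizes arbitrary).
B, d_in, d = 4, 8, 5
W_img = rng.normal(size=(d_in, d))
W_txt = rng.normal(size=(d_in, d))
W_fuse = rng.normal(size=(2 * d, 1))

images = rng.normal(size=(B, d_in))
texts = rng.normal(size=(B, d_in))

img_z = encode(images, W_img)
txt_z = encode(texts, W_txt)

# Training: contrastive logits over the batch plus fusion scores on pairs.
logits = contrastive_logits(img_z, txt_z)   # (B, B)
fused = fusion_score(img_z, txt_z, W_fuse)  # (B,)

# Inference: the fusion head is dropped; retrieval uses cosine similarity alone.
retrieval = img_z @ txt_z.T                 # (B, B)
```

In a real system the encoders would be deep networks and the fusion head would feed an auxiliary loss alongside the contrastive objective; the sketch only shows where each component sits relative to the train/inference boundary.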