ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning unified cross-modal representations in image–text contrastive pretraining, where modality disconnection often hinders effective alignment. To overcome this limitation, the authors propose a lightweight fusion and multi-level alignment mechanism that operates during training: it leverages fine-grained image–text correspondences to enhance alignment and introduces a structured interaction module to mitigate early saturation in contrastive learning and improve training stability. Notably, this module is removed at inference time, preserving the efficiency of the dual-encoder architecture. Experimental results demonstrate that the proposed approach significantly outperforms strong baselines across image–text retrieval, classification, and multimodal benchmark tasks, effectively bridging the modality gap while maintaining both discriminative representation quality and inference efficiency.

📝 Abstract
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
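The recipe the abstract describes — dual image/text encoders trained with a symmetric contrastive objective, plus an auxiliary interaction term that exists only at training time and is dropped at inference — can be sketched roughly as follows. ITO's actual losses and fusion architecture are not specified on this page, so everything below (function names, the toy modality-gap penalty standing in for the fusion module) is an illustrative assumption, not the authors' method.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere (standard in CLIP-style models)."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_infonce(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss: matched pairs sit on the
    diagonal of the batch similarity matrix and are pulled together in
    both the image->text and text->image directions."""
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img @ txt.T / temperature            # (B, B) cosine similarities
    diag = np.arange(logits.shape[0])

    def xent(l):                                  # cross-entropy toward the diagonal
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

def training_loss(img_emb, txt_emb, fusion_weight=0.1, training=True):
    """Total objective: contrastive alignment plus a toy training-time term.
    Here the 'fusion' term is just a penalty on the distance between the
    modality centroids -- a stand-in for structured cross-modal interaction,
    NOT ITO's actual module. At inference the extra term vanishes, leaving
    plain dual encoders, which is the efficiency property the paper claims."""
    loss = symmetric_infonce(img_emb, txt_emb)
    if training:
        gap = np.linalg.norm(
            l2_normalize(img_emb).mean(axis=0) - l2_normalize(txt_emb).mean(axis=0))
        loss = loss + fusion_weight * gap         # shrinks the modality gap
    return loss
```

The key structural point this sketch illustrates is that the extra term only shapes the embedding geometry during optimization; retrieval or classification at test time uses nothing beyond the two encoders.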
Problem

Research questions and friction points this paper is trying to address.

image-text contrastive pretraining
modality gap
cross-modal representation
visual representation learning
multimodal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal multiple alignment
training-time fusion
modality gap elimination
contrastive pretraining
dual-encoder architecture
Hanzpeng Liu
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Yaqian Li
Li Auto
computer vision
Zidan Wang
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Shuoxi Zhang
Institute of AI for Industries, Chinese Academy of Sciences
Zonglin Zhao
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Zihao Bo
Li Auto Inc.
Rinyoichi Takezoe
Li Auto Inc.
Kaiwen Long
Li Auto Inc.
Kun He
Professor, Huazhong University of Science and Technology
AI Security, Graph data mining, Optimization, Deep learning, AI4Sci