🤖 AI Summary
This work addresses the heavy reliance of multimodal encoders (e.g., CLIP) on massive image–text paired data, which limits their use in low-resource settings. To this end, we propose canonical similarity analysis (CSA), a method that maps the features of two pre-trained unimodal encoders into a shared multimodal space without training a new encoder. CSA requires only the unimodal encoders (e.g., ViT, BERT) and a small set of cross-modal pairs, roughly 1/50,000 the number of multimodal pairs used to train CLIP. It combines a new similarity score, which retains only the multimodal information, with a cubic-complexity matrix decomposition, so it needs only unimodal encoder inference and avoids end-to-end fine-tuning and GPU-intensive training. On ImageNet classification and misinformative news caption detection, CSA outperforms CLIP and the state-of-the-art method for mapping unimodal features to multimodal features. It is also demonstrated on modalities beyond image and text, which suggests it can extend to pairs such as lidar and text, where paired data is scarce but unpaired unimodal data is abundant.
📝 Abstract
Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that, given pre-trained unimodal encoders, CSA outperforms CLIP on ImageNet classification and misinformative news caption detection while requiring $50,000\times$ fewer multimodal data pairs to bridge the modalities. CSA also surpasses the state-of-the-art method for mapping unimodal features to multimodal features. Finally, we demonstrate CSA on modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.
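
The abstract does not spell out the decomposition or the similarity score, so the following is only a rough sketch of the general idea, assuming the mapping resembles classical linear canonical correlation analysis (CCA) solved with an SVD. The function names (`fit_cca`, `project`, `weighted_cosine`), the regularization term `reg`, and the correlation-weighted cosine score are all illustrative assumptions, not the authors' definitions. The inputs `x` and `y` stand in for features already extracted by two frozen unimodal encoders.

```python
import numpy as np


def _inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs / np.sqrt(vals)) @ vecs.T


def fit_cca(x, y, k, reg=1e-4):
    """Fit linear CCA on paired features x (n, dx) and y (n, dy).

    Assumption: classical CCA via whitening plus SVD, with a small ridge
    term `reg` for numerical stability. The eigendecompositions and the SVD
    are the cubic-cost steps; no gradient-based training is involved.
    """
    n = x.shape[0]
    mean_x, mean_y = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mean_x, y - mean_y
    cxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / (n - 1)
    wx, wy = _inv_sqrt(cxx), _inv_sqrt(cyy)
    # Singular values of the whitened cross-covariance are the canonical
    # correlations, returned by NumPy in descending order.
    u, rho, vt = np.linalg.svd(wx @ cxy @ wy, full_matrices=False)
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "a": wx @ u[:, :k],     # projection for modality x
        "b": wy @ vt.T[:, :k],  # projection for modality y
        "rho": rho[:k],         # top-k canonical correlations
    }


def project(feats, mean, proj):
    """Map unimodal features into the shared k-dimensional space."""
    return (feats - mean) @ proj


def weighted_cosine(zx, zy, rho):
    """Cosine similarity with each shared dimension weighted by its
    canonical correlation (an illustrative choice, not the paper's score)."""
    zx_n = zx / np.linalg.norm(zx, axis=1, keepdims=True)
    zy_n = zy / np.linalg.norm(zy, axis=1, keepdims=True)
    return (zx_n * rho) @ zy_n.T
```

In this sketch, a small paired set is used once to call `fit_cca`; afterwards, any new sample from either modality can be passed through `project` and compared with `weighted_cosine`, for example to rank candidate class-name embeddings against an image embedding.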