🤖 AI Summary
This work proposes the first fully Mamba-based contrastive vision-language pretraining model to address key limitations of conventional CLIP architectures, which rely on Vision Transformers and suffer from sensitivity to spurious correlations, high computational costs, and fixed context constraints. The proposed approach employs VMamba as the visual encoder and an autoregressive Mamba as the text encoder, eliminating positional encoding interpolation and enabling variable-resolution inputs. This design overcomes context length limitations while enhancing cross-modal alignment and out-of-distribution robustness. Experimental results demonstrate that the model surpasses CLIP-ViT-B by 7.5% on ImageNet-O, achieves a 6.6% improvement in retrieval accuracy at 16× training resolution, and reduces memory consumption by 5× and FLOPs by 1.8×.
📝 Abstract
Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers, whose attention mechanism is susceptible to spurious correlations and scales quadratically with input resolution. To address these limitations, we present CLIMP, the first fully Mamba-based contrastive vision-language model, which replaces both the vision and text encoders with Mamba. The new architecture encodes sequential structure in both vision and language: VMamba captures visual spatial inductive biases, reducing reliance on spurious correlations and producing an embedding space favorable for cross-modal retrieval and out-of-distribution robustness, surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O. CLIMP naturally supports variable input resolutions without positional encoding interpolation or specialized training, achieving up to 6.6% higher retrieval accuracy at 16× the training resolution while using 5× less memory and 1.8× fewer FLOPs. The autoregressive text encoder further overcomes CLIP's fixed context-length limitation, enabling dense captioning retrieval. Our findings suggest that Mamba exhibits advantageous properties for vision-language learning, making it a compelling alternative to Transformer-based CLIP.
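Whatever the encoder backbone (ViT in CLIP, Mamba in CLIMP), the contrastive pretraining objective pairs each image embedding with its matching text embedding in a batch. The sketch below illustrates that standard CLIP-style symmetric InfoNCE loss in plain Python; the Mamba encoders themselves are elided, so `img_emb` and `txt_emb` stand in for their L2-normalized outputs, and all names and the temperature value are illustrative, not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over image->text and text->image logits.
    Matched image/text pairs share the same index in both lists."""
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    n = len(img)
    # Cosine-similarity logit matrix, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image->text direction: row i's correct "class" is column i.
    loss_i2t = -sum(math.log(softmax(logits[i])[i]) for i in range(n)) / n
    # Text->image direction: same, on the transposed logit matrix.
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(math.log(softmax(logits_t[j])[j]) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: two matched image/text embedding pairs, nearly aligned.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
print(clip_contrastive_loss(imgs, txts))  # near zero: pairs are aligned
```

Because the loss only consumes the two embedding lists, swapping ViT encoders for Mamba encoders leaves this objective unchanged; the paper's claimed gains come from what the Mamba backbones put into the embedding space, not from a different loss.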