🤖 AI Summary
CLIP-like models rely solely on a global contrastive loss, which limits their ability to capture token-level, fine-grained semantics, especially under long-caption scenarios, thereby degrading image–text alignment. To address this, the paper proposes SuperCLIP: a lightweight linear classification head is appended to the vision encoder's output and supervised by the caption's text tokens, enabling token-level alignment without any extra annotations. This design adds only 0.077% additional FLOPs, preserves CLIP's original architecture, and mitigates the performance drop CLIP suffers with small batches. SuperCLIP consistently outperforms CLIP across zero-shot classification, cross-modal retrieval, and purely visual tasks, and it remains robust whether trained on original web-scale data or on re-captioned benchmarks. To the authors' knowledge, this is the first work to empirically validate that lightweight classification supervision significantly enhances the fine-grained representational capability of contrastive learning, without requiring extra annotated data or an architectural overhaul.
📝 Abstract
Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision, limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment, with just a 0.077% increase in total FLOPs and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or richly re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP's small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.
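To make the mechanism concrete, below is a minimal NumPy sketch of the kind of classification-based supervision the abstract describes: a linear head maps the vision encoder's patch features to vocabulary logits, and the caption's token IDs serve as multi-hot labels. The max-pooling over patches and the sigmoid multi-label loss are assumptions for illustration, not the paper's exact recipe; all names and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): patch count, embed dim, vocab size.
num_patches, dim, vocab = 49, 64, 1000

# Stand-in for the vision encoder's per-patch output embeddings.
patch_feats = rng.normal(size=(num_patches, dim))

# The only added parameters: one linear layer, dim -> vocab logits.
W = rng.normal(scale=0.02, size=(dim, vocab))
b = np.zeros(vocab)

def token_classification_loss(feats, token_ids):
    """Multi-label loss: does each caption token occur in the image?

    Patch logits are max-pooled so that any single patch can account
    for a token (an assumed pooling choice, for illustration only).
    """
    logits = feats @ W + b                # (num_patches, vocab)
    pooled = logits.max(axis=0)           # (vocab,)
    target = np.zeros(vocab)
    target[token_ids] = 1.0               # multi-hot caption token labels
    # Numerically stable sigmoid binary cross-entropy, averaged over vocab.
    loss = (np.maximum(pooled, 0) - pooled * target
            + np.log1p(np.exp(-np.abs(pooled))))
    return loss.mean()

caption_tokens = np.array([3, 17, 256])   # token IDs from a tokenized caption
loss = token_classification_loss(patch_feats, caption_tokens)
```

Because each image-caption pair yields its own classification targets, this term does not depend on in-batch negatives, which is consistent with the abstract's claim that the extra supervision reduces reliance on large batch sizes.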