🤖 AI Summary
Natural image-text data suffer from loose semantic alignment due to weak supervision, whereas medical data exhibit tight alignment but limited diversity; both properties impede the robustness and generalization of CLIP models. To address this, we propose CLIPin, a plug-and-play non-contrastive plug-in framework that enhances semantic alignment without modifying the backbone architecture or introducing significant parameter overhead. CLIPin integrates contrastive and non-contrastive objectives through shared pre-projectors for the image and text modalities and a unified non-contrastive learning module, and it is compatible with CLIP-style variants without retraining the visual or language encoders. Extensive experiments across diverse downstream tasks, including cross-domain retrieval, zero-shot classification, and medical image-text matching, demonstrate consistent and significant performance gains, validating CLIPin's generality, effectiveness, and practicality.
📝 Abstract
Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for the image and text modalities, respectively, to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.
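The combined objective described above can be sketched in a few lines. This is an illustrative numpy sketch, not the authors' implementation: the shared pre-projectors are modeled as plain linear maps, the loss weight `lam` and temperature `tau` are assumed hyperparameters, and the stop-gradient that a non-contrastive branch would use in practice is only noted in a comment.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """Row-wise L2 normalization."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE, the standard CLIP contrastive objective."""
    logits = (l2norm(img) @ l2norm(txt).T) / tau

    def ce_diag(lg):
        # cross-entropy with the matched pairs on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))

def non_contrastive_loss(img, txt):
    """Negative cosine similarity between paired embeddings
    (SimSiam/BYOL-style; in a real training framework the target
    branch would sit behind a stop-gradient)."""
    return -np.mean(np.sum(l2norm(img) * l2norm(txt), axis=1))

def clipin_loss(img_feat, txt_feat, W_img, W_txt, lam=0.5, tau=0.07):
    """Shared pre-projectors feed both the contrastive and the
    non-contrastive branch; lam weights the two terms (assumed)."""
    z_img = img_feat @ W_img  # shared pre-projector, image modality
    z_txt = txt_feat @ W_txt  # shared pre-projector, text modality
    return info_nce(z_img, z_txt, tau) + lam * non_contrastive_loss(z_img, z_txt)

# tiny demo with random stand-ins for encoder outputs
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 32))   # batch of image features
txt = rng.normal(size=(8, 32))   # batch of paired text features
W_i = rng.normal(size=(32, 16))
W_t = rng.normal(size=(32, 16))
print(clipin_loss(img, txt, W_i, W_t))
```

Sharing the pre-projectors across both objectives is what keeps the parameter overhead small: the backbone encoders and the projection weights are reused, and only the loss computation gains a second branch.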