🤖 AI Summary
Existing CLIP-based zero-shot classification methods rely on fine-tuning, suffering from poor generalization and high computational overhead. This paper introduces a training-free collaborative paradigm integrating Large Multimodal Models (LMMs) and CLIP: in the first stage, a pre-trained LMM (e.g., Gemini) generates fine-grained object descriptions for input images; in the second stage, CLIP's text encoder maps these descriptions into the category semantic space, enabling zero-shot classification via similarity matching. The approach entirely eliminates prompt engineering, adapter modules, and any parameter updates, and significantly enhances cross-dataset and cross-domain robustness. Evaluated on 13 benchmarks, it achieves a mean accuracy of 83.6%, outperforming the strongest training-free baseline by 9.7%. On domain-shifted datasets—including ImageNet-V2, ImageNet-R, and ImageNet-S—it improves performance by 3.6%–16.96%.
📝 Abstract
Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques like prompt learning and adapter-based tuning to optimize CLIP's performance. The necessity for fine-tuning significantly limits CLIP's adaptability to novel datasets and domains, demanding substantial time and computational resources for each new dataset. To overcome this limitation, we introduce simple yet effective training-free approaches, Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC), that leverage powerful Large Multimodal Models (LMMs), such as Gemini, for image classification. The proposed methods exploit the capabilities of pre-trained LMMs, allowing seamless adaptation to diverse datasets and domains without additional training. Our approaches involve prompting the LMM to identify objects within an image. Subsequently, the CLIP text encoder determines the image class by identifying the dataset class with the highest semantic similarity to the LMM-predicted object. We evaluated our models on 11 base-to-novel datasets and they achieved superior accuracy on 9 of these, including benchmarks like ImageNet, SUN397, and Caltech101, while maintaining a strictly training-free paradigm. Our overall accuracy of 83.44% surpasses the previous state-of-the-art few-shot methods by a margin of 6.75%. Our method achieved 83.6% average accuracy across 13 datasets, a 9.7% improvement over the previous 73.9% state-of-the-art for training-free approaches. Our method also improves domain generalization, with a 3.6% gain on ImageNetV2, 16.96% on ImageNet-S, and 12.59% on ImageNet-R, over prior few-shot methods.
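The two-stage pipeline described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: stage 1 (the LMM call to Gemini) is represented only by its free-form text output, and CLIP's text encoder is stubbed with a simple bag-of-words embedding so the example runs self-contained; in practice both the `embed` function and the example class list would be replaced by the real CLIP text encoder and the target dataset's label set.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for CLIP's text encoder (hypothetical): a bag-of-words
    # vector. The real method would use CLIP text embeddings here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(lmm_prediction: str, class_names: list[str]) -> str:
    # Stage 2: map the LMM's free-form object description onto the
    # dataset's label set by maximum semantic similarity.
    pred_vec = embed(lmm_prediction)
    return max(class_names, key=lambda c: cosine(pred_vec, embed(c)))

# Stage 1 (not shown): prompt the LMM, e.g. Gemini, with the image to
# obtain an object description such as the string below.
classes = ["golden retriever", "tabby cat", "sports car"]
print(classify("a golden retriever dog sitting on grass", classes))
# → golden retriever
```

Because classification reduces to text-to-text similarity against the class names, no prompt templates, adapters, or parameter updates are needed: swapping in a new dataset only changes `class_names`.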