🤖 AI Summary
Existing CLIP-based zero-shot classification methods rely on fine-tuning, suffering from poor generalization and high computational overhead. This paper introduces a training-free collaborative paradigm integrating Large Multimodal Models (LMMs) and CLIP: in the first stage, a pre-trained LMM (e.g., Gemini) generates fine-grained object descriptions for input images; in the second stage, CLIP's text encoder maps these descriptions into the category semantic space, enabling zero-shot classification via similarity matching. The approach entirely eliminates prompt engineering, adapter modules, and any parameter updates, and significantly enhances cross-dataset and cross-domain robustness. Evaluated on 13 benchmarks, it achieves a mean accuracy of 83.6%, outperforming the strongest training-free baseline by 9.7%. On domain-shifted datasets—including ImageNet-V2, ImageNet-R, and ImageNet-S—it improves performance by 3.6%–16.96%.
📝 Abstract
Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques like prompt learning and adapter-based tuning to optimize CLIP's performance. The necessity for fine-tuning significantly limits CLIP's adaptability to novel datasets and domains, demanding substantial time and computational resources for each new dataset. To overcome this limitation, we introduce simple yet effective training-free approaches, Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC), that leverage powerful Large Multimodal Models (LMMs), such as Gemini, for image classification. The proposed methods exploit the capabilities of pre-trained LMMs, allowing seamless adaptation to diverse datasets and domains without additional training. Our approaches involve prompting the LMM to identify objects within an image. Subsequently, the CLIP text encoder determines the image class by identifying the dataset class with the highest semantic similarity to the LMM-predicted object. We evaluated our models on 11 base-to-novel datasets and they achieved superior accuracy on 9 of these, including benchmarks like ImageNet, SUN397, and Caltech101, while maintaining a strictly training-free paradigm. Our overall accuracy of 83.44% surpasses the previous state-of-the-art few-shot methods by a margin of 6.75%. Our method achieved 83.6% average accuracy across 13 datasets, a 9.7% improvement over the previous 73.9% state-of-the-art for training-free approaches. Our method also improves domain generalization, with a 3.6% gain on ImageNetV2, 16.96% on ImageNet-S, and 12.59% on ImageNet-R, over prior few-shot methods.
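The two-stage pipeline described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: stage 1 (the LMM call to Gemini) is represented only by its free-form text output, and CLIP's text encoder is stubbed with a simple bag-of-words embedding so the example runs self-contained; in practice both the `embed` function and the example class list would be replaced by the real CLIP text encoder and the target dataset's label set.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for CLIP's text encoder (hypothetical): a bag-of-words
    # vector. The real method would use CLIP text embeddings here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(lmm_prediction: str, class_names: list[str]) -> str:
    # Stage 2: map the LMM's free-form object description onto the
    # dataset's label set by maximum semantic similarity.
    pred_vec = embed(lmm_prediction)
    return max(class_names, key=lambda c: cosine(pred_vec, embed(c)))

# Stage 1 (not shown): prompt the LMM, e.g. Gemini, with the image to
# obtain an object description such as the string below.
classes = ["golden retriever", "tabby cat", "sports car"]
print(classify("a golden retriever dog sitting on grass", classes))
# → golden retriever
```

Because classification reduces to text-to-text similarity against the class names, no prompt templates, adapters, or parameter updates are needed: swapping in a new dataset only changes `class_names`.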