🤖 AI Summary
To address insufficient vision-language modality alignment in zero-shot image classification, this paper proposes a collaborative iterative transductive learning framework. Methodologically: (1) it introduces large language model (LLM)-supervised guidance into transductive vision-language model (VLM) training for language-driven attribute discovery; (2) it establishes an attribute-space incremental generation mechanism with bidirectional feedback adaptation; and (3) it integrates attribute-augmented transductive inference, joint fine-tuning of cross-modal encoders, and iterative label distillation. Evaluated across 12 zero-shot benchmarks and 3 encoder backbones, the method achieves an average accuracy gain of +8.6% over CLIP and +3.7% over transductive CLIP; it also demonstrates strong few-shot generalization. Ablation studies confirm the contribution of each component. The core contribution lies in establishing a language-model-driven, attribute-evolvable paradigm for cross-modal collaborative optimization.
📝 Abstract
Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP yields average performance improvements of 8.6% and 3.7% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively, in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by transductive learning.
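The three-step loop in the abstract can be sketched in miniature. The sketch below is purely illustrative and is not the authors' implementation: it replaces CLIP features with random unit vectors, the LLM attribute query with a perturbation of the class embedding, and encoder fine-tuning with a centroid-pull update on the image features. All function names (`query_llm_for_attributes`, `transductive_inference`, `finetune_step`) and shapes are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for encoder outputs; real GTA-CLIP uses CLIP image/text features.
N, C, D = 60, 3, 16  # images, classes, embedding dimension
img_feats = rng.normal(size=(N, D))
img_feats /= np.linalg.norm(img_feats, axis=1, keepdims=True)
class_feats = rng.normal(size=(C, D))
class_feats /= np.linalg.norm(class_feats, axis=1, keepdims=True)

def query_llm_for_attributes(class_idx):
    """Step (i) placeholder: an LLM would return new attribute phrases, which a
    text encoder would embed. Here we perturb the class embedding instead."""
    attr = class_feats[class_idx] + 0.1 * rng.normal(size=D)
    return attr / np.linalg.norm(attr)

def transductive_inference(img_feats, attr_bank):
    """Step (ii) sketch: score each image against every attribute of a class,
    keep the best-matching attribute, and assign the highest-scoring class."""
    scores = np.stack(
        [np.max(img_feats @ np.stack(attrs).T, axis=1) for attrs in attr_bank],
        axis=1,
    )  # shape (N, C)
    return scores.argmax(axis=1)

def finetune_step(img_feats, labels, lr=0.05):
    """Step (iii) stand-in: pull each image toward its pseudo-label centroid,
    a crude proxy for fine-tuning the encoders on inferred labels."""
    for c in range(C):
        mask = labels == c
        if mask.any():
            centroid = img_feats[mask].mean(axis=0)
            img_feats[mask] += lr * (centroid - img_feats[mask])
    return img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)

attr_bank = [[class_feats[c]] for c in range(C)]  # start from class-name features
for _ in range(3):  # iterative loop over the three steps
    for c in range(C):  # (i) incrementally grow the attribute space
        attr_bank[c].append(query_llm_for_attributes(c))
    labels = transductive_inference(img_feats, attr_bank)  # (ii)
    img_feats = finetune_step(img_feats, labels)           # (iii)

print(labels.shape)  # one pseudo-label per image
```

The key structural point the sketch preserves is the interleaving: the attribute bank, the pseudo-labels, and the (here, simulated) encoder state all evolve together across iterations rather than in a single pass.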