🤖 AI Summary
Zero-shot generative model adaptation (ZSGM) aims to transfer a pre-trained generator to a target domain without any target-domain samples, using only textual prompts. However, existing methods enforce strict image-text offset alignment in the CLIP embedding space, ignoring the heterogeneous, concept-dependent nature of these semantic offsets, which degrades generation quality. This work is the first to empirically identify a strong correlation between the magnitude of CLIP image-text offset misalignment and the semantic distance between concepts. Building on this insight, we propose the first generation-quality-centric iterative refinement framework for ZSGM. Our approach introduces a concept-distance-aware dynamic alignment strategy and replaces rigid full alignment with a progressive image-offset calibration mechanism. Evaluated across 26 benchmark settings, our method consistently outperforms state-of-the-art approaches, achieving significant improvements in quantitative metrics, visual fidelity, and human evaluation, and enabling more precise and natural zero-shot domain transfer.
📝 Abstract
Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance, without any samples from the target domain. Central to recent ZSGM approaches is a directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model such as CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets of the two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between the image offset and the text offset in the CLIP embedding space, resulting in degraded quality of the generated images. Our work makes two main contributions. Inspired by offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offsets and image offsets in the CLIP embedding space across several large publicly available datasets. Our key finding is that offset misalignment in the CLIP embedding space is correlated with concept distance, i.e., closer concepts exhibit less offset misalignment. To address the limitations of current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), the first ZSGM approach to focus on improving target-domain image quality based on our new insight on offset misalignment. Qualitative results, quantitative results, and a user study across 26 experimental setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. Additional experiments are provided in the Supplementary.
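The directional loss the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses random NumPy vectors as stand-ins for CLIP image/text embeddings (a real setup would use frozen CLIP encoders on source/target images and prompts), and the function name `directional_loss` is our own.

```python
import numpy as np

def directional_loss(src_img_emb, tgt_img_emb, src_txt_emb, tgt_txt_emb):
    """Offset-alignment (directional) loss: 1 minus the cosine similarity
    between the image offset (target image - source image) and the text
    offset (target prompt - source prompt) in a shared embedding space.
    The loss is 0 when the two offsets point in the same direction."""
    img_offset = tgt_img_emb - src_img_emb
    txt_offset = tgt_txt_emb - src_txt_emb
    cos = np.dot(img_offset, txt_offset) / (
        np.linalg.norm(img_offset) * np.linalg.norm(txt_offset) + 1e-8
    )
    return 1.0 - cos

# Toy example with random vectors standing in for CLIP features.
rng = np.random.default_rng(0)
e_src_img = rng.normal(size=512)  # hypothetical CLIP embedding of a source image
e_src_txt = rng.normal(size=512)  # hypothetical CLIP embedding of a source prompt
offset = rng.normal(size=512)     # a shared semantic offset

# When the image offset exactly matches the text offset, the loss is ~0;
# existing ZSGM methods minimize this quantity, implicitly assuming such
# perfect alignment is attainable -- the assumption the paper challenges.
print(directional_loss(e_src_img, e_src_img + offset,
                       e_src_txt, e_src_txt + offset))
```

The paper's observation is that this loss cannot actually reach zero in CLIP space, and that the residual misalignment grows with the semantic distance between the source and target concepts, which is what motivates AIR's iterative refinement instead of forcing full alignment.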