🤖 AI Summary
This paper asks whether high-fidelity text-to-image generation is possible using only a frozen CLIP model, with no decoder, training, or fine-tuning. The authors propose a purely discriminative, zero-parameter-update synthesis method that optimizes a frequency-aware implicit neural representation, stratifying frequencies across network layers to encourage coarse-to-fine generation. The resulting inverse mapping is stabilized by a hybrid objective combining adversarially robust initialization, a lightweight Orthogonal Procrustes projection that aligns local text and image embeddings, and a blending loss anchoring outputs to natural image statistics. Without modifying CLIP's weights, the framework supports text-guided synthesis, style transfer, and image reconstruction. These results provide empirical evidence that large pre-trained vision-language models, though trained discriminatively, harbor untapped generative capacity.
📝 Abstract
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
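The Orthogonal Procrustes projection mentioned above has a simple closed form: given paired text and image embedding matrices, the orthogonal map minimizing the Frobenius alignment error is recovered from an SVD of their cross-covariance. Below is a minimal NumPy sketch of that step in isolation; the toy random matrices stand in for CLIP features, and the `procrustes_align` name and dimensions are illustrative, not from the paper.

```python
import numpy as np

def procrustes_align(text_emb, img_emb):
    """Return orthogonal W minimizing ||text_emb @ W - img_emb||_F.

    Closed-form Orthogonal Procrustes solution: SVD the cross-covariance
    M = text_emb^T img_emb = U S V^T, then W = U V^T.
    """
    M = text_emb.T @ img_emb
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(0)
# Hypothetical stand-ins for 8 paired CLIP embeddings of dimension 16.
img = rng.standard_normal((8, 16))
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # a random rotation
text = img @ Q.T + 0.01 * rng.standard_normal((8, 16))  # rotated + noise

W = procrustes_align(text, img)
err_before = np.linalg.norm(text - img)
err_after = np.linalg.norm(text @ W - img)  # small: W recovers the rotation
```

Because the solution is a single SVD, the projection adds negligible cost per optimization step, consistent with the abstract's description of it as lightweight.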