Implicit Inversion turns CLIP into a Decoder

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper asks whether high-fidelity text-to-image generation can be achieved with a frozen CLIP model alone, without any decoder, training, or fine-tuning. The authors propose a zero-parameter-update synthesis method that optimizes a frequency-aware implicit neural representation, stabilizing the inverse mapping with a hybrid of adversarially robust initialization, an Orthogonal Procrustes alignment between local text and image embeddings, and a blending loss that enforces natural image priors. In doing so, the work surfaces a latent generative capacity in CLIP, a model usually treated as purely discriminative. Experiments show strong performance on text-guided synthesis, style transfer, and image reconstruction, all without modifying CLIP's parameters, offering new evidence of untapped generative potential in large pre-trained vision-language models.


📝 Abstract
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
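The core loop the abstract describes is gradient-based inversion: treat the frozen encoder as fixed and optimize the pixels (or an implicit representation) so their embedding matches a target text embedding. The sketch below illustrates only that loop; the real method uses CLIP's ViT encoder, whereas here a fixed random linear map stands in for it so the example is self-contained, and `encode`, `target`, and all dimensions are stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen CLIP image encoder: a fixed random linear map.
# (The real method backpropagates through the actual frozen CLIP ViT.)
D_PIX, D_EMB = 64, 16
W = rng.standard_normal((D_EMB, D_PIX)) / np.sqrt(D_PIX)

def encode(x):
    """Frozen 'encoder': linear map followed by L2 normalization."""
    z = W @ x
    return z / np.linalg.norm(z)

# Stand-in for the target text embedding.
target = rng.standard_normal(D_EMB)
target /= np.linalg.norm(target)

# Optimize the "image" by gradient ascent on cosine similarity,
# keeping the encoder weights W untouched throughout.
x = rng.standard_normal(D_PIX)
lr = 0.5
for _ in range(1000):
    z = W @ x
    nz = np.linalg.norm(z)
    cos = (z / nz) @ target
    # d(cos)/dz = (target - cos * z/|z|) / |z|, chained through W
    grad_z = (target - cos * z / nz) / nz
    x += lr * (W.T @ grad_z)

print(round(float(encode(x) @ target), 3))  # cosine similarity, near 1.0
```

The encoder's weights never change; only the input is optimized, which is the sense in which the paper's pipeline is "zero-parameter-update".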
Problem

Research questions and friction points this paper is trying to address.

Can images be synthesized from CLIP alone, without any trained decoder?
How to structure an implicit neural representation so that generation proceeds coarse-to-fine?
How to stabilize the ill-posed inverse mapping from embedding space back to pixels?
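One of the stabilizers named above, the Orthogonal Procrustes projection, has a well-known closed-form solution: given paired embedding matrices X and Y, the orthogonal R minimizing ||XR − Y||_F is UVᵀ from the SVD of XᵀY. The sketch below demonstrates that solution on synthetic data; in the paper's setting X and Y would be local text and image embeddings from frozen CLIP, while here they are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy paired embeddings: Y is a rotated, slightly noisy copy of X.
d, n = 8, 50
X = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # ground-truth rotation
Y = X @ Q + 0.01 * rng.standard_normal((n, d))

# Orthogonal Procrustes: R = argmin over orthogonal R of ||X R - Y||_F,
# solved in closed form via the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
R = U @ Vt

residual = np.linalg.norm(X @ R - Y) / np.linalg.norm(Y)
print(round(float(residual), 3))  # small: only the injected noise remains
```

Because R is constrained to be orthogonal, the alignment is a pure rotation/reflection and cannot distort the geometry of either embedding space, which is what makes it a lightweight, stability-friendly projection.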
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-aware implicit neural representation stratifying frequencies across layers
Adversarially robust initialization of the inverse mapping
Orthogonal Procrustes projection aligning local text and image embeddings
Blending loss anchoring outputs to natural image statistics
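The frequency-aware representation listed above can be pictured as a coordinate MLP in which each layer is fed Fourier features at a progressively higher frequency band, so early layers can only express coarse structure and later layers add detail. The sketch below is a hypothetical forward pass in that spirit; the layer layout, widths, and frequency schedule are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def fourier_feats(coords, freq):
    """sin/cos features of 2D coordinates at a single frequency band."""
    proj = coords * freq * np.pi
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Coordinates of a small image grid in [-1, 1]^2.
side = 16
ys, xs = np.meshgrid(np.linspace(-1, 1, side), np.linspace(-1, 1, side),
                     indexing="ij")
coords = np.stack([ys, xs], axis=-1).reshape(-1, 2)

# Frequency-stratified MLP: layer k receives Fourier features at
# frequency 2**k alongside the previous hidden state, so frequencies
# are introduced coarse-to-fine across depth.
hidden, n_layers = 32, 4
h = np.zeros((coords.shape[0], hidden))
for k in range(n_layers):
    feats = fourier_feats(coords, 2.0 ** k)               # 4 features
    inp = np.concatenate([h, feats], axis=-1)
    Wk = rng.standard_normal((inp.shape[1], hidden)) / np.sqrt(inp.shape[1])
    h = np.tanh(inp @ Wk)

W_out = rng.standard_normal((hidden, 3)) / np.sqrt(hidden)
img = (h @ W_out).reshape(side, side, 3)                  # RGB "image"
print(img.shape)  # (16, 16, 3)
```

With random weights the output is noise; in the actual pipeline these weights would be the ones optimized against the CLIP-guided losses.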