🤖 AI Summary
Existing CLIP models operate in pixel space, so latent representations produced by latent diffusion models (LDMs) must be repeatedly VAE-decoded before they can be scored, incurring substantial computational overhead. To address this, the authors propose Latent-CLIP, a contrastive language–image model trained and deployed entirely within the VAE latent space. Trained on 2.7B pairs of latent images and descriptive texts, Latent-CLIP matches the zero-shot classification accuracy of similarly sized pixel-space CLIP models on both real and LDM-generated images. Used as a reward for reward-based noise optimization (ReNO), it matches the performance of its CLIP counterparts on GenEval and T2I-CompBench while cutting total pipeline cost by 21%, and it enables efficient, decoder-free filtering of harmful content on benchmarks such as I2P. By eliminating intermediate image decoding, Latent-CLIP improves inference efficiency while preserving generation quality and safety guidance.
📝 Abstract
Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in the latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational cost. However, while the diffusion process has moved to the latent space, the contrastive language–image pre-training (CLIP) models used in many image processing tasks still operate in pixel space. This requires costly VAE decoding of latent images before they can be processed. In this paper, we introduce Latent-CLIP, a CLIP model that operates directly in the latent space. We train Latent-CLIP on 2.7B pairs of latent images and descriptive texts, and show that it matches the zero-shot classification performance of similarly sized CLIP models on both the ImageNet benchmark and an LDM-generated version of it, demonstrating its effectiveness in assessing both real and generated content. Furthermore, we construct Latent-CLIP rewards for reward-based noise optimization (ReNO) and show that they match the performance of their CLIP counterparts on GenEval and T2I-CompBench while cutting the cost of the total pipeline by 21%. Finally, we use Latent-CLIP to guide generation away from harmful content, achieving strong performance on the inappropriate image prompts (I2P) benchmark and a custom evaluation, without ever requiring the costly step of decoding intermediate images.
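The core efficiency argument can be sketched as two scoring pipelines. This is a minimal illustrative sketch, not the paper's implementation: all functions (`vae_decode`, `pixel_clip_score`, `latent_clip_score`) are hypothetical stand-ins, and the point is only that the baseline pays one decoder call per scored image while a latent-space scorer pays none.

```python
# Hypothetical sketch of the two pipelines the abstract contrasts.
# Every name here is a stand-in for illustration, not the paper's API.

decode_calls = 0  # count how often the "costly" decoder is invoked

def vae_decode(latent):
    """Stand-in for the VAE decoder (latent -> pixel image)."""
    global decode_calls
    decode_calls += 1
    return [[v * 2.0 for v in row] for row in latent]  # dummy "image"

def pixel_clip_score(image, text):
    """Stand-in for a standard pixel-space CLIP similarity score."""
    return sum(sum(row) for row in image) * len(text)

def latent_clip_score(latent, text):
    """Stand-in for Latent-CLIP: scores the latent directly, no decode."""
    return sum(sum(row) for row in latent) * len(text)

latents = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(4)]  # toy latent batch
prompt = "a photo of a cat"

# Baseline: every latent must be decoded before CLIP can score it.
pixel_scores = [pixel_clip_score(vae_decode(z), prompt) for z in latents]
assert decode_calls == 4  # one decoder call per image

# Latent-CLIP: the same number of scores with zero decoder calls.
latent_scores = [latent_clip_score(z, prompt) for z in latents]
assert decode_calls == 4  # unchanged: no additional decoding
```

In a real reward-guided loop such as ReNO, this per-step decode is exactly the cost that scoring in latent space removes.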